Journal: Volume 27, No. 2, 2022
Pages: 43 – 52
DOI: https://doi.org/10.24025/2306-4412.2.2022.259408

Methods and means of intelligent analysis of text documents

Dmytro Yakymenko, Yevheniia Kataieva
Received 15.01.2022
Revised 31.05.2022
Accepted 20.06.2022

Abstract

The paper reviews methods for analyzing and processing electronic documents. It analyzes methods of text document analysis used to determine the thematic affinity of texts and surveys existing approaches to the classification problem. The main approaches used in text classification are described, the stages of the classification process are identified, and the most common methods of classifying text documents are considered. The main text pre-processing steps are examined: lowercasing, root correction, stemming, lemmatization, stop word removal, and normalization. The advantages and disadvantages of each approach are discussed. The procedure for reducing the dimensionality of a feature set is considered as two sub-processes: feature selection and feature extraction.
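As an illustration of three of the pre-processing steps named above (lowercasing, stop word removal, and stemming), a minimal Python sketch follows. It is not the authors' implementation: the stop-word list, the suffix rules, and the function names `naive_stem` and `preprocess` are assumptions made for this example, and a real system would more likely use an established stemmer or lemmatizer.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are"}

# Suffixes stripped by the toy stemmer, checked in this order (hypothetical rule set).
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token: str) -> str:
    """Strip at most one common suffix, keeping at least three characters of stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Lowercase the text, tokenize on letters, drop stop words, stem the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("The methods of classifying text documents"))
```

The suffix stripper is deliberately crude: it shows the idea of reducing word forms to a common stem, but it will over- or under-stem many words that a rule-based stemmer or a dictionary-based lemmatizer would handle correctly.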

Keywords

Suggested citation

Yakymenko, D., & Kataieva, Y. (2022). Methods and means of intelligent analysis of text documents. Bulletin of Cherkasy State Technological University, 27(2), 43-52. https://doi.org/10.24025/2306-4412.2.2022.259408