This article examines the architecture and tools of spaCy, a library written in the Python programming language, for processing text in natural language processing (NLP), one of the main areas of computational linguistics. A natural-language text consists of separate units (symbols) and can be divided into several interrelated parts belonging to different levels. Accordingly, the article presents methods for tokenizing text with the spaCy library tools and for using the lemma, POS, tag, dep, shape, alpha, and stop attributes produced by the processing pipeline.
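As a brief illustration of the token attributes named in the abstract, the following minimal sketch tokenizes one sentence and collects those attributes per token. The sample sentence, the helper names `load_pipeline` and `describe`, and the choice of the `en_core_web_sm` model are assumptions made for this example and are not taken from the article. If that model is not installed, the sketch falls back to a blank English pipeline, in which case only the tokenizer-level attributes (shape, alpha, stop) are filled and lemma, POS, tag, and dep remain empty.

```python
import spacy


def load_pipeline(name: str = "en_core_web_sm"):
    """Load a trained pipeline, or fall back to a tokenizer-only one."""
    try:
        return spacy.load(name)
    except OSError:
        # The model package is not installed. A blank pipeline still
        # tokenizes and sets lexical attributes (shape, alpha, stop),
        # but lemma, POS, tag and dep are left empty.
        return spacy.blank("en")


def describe(nlp, text: str):
    """Return one dict of attributes per token in the text."""
    doc = nlp(text)
    return [
        {
            "text": token.text,
            "lemma": token.lemma_,       # base form of the word
            "pos": token.pos_,           # coarse part-of-speech tag
            "tag": token.tag_,           # fine-grained part-of-speech tag
            "dep": token.dep_,           # syntactic dependency label
            "shape": token.shape_,       # orthographic shape, e.g. "Xxxxx"
            "is_alpha": token.is_alpha,  # alphabetic characters only?
            "is_stop": token.is_stop,    # member of the stop-word list?
        }
        for token in doc
    ]


nlp = load_pipeline()
rows = describe(nlp, "Apple is looking at buying U.K. startup for $1 billion")
for row in rows:
    print(row)
```

The lexical attributes depend only on the tokenizer and the language defaults, so they are the same with or without a trained model, whereas the values of lemma, POS, tag, and dep depend on which trained pipeline is loaded.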
No. | Author name | Position | Organization |
---|---|---|---|
1 | Elov B.B. | Head of the Department of Computer Linguistics and Digital Technologies, Associate Professor, Doctor of Philosophy (PhD) in Technical Sciences | Tashkent State University of Uzbek Language and Literature named after Alisher Navoi |