USING THE SPACY MODULE IN NATURAL LANGUAGE PROCESSING (NLP)

Elov Botir Boltayevich

doi:https://dx.doi.org/10.36522/2181-9637-2022-4-5


This article examines the architecture and tools of spaCy, a module written in the Python programming language, for processing text in natural language processing (NLP), one of the main areas of computational linguistics. A natural-language text consists of individual units (symbols) and can be divided into several interrelated parts belonging to different levels. Accordingly, the article presents methods for tokenizing text with the spaCy library tools and for using the lemma, POS, tag, dep, shape, alpha, and stop attributes produced by the pipeline process.

Journal name: ИЛМ-ФАН ВА ИННОВАЦИОН РИВОЖЛАНИШ
Issue: Илм-фан ва инновацион ривожланиш scientific journal, 2022, issue 4
Views: 743
Web link: https://ilm.mininnovation.uz/index.php/journal/article/view/313
DOI: https://dx.doi.org/10.36522/2181-9637-2022-4-5
Date created in UzSCI: 17-02-2023
Reads: 620
Publication date: 21-07-2022
Main language: Uzbek
Pages: 41-54

Keywords: token, Python, natural language processing, NLP, spaCy, part-of-speech, lemmatization, parser, pipeline architecture


Russian

This article examines natural language processing (NLP), one of the main areas of computational linguistics, using the tools of the spaCy module, which is written in Python. A natural-language text consists of individual units (symbols) and can be divided into several interrelated parts belonging to different levels. Accordingly, the article describes ways to tokenize text with the spaCy library tools and to use the lemma, POS, tag, dep, shape, alpha, and stop attributes generated by the pipeline process.

Keywords: parts of speech, lemmatization, Python, NLP, spaCy, natural language processing, tokenization, syntactic parser, pipeline architecture

English

This article discusses the architecture and tools of the spaCy module, which is written in the Python programming language, as applied to natural language processing (NLP), one of the main areas of computational linguistics. A text in a natural language consists of separate units (symbols) and can be divided into several interrelated parts belonging to different levels. Accordingly, the article presents methods for tokenizing text with the spaCy library tools and for using the lemma, POS, tag, dep, shape, alpha, and stop attributes generated by the pipeline process.

Keywords: token, Python, NLP, spaCy, part-of-speech, parser, natural language processing, lemmatization, pipeline architecture
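The tokenization step and some of the token attributes named in the abstract can be illustrated with a minimal sketch. It is not taken from the article: the sample sentence and the helper name `token_table` are hypothetical, and it assumes spaCy is installed. A blank English pipeline is used so that no trained model needs to be downloaded. As a result only the rule-based attributes (shape, alpha, stop) are populated; lemma, POS, tag, and dep would need a trained pipeline loaded with `spacy.load`.

```python
import spacy

# Assumption: a blank English pipeline (tokenizer only), so no model download
# is required. lemma_, pos_, tag_ and dep_ remain empty in this setup; they
# are filled in only by the components of a trained pipeline.
nlp = spacy.blank("en")


def token_table(text):
    """Hypothetical helper: one (text, shape, is_alpha, is_stop) row per token."""
    doc = nlp(text)  # tokenization happens when the text is passed to nlp
    return [(tok.text, tok.shape_, tok.is_alpha, tok.is_stop) for tok in doc]


rows = token_table("The cat sat on the mat.")
for row in rows:
    print(row)
```

Each row pairs a token with its orthographic shape, a flag for purely alphabetic tokens, and a flag for stop words, which is roughly the kind of per-token view the abstract describes.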

No.	Author's full name	Position	Organization
1	Elov B.B.	Head of the Department of Computational Linguistics and Digital Technologies, Associate Professor, Doctor of Philosophy (PhD) in Technical Sciences	Alisher Navo'i Tashkent State University of Uzbek Language and Literature

