TECHNOLOGICAL BASIS OF “INDUSTRY 4.0”
ALB-Stanza: A Stanza-based parser for the Albanian Language
- 1 Polytechnic University of Tirana, Albania
- 2 University of Tirana, Albania
Abstract
Part-of-speech (POS) tagging, lemmatization, and dependency parsing are fundamental tasks in Natural Language Processing (NLP): they supply linguistic information that a wide range of NLP applications rely on. POS tagging assigns each word in a sentence its grammatical category. Lemmatization identifies the dictionary form of each word, taking into account how the word is used in the sentence. Dependency parsing determines the structural relationships between words, producing dependency trees that capture the grammatical organization of sentences. In this paper, we introduce ALB-Stanza, a neural pipeline parser for the Albanian language that performs sentence segmentation, tokenization, POS tagging, morphological feature annotation, lemmatization, and dependency parsing. We trained the ALB-Stanza models with the Stanza neural pipeline on our own corpora, annotated according to the Universal Dependencies schema. Evaluated on unseen data, the model proved effective at predicting POS tags, morphological features, lemmas, and dependency relations for Albanian text.
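To make the annotation layers described above concrete, the following minimal sketch reads a sentence in CoNLL-U, the tab-separated format used by Universal Dependencies treebanks, and compares a gold analysis with a predicted one. Everything in it is an assumption made for illustration: the Albanian sentence and its annotation are hand-written (not taken from our corpora and not ALB-Stanza output), the "predicted" analysis is fabricated by injecting two errors, and the unlabeled and labeled attachment scores (UAS, LAS) are the customary dependency-parsing metrics rather than the figures or exact evaluation procedure of this paper. The names `Token`, `parse_conllu`, and `attachment_scores` are hypothetical helpers.

```python
from dataclasses import dataclass

# Hand-written CoNLL-U fragment, for illustration only: it is NOT from the
# authors' corpus and NOT output of ALB-Stanza.
# Columns: ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC
GOLD = """\
# text = Unë lexoj një libër.
1\tUnë\tunë\tPRON\t_\t_\t2\tnsubj\t_\t_
2\tlexoj\tlexoj\tVERB\t_\t_\t0\troot\t_\t_
3\tnjë\tnjë\tDET\t_\t_\t4\tdet\t_\t_
4\tlibër\tlibër\tNOUN\t_\t_\t2\tobj\t_\t_
5\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_
"""

# Made-up "prediction": token 3 gets a wrong head, token 4 a wrong label.
PRED = GOLD.replace("4\tdet", "2\tdet").replace("2\tobj", "2\tnsubj")


@dataclass
class Token:
    id: int
    form: str
    lemma: str
    upos: str
    head: int      # 0 means the token is the root of the sentence
    deprel: str


def parse_conllu(text: str) -> list[Token]:
    """Parse the syntactic words of a single CoNLL-U sentence."""
    tokens = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if not cols[0].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        tokens.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                            int(cols[6]), cols[7]))
    return tokens


def attachment_scores(gold: list[Token], pred: list[Token]) -> tuple[float, float]:
    """Return (UAS, LAS): share of tokens with correct head / head and label."""
    assert len(gold) == len(pred), "sketch assumes identical tokenization"
    n = len(gold)
    uas = sum(g.head == p.head for g, p in zip(gold, pred)) / n
    las = sum(g.head == p.head and g.deprel == p.deprel
              for g, p in zip(gold, pred)) / n
    return uas, las


gold = parse_conllu(GOLD)
pred = parse_conllu(PRED)

# Each token carries its POS tag, lemma, and dependency relation to its head.
for tok in gold:
    head_form = "ROOT" if tok.head == 0 else gold[tok.head - 1].form
    print(f"{tok.form:6} lemma={tok.lemma:6} upos={tok.upos:6} "
          f"{tok.deprel} -> {head_form}")

uas, las = attachment_scores(gold, pred)
print(f"UAS={uas:.2f} LAS={las:.2f}")
```

With one wrong head and one wrong label among five tokens, the sketch yields a UAS of 4/5 and a LAS of 3/5. A full evaluation would also have to align system and gold tokens when tokenization differs, which this sketch deliberately leaves out.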