Toward Accurate Contextual Arabic Lemmatization Using a Deep Learning Approach

© 2025 by IJETT Journal
Volume-73 Issue-9
Year of Publication : 2025
Author : Driss Namly, Hakima Khamar, Karim Bouzoubaa, Fakhreldin Saeed
DOI : 10.14445/22315381/IJETT-V73I9P126

How to Cite?
Driss Namly, Hakima Khamar, Karim Bouzoubaa, Fakhreldin Saeed, "Toward Accurate Contextual Arabic Lemmatization Using a Deep Learning Approach," International Journal of Engineering Trends and Technology, vol. 73, no. 9, pp. 309-317, 2025. Crossref, https://doi.org/10.14445/22315381/IJETT-V73I9P126

Abstract
In the age of artificial intelligence, effective processing of unstructured textual data is critical, especially for morphologically rich languages such as Arabic. Lemmatization, the process of reducing words to their base or dictionary form, is a core step in many Natural Language Processing (NLP) applications. Arabic poses specific challenges because of its complex morphology, its lexical ambiguity, and the absence of diacritics in most texts. Existing Arabic lemmatizers often struggle with context-aware disambiguation, rely heavily on proprietary datasets, or produce morphological output too detailed for non-experts to use. This study introduces SafarLemmatizer2, an Arabic lemmatizer designed to address these limitations. Built upon the original SafarLemmatizer, the new tool integrates BiLSTM and BERT deep learning architectures to improve contextual lemma selection while maintaining the accuracy of SafarLemmatizer's context-free lemmatization. The study evaluates both architectures to determine which is better suited to contextual disambiguation and provides a scalable lemmatization tool suitable for diverse NLP tasks. SafarLemmatizer2 thus bridges the gap between traditional morphological analysis and modern deep learning-based approaches to Arabic NLP.
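
The two-stage idea described in the abstract, context-free candidate generation followed by contextual lemma selection, can be illustrated with a minimal Python sketch. Everything in it is a hypothetical assumption made for illustration: the toy lexicon `CANDIDATES`, the cue sets in `CUES`, and the functions `context_free_lemmas`, `score`, and `select_lemma` are not part of SafarLemmatizer2, and the cue-overlap score is only a simple stand-in for the BiLSTM or BERT model that the study uses to rank candidates.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicon: undiacritized surface form -> candidate lemmas.
# A real context-free lemmatizer would derive these from morphological analysis.
CANDIDATES: Dict[str, List[str]] = {
    "ذهب": ["ذَهَبَ", "ذَهَبٌ"],  # "to go" (verb) vs. "gold" (noun)
}

# Hypothetical context cues per lemma. In the paper's setting a neural model
# (BiLSTM or BERT) would score candidates; this overlap count is a stand-in.
CUES: Dict[str, Set[str]] = {
    CANDIDATES["ذهب"][0]: {"إلى", "الولد"},  # "to", "the boy"
    CANDIDATES["ذهب"][1]: {"خاتم", "من"},    # "ring", "of"
}


def context_free_lemmas(word: str) -> List[str]:
    """Return all candidate lemmas for a word, ignoring its context."""
    return CANDIDATES.get(word, [word])


def score(lemma: str, context: List[str]) -> int:
    """Count how many cue words of the lemma appear in the context."""
    return len(CUES.get(lemma, set()) & set(context))


def select_lemma(tokens: List[str], index: int) -> str:
    """Pick the candidate lemma of tokens[index] that best fits the sentence.

    On ties, max() keeps the first candidate, which acts as the fallback.
    """
    candidates = context_free_lemmas(tokens[index])
    context = tokens[:index] + tokens[index + 1:]
    return max(candidates, key=lambda lemma: score(lemma, context))


if __name__ == "__main__":
    print(select_lemma(["ذهب", "الولد", "إلى", "المدرسة"], 0))  # verb reading
    print(select_lemma(["خاتم", "من", "ذهب"], 2))               # noun reading
```

The sketch keeps candidate generation and candidate ranking as separate functions, mirroring the separation the abstract draws between context-free lemmatization and contextual selection; replacing `score` with a learned model would leave the rest of the pipeline unchanged.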

Keywords
Arabic NLP, Arabic contextual lemmatization, Deep learning, BERT, BiLSTM.
