Enhancing Natural Language Processing in Somali Text Classification: A Comprehensive Framework for Stop Word Removal

Abdullahi Ahmed Abdirahman; Abdirahman Osman Hashi; Ubaid Mohamed Dahir; Mohamed Abdirahman Elmi; Octavio Ernest Romo Rodriguez

doi:https://doi.org/10.14445/22315381/IJETT-V71I12P205

Research Article | Open Access

Volume 71 | Issue 12 | Year 2023 | Article ID: IJETT-V71I12P205 | DOI: https://doi.org/10.14445/22315381/IJETT-V71I12P205

Received	Revised	Accepted	Published
31 Jul 2023	12 Oct 2023	24 Oct 2023	06 Dec 2023

Citation:

Abdullahi Ahmed Abdirahman, Abdirahman Osman Hashi, Ubaid Mohamed Dahir, Mohamed Abdirahman Elmi, Octavio Ernest Romo Rodriguez, "Enhancing Natural Language Processing in Somali Text Classification: A Comprehensive Framework for Stop Word Removal," International Journal of Engineering Trends and Technology (IJETT), vol. 71, no. 12, pp. 40-49, 2023. Crossref, https://doi.org/10.14445/22315381/IJETT-V71I12P205

Abstract

Text classification is a prominent field of study in information retrieval and natural language processing (NLP), and a stop word list is one of its crucial components. The list identifies frequently occurring words that carry little relevance for classification and are therefore removed during pre-processing. Although various stop word lists have been devised for English, a standardized stop word list tailored to Somali text classification has yet to be established. This research presents a comprehensive framework for stop word removal in Somali, aiming to enhance the effectiveness of various NLP tasks. The proposed methodology comprises several steps: noise identification, noise removal, character normalization, data masking, tokenization, POS tagging, and lemmatization. Applied to a substantial dataset containing 79,741,231 tokens and 71,871,585 words, the framework demonstrates its capability to identify and eliminate stop words, thereby reducing the vector space and improving the performance of NLP algorithms. The research highlights the unique linguistic features of Somali, such as contextual variations and morphological complexities, and discusses potential applications of the developed stop word list in sentiment analysis, information retrieval, and document classification. This work contributes insights to the field of language technology, particularly for underrepresented languages, and paves the way for further advances in NLP models tailored to diverse linguistic contexts.
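
The pre-processing steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the regular expressions, the function names `preprocess` and `candidate_stop_words`, the thresholds `top_k` and `min_doc_ratio`, and the toy sentences are all assumptions made for illustration. POS tagging and lemmatization are omitted because the sketch assumes no Somali tagger or lemmatizer is available, and frequency-based ranking is used here only as a common stand-in for stop word identification.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical patterns; a real pipeline would tune these to its corpus.
TAG_RE = re.compile(r"<[^>]+>")             # noise: HTML-like tags
URL_RE = re.compile(r"https?://\S+")        # masked: URLs
EMAIL_RE = re.compile(r"\S+@\S+\.\S+")      # masked: e-mail addresses
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")  # masked: numbers
# Letters, optionally joined by internal apostrophes.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def preprocess(text):
    """Noise removal, character normalization, masking, tokenization."""
    text = TAG_RE.sub(" ", text)
    # Normalize Unicode forms, case, and curly apostrophes.
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.replace("\u2019", "'")
    # Mask out spans that should not be counted as words.
    for pattern in (URL_RE, EMAIL_RE, NUMBER_RE):
        text = pattern.sub(" ", text)
    return TOKEN_RE.findall(text)


def candidate_stop_words(documents, top_k=50, min_doc_ratio=0.5):
    """Return the most frequent tokens that also occur in many documents."""
    term_freq = Counter()
    doc_freq = Counter()
    n_docs = 0
    for doc in documents:
        tokens = preprocess(doc)
        n_docs += 1
        term_freq.update(tokens)
        doc_freq.update(set(tokens))  # count each token once per document
    if n_docs == 0:
        return []
    return [
        word
        for word, _ in term_freq.most_common(top_k)
        if doc_freq[word] / n_docs >= min_doc_ratio
    ]


# Toy sentences for illustration only; they are not from the paper's corpus.
docs = [
    "Cali iyo Faadumo waa arday.",
    "Shaah iyo caano waa cabitaan.",
    "Magaaladu waa weyn tahay.",
]
print(candidate_stop_words(docs, top_k=2, min_doc_ratio=0.6))
```

Requiring a minimum document ratio in addition to raw frequency is one way to avoid flagging a content word that is frequent in only a few documents; any list produced this way would still need review by a Somali speaker.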

Keywords

Somali language, Stopword removal, Natural Language Processing, Stopword list, Ontology.
