Must: Machine Learning Based Unsupervised Multi-Lingual Morpho-Semantic Textual Processor for Natural Languages

Anjali Bohra; Nemi Chand Barwar

doi:https://doi.org/10.14445/22315381/IJETT-V73I3P139

Research Article | Open Access

Volume 73 | Issue 3 | Year 2025 | Article Id. IJETT-V73I3P139 | DOI : https://doi.org/10.14445/22315381/IJETT-V73I3P139

Received: 18 Jul 2024 | Revised: 21 Jan 2025 | Accepted: 27 Jan 2025 | Published: 28 Mar 2025

Citation:

Anjali Bohra, Nemi Chand Barwar, "Must: Machine Learning Based Unsupervised Multi-Lingual Morpho-Semantic Textual Processor for Natural Languages," International Journal of Engineering Trends and Technology (IJETT), vol. 73, no. 3, pp. 554-560, 2025. Crossref, https://doi.org/10.14445/22315381/IJETT-V73I3P139

Abstract

A word is a continuous sequence of alphabetic characters that is classified and recognized by unique patterns or rules. The morphological structure of a word, in particular its suffixes (affixes), carries its syntactic and semantic representation, and grammatical information is marked through inflectional suffixes. Morphological analysis helps perceive a word's semantic and syntactic properties and can be implemented using morpheme-based, lexeme-based, or word-based approaches. Syntactic and semantic analysis is a classification process that places words in pre-defined groups. Karakas (cases) are the classes that specify the relationships between words in a sentence. This paper performs a multi-lingual semantic analysis of Sanskrit and English and then builds an unsupervised learning-based morphological processor for English. A word-embedding-based approach is used for the comparative analysis of the two languages, with datasets prepared from online textual repositories available for each. The results of this analysis motivate unsupervised morpho-semantic processors. The proposed PFMP algorithm performs morphological processing to extract the root of a word together with attributes such as number, gender, suffix, and karaka (case). The model is trained using the Keras deep learning framework with 15 nouns, 15 unique suffixes, and 255 unique inflections of the English language. With limited data and only 20 epochs, the model achieved a recall of 52 percent. The system can serve as a generalized platform for extracting linguistic information for a specific language when trained with language-specific grammatical knowledge.
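To make the idea of extracting a root word with its suffix and number concrete, the following is a minimal rule-based sketch in Python. It is not the PFMP algorithm or the Keras model described above; it only illustrates plain suffix stripping against a lexicon. The suffix table, the lexicon, and the names `SUFFIX_NUMBER`, `Analysis`, and `analyze` are assumptions introduced here for illustration.

```python
from typing import NamedTuple, Optional, Set

# Hypothetical suffix table: maps an English inflectional suffix to the
# grammatical number it marks. Illustrative only, not the paper's data.
SUFFIX_NUMBER = {
    "es": "plural",
    "s": "plural",
}


class Analysis(NamedTuple):
    root: str
    suffix: Optional[str]
    number: str


def analyze(word: str, lexicon: Set[str]) -> Analysis:
    """Strip the longest known suffix whose remainder is a known root."""
    word = word.lower()
    # Try longer suffixes first so "boxes" splits as "box" + "es".
    for suffix in sorted(SUFFIX_NUMBER, key=len, reverse=True):
        root = word[: -len(suffix)]
        if word.endswith(suffix) and root in lexicon:
            return Analysis(root, suffix, SUFFIX_NUMBER[suffix])
    # No suffix matched: treat the word itself as an uninflected root.
    return Analysis(word, None, "singular")


# Hypothetical toy lexicon of known noun roots.
LEXICON = {"box", "cat", "bus"}

for w in ("boxes", "cats", "bus"):
    print(w, analyze(w, LEXICON))
```

Checking the remainder against a lexicon keeps a root such as "bus" from being wrongly split into "bu" + "s". A learned processor would replace both the hand-written suffix table and the lexicon lookup with patterns induced from data.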

Keywords

Deep learning, Karaka Relations, Morphological processing, Natural language processing, Semantic analysis.
