Identification of Word Level Information Based Semantic Similarity Using Extended Glove Embeddings for Clustering and Classification Analysis

© 2024 by IJETT Journal
Volume-72 Issue-8
Year of Publication : 2024
Author : Rama Krishna Paladugu, Gangadhara Rao Kancherla
DOI : 10.14445/22315381/IJETT-V72I8P121

How to Cite?
Rama Krishna Paladugu, Gangadhara Rao Kancherla, "Identification of Word Level Information Based Semantic Similarity Using Extended Glove Embeddings for Clustering and Classification Analysis," International Journal of Engineering Trends and Technology, vol. 72, no. 8, pp. 212-227, 2024. Crossref, https://doi.org/10.14445/22315381/IJETT-V72I8P121

Abstract
This article presents an enhanced methodology for document representation, clustering, and classification built on the Extended GloVe (ExGloVe) algorithm. ExGloVe extends the traditional GloVe model with subword information and domain-specific adaptations, addressing GloVe's limitations in capturing semantic nuances and domain-specific language variations. Subword information lets the algorithm represent rare and out-of-vocabulary words more faithfully, improving the expressiveness and robustness of the embeddings, while domain-specific adaptations tailor the embeddings to a target domain, capturing its semantics and improving performance on domain-specific tasks. Document-level embeddings, obtained by aggregating the word-level ExGloVe vectors, serve as input features for clustering algorithms such as K-Means, DBSCAN, and Hierarchical Clustering, and for classification models including Support Vector Machines, Logistic Regression, and Neural Networks, all of which exploit the semantic richness encoded in the ExGloVe embeddings. Experiments on document similarity measurement, clustering, and classification tasks, evaluated with multiple metrics, validate the efficacy of the proposed methodology.
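The abstract's pipeline can be illustrated with a minimal sketch. The Python code below is not taken from the paper: it assumes GloVe-style pretrained word vectors held in a dictionary, stands in for ExGloVe's subword handling with a simple character n-gram fallback for out-of-vocabulary words, mean-pools word vectors into document-level embeddings, and feeds those embeddings to scikit-learn's K-Means and a Support Vector Machine, mirroring the clustering and classification stages of the methodology. All names and the toy data are illustrative assumptions.

# Minimal sketch of the document-embedding pipeline described in the abstract.
# Assumptions (not from the paper): pretrained GloVe-style vectors as a
# {word: np.ndarray} dict; OOV words fall back to averaged character n-gram
# vectors (a fastText-style stand-in for ExGloVe's subword handling);
# documents are embedded by mean-pooling their word vectors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def ngrams(word, n=3):
    """Character n-grams with boundary markers, e.g. 'cat' -> '<ca', 'cat', 'at>'."""
    w = f"<{word}>"
    return [w[i:i + n] for i in range(len(w) - n + 1)]

def word_vector(word, word_vecs, subword_vecs, dim):
    """Look up a word vector; fall back to averaged subword vectors for OOV words."""
    if word in word_vecs:
        return word_vecs[word]
    grams = [subword_vecs[g] for g in ngrams(word) if g in subword_vecs]
    return np.mean(grams, axis=0) if grams else np.zeros(dim)

def document_embedding(tokens, word_vecs, subword_vecs, dim=50):
    """Aggregate word-level vectors into one document-level vector by mean pooling."""
    vecs = [word_vector(t, word_vecs, subword_vecs, dim) for t in tokens]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

# Toy inputs standing in for ExGloVe embeddings and a labelled corpus.
rng = np.random.default_rng(0)
vocab = ["patient", "diagnosis", "contract", "clause"]
word_vecs = {w: rng.normal(size=50) for w in vocab}
subword_vecs = {g: rng.normal(size=50) for w in vocab for g in ngrams(w)}

docs = [["patient", "diagnosis"], ["contract", "clause"],
        ["diagnosis", "patientcare"], ["clause", "contracts"]]
labels = [0, 1, 0, 1]

X = np.vstack([document_embedding(d, word_vecs, subword_vecs) for d in docs])

# Document embeddings feed both unsupervised clustering and supervised classification.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
classifier = SVC(kernel="linear").fit(X, labels)
print(clusters, classifier.predict(X))

The ExGloVe training objective, its subword composition function, and its domain-adaptation step are the paper's contribution and are not reproduced here; mean pooling is used only as a placeholder aggregation strategy.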

Keywords
ExGloVe algorithm, Subword incorporation, Domain-specific adaptations, Document similarity measurement, Clustering and classification, Natural language processing.
