An Enhanced Topic Extraction Model for Medical PubMed Documents using State-of-the-Art Algorithms

K.T. Mathuna; I. Elizabeth Shanthi

doi:https://doi.org/10.14445/22315381/IJETT-V74I6P101

Research Article | Open Access

Volume 74 | Issue 6 | Year 2026 | Article Id. IJETT-V74I6P101 | DOI : https://doi.org/10.14445/22315381/IJETT-V74I6P101

Received	Revised	Accepted	Published
08 Jun 2024	11 Apr 2025	20 Apr 2026	27 Jun 2026

Citation:

K.T. Mathuna, I. Elizabeth Shanthi, "An Enhanced Topic Extraction Model for Medical PubMed Documents using State-of-the-Art Algorithms," International Journal of Engineering Trends and Technology (IJETT), vol. 74, no. 6, pp. 1-16, 2026. Crossref, https://doi.org/10.14445/22315381/IJETT-V74I6P101

Abstract

The rapid growth of medical literature databases presents both a challenge and an opportunity for pharmacovigilance. Medical abstracts are dense with specialized terminology and complex sentences, which makes extracting meaningful insights about adverse drug effects difficult. This paper addresses the problem of extracting topics related to adverse drug effects from PubMed medical abstracts using advanced topic modelling methods. It enhances four models, Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), Long Short-Term Memory (LSTM), and Recurrent Neural Network (RNN), with two optimization algorithms, grid search and Bayesian optimization, and assesses the performance of each combination. The experimental results show that LDA tuned with Bayesian optimization achieves the highest coherence score, 0.605, outperforming the other models. The coherence results, presented in a detailed comparison table, show the performance of each model and optimization method.

Keywords

Topic Modeling, LDA, LSA, LSTM, RNN, Grid Search.
