Research Article | Open Access
Volume 74 | Issue 6 | Year 2026 | Article Id. IJETT-V74I6P101 | DOI : https://doi.org/10.14445/22315381/IJETT-V74I6P101
An Enhanced Topic Extraction Model for Medical PubMed Documents using State-of-the-Art Algorithms
K.T. Mathuna, I. Elizabeth Shanthi
| Received | Revised | Accepted | Published |
|---|---|---|---|
| 08 Jun 2026 | 11 Apr 2026 | 20 Apr 2026 | 27 Jun 2026 |
Citation :
K.T. Mathuna, I. Elizabeth Shanthi, "An Enhanced Topic Extraction Model for Medical PubMed Documents using State-of-the-Art Algorithms," International Journal of Engineering Trends and Technology (IJETT), vol. 74, no. 6, pp. 1-16, 2026. Crossref, https://doi.org/10.14445/22315381/IJETT-V74I6P101
Abstract
The rapid growth of medical literature databases presents both a challenge and an opportunity for pharmacovigilance. Medical abstracts are dense with specialized terms and complex sentences, which makes extracting meaningful insights about adverse drug effects difficult. This paper addresses the problem of extracting topics related to adverse drug effects from PubMed medical abstracts using advanced topic modelling methods. It enhances four models, Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), Long Short-Term Memory (LSTM), and Recurrent Neural Network (RNN), with two optimization algorithms, grid search and Bayesian optimization, and assesses the performance of each combination. The experimental results show that LDA tuned with Bayesian optimization achieves the highest coherence score, 0.605, outperforming the other models. A comparison table of coherence results shows the performance of each model and optimization method.
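As a rough illustration of the evaluation loop the abstract describes, the sketch below scores a topic by coherence and runs an exhaustive grid search over hyperparameters. It is not the authors' implementation. The abstract does not say which coherence measure was used, so the UMass-style measure (averaged over word pairs) is an assumption, and the names `umass_coherence` and `grid_search` and the toy corpus are hypothetical. Bayesian optimization is not shown.

```python
import itertools
import math


def umass_coherence(topic_words, documents):
    """UMass-style coherence of one topic, averaged over word pairs.

    Assumption: the paper's actual coherence measure is not stated;
    this variant is used purely for illustration.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents containing all of the given words.
        return sum(1 for d in doc_sets if all(w in d for w in words))

    scores = []
    for i in range(1, len(topic_words)):
        for j in range(i):
            denom = doc_freq(topic_words[j])
            if denom == 0:
                continue  # skip words that never occur in the corpus
            joint = doc_freq(topic_words[i], topic_words[j])
            scores.append(math.log((joint + 1) / denom))
    return sum(scores) / len(scores) if scores else float("-inf")


def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every parameter combination; keep the best."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy corpus of tokenized abstracts (hypothetical data).
docs = [
    ["drug", "rash", "nausea"],
    ["drug", "rash"],
    ["drug", "nausea"],
    ["trial", "placebo"],
]
coherent_topic = ["drug", "rash", "nausea"]
mixed_topic = ["drug", "placebo", "trial"]
print(umass_coherence(coherent_topic, docs), umass_coherence(mixed_topic, docs))
```

In practice `score_fn` would train a topic model with the given hyperparameters (for example, the number of topics) and return its coherence. Bayesian optimization replaces the exhaustive loop with a surrogate model that proposes the next configuration to try.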
Keywords
Topic Modeling, LDA, LSA, LSTM, RNN, and Grid Search.