GuHPD: A Transformer-Driven Approach to Hostile Post Detection in Gujarati

Jagruti Boda; Keyur Rana

doi:https://doi.org/10.14445/22315381/IJETT-V73I7P109

Research Article | Open Access

Volume 73 | Issue 7 | Year 2025 | Article Id. IJETT-V73I7P109 | DOI: https://doi.org/10.14445/22315381/IJETT-V73I7P109

Received	Revised	Accepted	Published
22 Mar 2025	13 Jun 2025	16 Jun 2025	30 Jul 2025

Citation:

Jagruti Boda, Keyur Rana, "GuHPD: A Transformer-Driven Approach to Hostile Post Detection in Gujarati," International Journal of Engineering Trends and Technology (IJETT), vol. 73, no. 7, pp. 85-104, 2025. Crossref, https://doi.org/10.14445/22315381/IJETT-V73I7P109

Abstract

In recent years, social media has become a prominent medium for public expression; however, it is increasingly exploited to disseminate hostility, particularly against individuals, communities, and religious groups. Religious hate speech can cause profound societal and psychological harm, underscoring the urgent need for automated detection systems. While considerable progress has been made in English-language hate speech detection, limited efforts have addressed low-resource languages such as Gujarati. To bridge this gap, this study presents the Gujarati Hostile Posts Detection (GuHPD) dataset, comprising approximately 14,800 manually annotated comments aimed at identifying hostile content in Gujarati. The dataset supports two core tasks: (i) binary classification to differentiate hostile and non-hostile posts, and (ii) multi-class classification to identify hostile subtypes, including hate speech, fake news, defamation, and offensive language. Annotation reliability was assessed using Fleiss' Kappa, indicating substantial agreement. Several transformer-based models were evaluated, with Multilingual BERT demonstrating the highest performance, achieving an accuracy of 0.93 for binary classification and 0.78 for multi-class classification. These findings demonstrate the utility of the GuHPD dataset in advancing hostile content detection for underrepresented languages and provide a benchmark for future research in regional NLP applications.
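The abstract reports annotation reliability via Fleiss' Kappa but does not give the computation here. As a minimal sketch of how that statistic is typically calculated, the Python below takes a table of per-item category counts and returns kappa. The function name `fleiss_kappa`, the three-rater setup, and the toy counts are illustrative assumptions, not the authors' data or code. Under the commonly used Landis and Koch scale, values between 0.61 and 0.80 are read as "substantial" agreement.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for a table where counts[i][j] is the number of
    raters who assigned item i to category j.

    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two raters per item")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")

    n_categories = len(counts[0])

    # Share of all assignments that went to each category.
    p_j = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]

    # Observed agreement per item: fraction of rater pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]

    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance

    if p_e == 1.0:
        # All ratings fall in one category; kappa is undefined.
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


# Hypothetical toy table: 4 comments, 3 annotators, 2 categories
# (column 0 = hostile, column 1 = non-hostile). Not taken from GuHPD.
toy_counts = [
    [3, 0],
    [0, 3],
    [2, 1],
    [0, 3],
]
print(round(fleiss_kappa(toy_counts), 3))
```

The same function applies unchanged to the multi-class setting: each row would simply have one column per hostile subtype plus a non-hostile column.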

Keywords

Coarse-grained text classification, Deep Learning, Fine-grained text classification, Gujarati dataset, Hostile post.
