Contrastive Bidirectional Cross-Modal Attention Framework for Enhanced Multimodal Sentiment Analysis

Prashant Adakane; Amit Gaikwad

doi:https://doi.org/10.14445/22315381/IJETT-V74I6P124

Research Article | Open Access

Volume 74 | Issue 6 | Year 2026 | Article Id. IJETT-V74I6P124 | DOI: https://doi.org/10.14445/22315381/IJETT-V74I6P124

Received	Revised	Accepted	Published
19 Jan 2026	07 Apr 2026	20 Apr 2026	27 Jun 2026

Citation:

Prashant Adakane, Amit Gaikwad, "Contrastive Bidirectional Cross-Modal Attention Framework for Enhanced Multimodal Sentiment Analysis," International Journal of Engineering Trends and Technology (IJETT), vol. 74, no. 6, pp. 343-364, 2026. Crossref, https://doi.org/10.14445/22315381/IJETT-V74I6P124

Abstract

The rapid growth of social media content has produced a surge of multimodal data, making it challenging to model the relationships between visual and textual information. Most existing methods fail to capture fine-grained text-to-visual and visual-to-text interactions, which degrades sentiment analysis performance. This paper presents a Contrastive Bidirectional Cross-Modal Attention (C-BCMA) model that improves the correspondence between textual and visual representations by learning a shared latent space. A CLIP-inspired attention mechanism produces robust cross-modal latent features and strengthens the joint representation. Textual features are extracted with ALBERT, and visual features with EfficientNet-B2. A multi-head attention mechanism learns the interactions between modalities, handling textual and visual information jointly during learning. This reduces the gap between the two modalities and lets the model process multiple semantic cues at once. Contrastive learning aligns matching text-image pairs and separates unrelated ones, yielding better multimodal representations. The model outperforms baseline approaches on both the single-annotation and multiple-annotation versions of the MVSA dataset across several evaluation metrics. It also handles less obvious expressions such as sarcasm and implicit sentiment more effectively, improving the interpretation of sentiment in multimodal social media data.
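
The contrastive alignment step described above can be illustrated with a minimal sketch. The abstract does not state the exact loss, so a CLIP-style symmetric InfoNCE objective is assumed here. The function names, the temperature of 0.07, and the toy two-dimensional embeddings are illustrative assumptions, not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def symmetric_contrastive_loss(text_emb, image_emb, temperature=0.07):
    """Assumed CLIP-style loss: text_emb[i] and image_emb[i] form a matched pair,
    and every other item in the batch acts as a negative."""
    t = [l2_normalize(v) for v in text_emb]
    v = [l2_normalize(u) for u in image_emb]
    n = len(t)

    # Temperature-scaled cosine similarity between every text and every image.
    logits = [
        [sum(a * b for a, b in zip(t[i], v[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Text-to-image direction: each row should peak on its own diagonal entry.
    t2i = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Image-to-text direction: the same check applied to the columns.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    i2t = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (t2i + i2t)


# Toy batch: matched pairs point the same way, mismatched pairs are orthogonal.
aligned = symmetric_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
shuffled = symmetric_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
print(aligned < shuffled)  # the aligned batch gives the lower loss
```

In this sketch the loss is close to zero when each text embedding is most similar to its own image and grows when the pairs are mismatched, which is the behaviour the abstract attributes to contrastive learning. In a full model the embeddings would come from the text and image encoders rather than hand-written vectors.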

Keywords

ALBERT, CLIP, Cross-Modal Attention, EfficientNet-B2, Feature Fusion, Multimodal Sentiment Analysis.
