International Journal of Engineering Trends and Technology

Research Article | Open Access
Volume 74 | Issue 1 | Year 2026 | Article Id. IJETT-V74I1P126 | DOI : https://doi.org/10.14445/22315381/IJETT-V74I1P126

AttenTAVO-Cap: A Hybrid Deep Learning and Metaheuristic Approach for Image Captioning


Chengamma Chitteti, K. Reddy Madhavi

Received: 04 Aug 2025 | Revised: 17 Nov 2025 | Accepted: 25 Nov 2025 | Published: 14 Jan 2026

Citation:

Chengamma Chitteti, K. Reddy Madhavi, "AttenTAVO-Cap: A Hybrid Deep Learning and Metaheuristic Approach for Image Captioning," International Journal of Engineering Trends and Technology (IJETT), vol. 74, no. 1, pp. 333-354, 2026. Crossref, https://doi.org/10.14445/22315381/IJETT-V74I1P126

Abstract

Image captioning, a problem at the intersection of natural language processing and computer vision, remains difficult because of the inherent challenge of converting visual semantics into semantically rich textual descriptions. Metaheuristic optimization combined with neural network architectures has recently shown strong potential for bridging this gap. In this work, we present AttenTAVO-Cap, a novel hybrid image captioning model that integrates an Attention-based Convolutional Neural Network (CNN) and Bi-directional Gated Recurrent Unit (Bi-GRU) architecture with the recently proposed Taylor African Vulture Optimization (TAVO) algorithm. The TAVO algorithm, inspired by the cooperative hunting behavior of African vultures and augmented with the convergence properties of the Taylor series, is used to optimize the model's hyperparameters effectively. To assess performance comprehensively, experiments were conducted on two benchmark datasets, Flickr8k and Flickr30k, with three optimizers: TAVO, the Genetic Algorithm (GA), and Particle Swarm Optimization (PSO). The results show that AttenTAVO-Cap (TAVO) outperformed all other models across the full suite of evaluation metrics, with a BLEU-4 score of 0.29, METEOR of 38, CIDEr of 194, and ROUGE-L of 67 on the Flickr8k corpus, and 0.29, 35, 191, and 63, respectively, on Flickr30k. Compared to baseline approaches such as HABGRU + AVOA, the proposed approach achieved considerable improvements, especially on semantic-alignment and human-consensus-based measures. The results demonstrate that combining hybrid Deep Learning (DL) with nature-inspired optimization can produce captions that are more accurate and human-like. The study also opens avenues for exploring the explainability and generalizability of captioning models.
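As a concrete illustration of the hybrid scheme described in the abstract, the sketch below runs a vulture-inspired metaheuristic over a small hyperparameter space standing in for the CNN + Bi-GRU captioner. The paper's exact TAVO update equations, search space, and fitness definition are not given here, so the AVOA-style position update, the Taylor-style first-order smoothing term, the chosen hyperparameters (learning rate, GRU units, dropout), and the synthetic fitness surrogate are all illustrative assumptions; in practice the fitness would be a validation metric such as BLEU-4 obtained by training the captioner.

```python
# A minimal sketch, assuming TAVO follows the general African Vulture
# Optimization (AVOA) pattern with a Taylor-series-style smoothing of the
# position update. The search space and fitness are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical hyperparameter bounds: (learning rate, GRU units, dropout).
LOWER = np.array([1e-4, 64.0, 0.1])
UPPER = np.array([1e-2, 512.0, 0.5])
DIM, POP, ITERS = 3, 10, 20

def fitness(x):
    """Placeholder for the validation score (e.g. BLEU-4) of a CNN + Bi-GRU
    captioner trained with hyperparameters x; here a synthetic surrogate
    peaking at lr=1e-3, units=256, dropout=0.3."""
    lr, units, drop = x
    return -((np.log10(lr) + 3.0) ** 2
             + ((units - 256.0) / 256.0) ** 2
             + (drop - 0.3) ** 2)

pop = rng.uniform(LOWER, UPPER, size=(POP, DIM))      # initial vulture flock
scores = np.array([fitness(x) for x in pop])

for t in range(ITERS):
    best = pop[np.argmax(scores)].copy()              # best vulture guides the flock
    F = 2.0 * (1.0 - t / ITERS) * (2 * rng.random() - 1)  # decaying exploration factor
    for i in range(POP):
        step = F * np.abs(best - pop[i]) * rng.random(DIM)
        cand = best - step                            # move toward the best vulture
        # Taylor-style first-order smoothing: blend current and candidate positions,
        # shrinking the step as iterations progress.
        cand = pop[i] + (cand - pop[i]) * (1.0 - t / ITERS)
        cand = np.clip(cand, LOWER, UPPER)
        s = fitness(cand)
        if s > scores[i]:                             # greedy replacement
            pop[i], scores[i] = cand, s

best = pop[np.argmax(scores)]
print(f"lr={best[0]:.2e}, units={int(best[1])}, dropout={best[2]:.2f}")
```

Under these assumptions the loop converges toward the surrogate's optimum; swapping the surrogate for an actual train-and-evaluate routine is what makes the search expensive, which is why the population and iteration budget stay small.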

Keywords

Deep Learning, Flickr8k, Flickr30k, Genetic Algorithm (GA), Image Captioning, Metaheuristic Optimization, RoBERTa Embeddings, Taylor-African Vulture Optimization Algorithm (TAVO), Particle Swarm Optimization (PSO), Neural Architecture Optimization, Visual Attention, Bidirectional LSTM (BiLSTM).

References

[1] Jiajun Du et al., “Attend More Times for Image Captioning,” arXiv Preprint, pp. 1-8, 2018.
[CrossRef] [Google Scholar] [Publisher Link]

[2] Roberto Castro et al., “Deep Learning Approaches based on Transformer Architectures for Image Captioning Tasks,” IEEE Access, vol. 10, pp. 33679-33694, 2022.
[CrossRef] [Google Scholar] [Publisher Link]

[3] Jun Yu et al., “Multimodal Transformer with Multi-View Visual Representation for Image Captioning,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 12, pp. 4467-4480, 2020.
[CrossRef] [Google Scholar] [Publisher Link]

[4] Feicheng Huang et al., “Boost Image Captioning with Knowledge Reasoning,” Machine Learning, vol. 109, no. 12, pp. 2313-2332, 2020.
[CrossRef] [Google Scholar] [Publisher Link]

[5] Zhihong Zeng, and Xiaowen Li, “Application of Human Computing in Image Captioning Under Deep Learning,” Microsystem Technologies, vol. 27, no. 4, pp. 1687-1692, 2021.
[CrossRef] [Google Scholar] [Publisher Link]

[6] Matteo Stefanini et al., “From Show to Tell: A Survey on Deep Learning-based Image Captioning,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 1, pp. 539-559, 2023.
[CrossRef] [Google Scholar] [Publisher Link]

[7] Xiaowei Hu et al., “Scaling Up Vision-Language Pre-Training for Image Captioning,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, pp. 17980-17989, 2022.
[CrossRef] [Google Scholar] [Publisher Link]

[8] Yehao Li et al., “Comprehending and Ordering Semantics for Image Captioning,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, pp. 17969-17978, 2022.
[CrossRef] [Google Scholar] [Publisher Link]

[9] Mithilesh Mahajan et al., “Image Captioning-A Comprehensive Encoder-Decoder Approach on Flickr8K,” 2025 International Conference on Automation and Computation (AUTOCOM), Dehradun, India, pp. 1310-1315, 2025.
[CrossRef] [Google Scholar] [Publisher Link]

[10] Valavala S.S.S.R. Manikumar, and G. Bharathi Mohan, “Comparative Study of Deep Learning Algorithms for Image Caption Generation,” Proceedings of the International Conference on Advances and Applications in Artificial Intelligence (ICAAAI 2025), pp. 160-178, 2025.
[CrossRef] [Google Scholar] [Publisher Link]

[11] Ahmad Maaz et al., “VGG Models in Image Captioning: Which Architecture Delivers Better Descriptions?,” 2024 18th International Conference on Open Source Systems and Technologies (ICOSST), Lahore, Pakistan, pp. 1-6, 2024.
[CrossRef] [Google Scholar] [Publisher Link]

[12] Jianjun Xia et al., “Research on Image Tibetan Caption Generation Method Fusion Attention Mechanism,” 2023 IEEE 4th International Conference on Pattern Recognition and Machine Learning (PRML), Urumqi, China, pp. 193-198, 2023.
[CrossRef] [Google Scholar] [Publisher Link]

[13] Duy Thuy Thi Nguyen, and Hai Thanh Nguyen, “Image Caption Generator with a Combination Between Convolutional Neural Network and Long Short-Term Memory,” Biomedical and Other Applications of Soft Computing, Springer, Cham, pp. 225-238, 2022.
[CrossRef] [Google Scholar] [Publisher Link]

[14] Liming Xu et al., “Deep Image Captioning: A Review of Methods, Trends and Future Challenges,” Neurocomputing, vol. 546, 2023.
[CrossRef] [Google Scholar] [Publisher Link]

[15] Khang Nhut Lam et al., “Vision Transformer and Bidirectional RoBERTa: A Hybrid Image Captioning Model Between VirTex and CPTR,” International Advanced Computing Conference, Kolhapur, India, vol. 1781, pp. 124-137, 2023.
[CrossRef] [Google Scholar] [Publisher Link]

[16] Christian Szegedy et al., “Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, pp. 4278-4284, 2017.
[CrossRef] [Google Scholar] [Publisher Link]

[17] P. Hemashree et al., “Recuperating Image Captioning with Genetic Algorithm and Red Deer Optimization: A Comparative Study,” International Conference on Data Science and Applications, Jaipur, India, vol. 821, pp. 375-385, 2024.
[CrossRef] [Google Scholar] [Publisher Link]

[18] James Kennedy, and Russell C. Eberhart, “Particle Swarm Optimization,” Proceedings of ICNN'95 - International Conference on Neural Networks, Perth, WA, Australia, vol. 4, pp. 1942-1948, 1995.
[CrossRef] [Google Scholar] [Publisher Link]

[19] Tariq Shahzad et al., “Mamba-Caption: Long-Range Sequence Modelling for Efficient and Accurate Image Captioning,” Array, vol. 28, pp. 1-13, 2025.
[CrossRef] [Google Scholar] [Publisher Link]

[20] Madhvi Patel et al., “Enhanced Image Captioning with Advanced Context-Aware Object Relational Model,” Discover Computing, vol. 28, no. 1, pp. 1-27, 2025.
[CrossRef] [Google Scholar] [Publisher Link]