Research Article | Open Access
Volume 74 | Issue 1 | Year 2026 | Article Id. IJETT-V74I1P126 | DOI: https://doi.org/10.14445/22315381/IJETT-V74I1P126

AttenTAVO-Cap: A Hybrid Deep Learning and Metaheuristic Approach for Image Captioning
Chengamma Chitteti, K. Reddy Madhavi
| Received | Revised | Accepted | Published |
|---|---|---|---|
| 04 Aug 2025 | 17 Nov 2025 | 25 Nov 2025 | 14 Jan 2026 |
Citation:
Chengamma Chitteti, K. Reddy Madhavi, "AttenTAVO-Cap: A Hybrid Deep Learning and Metaheuristic Approach for Image Captioning," International Journal of Engineering Trends and Technology (IJETT), vol. 74, no. 1, pp. 333-354, 2026. Crossref, https://doi.org/10.14445/22315381/IJETT-V74I1P126
Abstract
Image captioning, at the intersection of natural language processing and computer vision, remains difficult because of the inherent challenge of converting visual semantics into semantically rich text descriptions. Metaheuristic optimization combined with neural network architectures has recently shown strong potential for bridging this gap. In this work, we present AttenTAVO-Cap, a novel hybrid image captioning model that integrates an Attention-based Convolutional Neural Network (CNN) and Bi-directional Gated Recurrent Unit (Bi-GRU) architecture with the recently proposed Taylor African Vulture Optimization (TAVO) algorithm. The TAVO algorithm, inspired by African vultures' cooperative hunting behavior and augmented by Taylor series convergence properties, is used to optimize model hyperparameters efficiently. To assess performance comprehensively, experiments were conducted on two benchmark datasets, Flickr8k and Flickr30k, with three optimizer variants: TAVO, the Genetic Algorithm (GA), and Particle Swarm Optimization (PSO). The results validated that AttenTAVO-Cap (TAVO) outperformed all the other models across the evaluation suite, achieving a BLEU-4 score of 0.29, METEOR of 38, CIDEr of 194, and ROUGE-L of 67 on the Flickr8k corpus, and 0.29, 35, 191, and 63, respectively, on Flickr30k. Compared with baseline approaches such as HABGRU + AVOA, the approach outlined here made considerable improvements, especially on semantic-alignment and human-consensus based measures. The results show that hybrid Deep Learning (DL) and nature-inspired optimization can produce captions that are more accurate and human-like. Additionally, the present study opens avenues for exploring the explainability and generalizability of captioning models.
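To make the optimization loop in the abstract concrete, the sketch below shows a generic population-based hyperparameter search of the kind that TAVO, GA, and PSO all instantiate. It is a minimal illustration only: the paper's exact TAVO update rules (including the Taylor-series component) are not reproduced, and the hyperparameter names, bounds, update rule, and placeholder fitness function are all hypothetical. In the actual system, the fitness of a candidate would be a validation metric (e.g., BLEU-4) of the Attention-CNN/Bi-GRU captioner trained with those hyperparameters on Flickr8k or Flickr30k.

```python
import random

# Hypothetical search space for the captioner's hyperparameters
# (names and bounds are illustrative, not taken from the paper).
SPACE = {
    "learning_rate": (1e-4, 1e-2),
    "gru_units":     (128, 512),
    "dropout":       (0.1, 0.5),
}

def random_candidate():
    """Sample one hyperparameter vector uniformly from the space."""
    return {k: random.uniform(lo, hi) for k, (lo, hi) in SPACE.items()}

def fitness(params):
    """Placeholder objective. In practice: train the Attention-CNN +
    Bi-GRU captioner with `params` and return a validation score
    such as BLEU-4 on a held-out split."""
    return -((params["learning_rate"] - 3e-3) ** 2)  # dummy peak at lr=3e-3

def vulture_style_search(pop_size=10, iters=20, step=0.3):
    """Generic population-based search: each candidate moves toward
    the current best with a decaying random perturbation, loosely
    mirroring the exploitation phase of vulture-inspired optimizers."""
    pop = [random_candidate() for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for t in range(iters):
        decay = 1.0 - t / iters  # shrink exploration over time
        for i, cand in enumerate(pop):
            new = {}
            for k, (lo, hi) in SPACE.items():
                # Move toward the best solution, plus bounded noise.
                move = cand[k] + step * (best[k] - cand[k])
                noise = decay * random.uniform(-1, 1) * (hi - lo) * 0.1
                new[k] = min(hi, max(lo, move + noise))
            if fitness(new) > fitness(cand):  # greedy acceptance
                pop[i] = new
        best = max(pop + [best], key=fitness)
    return best

if __name__ == "__main__":
    print(vulture_style_search())
```

Swapping the inner update rule for crossover/mutation or velocity updates would yield GA- and PSO-style baselines analogous to the ones compared in the paper.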
Keywords
Deep Learning, Flickr8k, Flickr30k, Genetic Algorithm (GA), Image Captioning, Metaheuristic Optimization, RoBERTa Embeddings, Taylor-African Vulture Optimization Algorithm (TAVO), Particle Swarm Optimization (PSO), Neural Architecture Optimization, Visual Attention, Bidirectional LSTM (BiLSTM).