An Intuitive Approach with IAT Model for Image Captioning and Labelling Analysis Using Light and Deep CNN
© 2025 by IJETT Journal
Volume-73 Issue-7
Year of Publication : 2025
Author : V. Chandra Sekhar Reddy, S. Jessica Saritha
DOI : 10.14445/22315381/IJETT-V73I7P130
How to Cite?
V. Chandra Sekhar Reddy, S. Jessica Saritha, "An Intuitive Approach with IAT Model for Image Captioning and Labelling Analysis Using Light and Deep CNN," International Journal of Engineering Trends and Technology, vol. 73, no. 7, pp. 383-401, 2025. Crossref, https://doi.org/10.14445/22315381/IJETT-V73I7P130
Abstract
Image captioning is essential in image analysis, with emphasis placed on modelling the functional and spatial aspects of an image through intuitive models. Using the Flickr8k dataset, the proposed framework builds the Invasive Augmented Transform (IAT) model on a Deep CNN backbone. The methodology combines building blocks such as GAN, LSTM, Oz-Net, and Inception-Net, achieving a testing precision of 93%. To meet the computational demands of larger samples, the IAT model is extended into 18- and 36-layer variants with accuracy reaching 99%: the 18-layer model is optimized for training efficiency (97% accuracy), while the 36-layer model adds complexity for a further 2% gain in accuracy. The IAT model uniquely augments text with images to represent intricate processes, using segmentation filters to refine caption coherence. Experimental results show that the IAT algorithm surpasses state-of-the-art architectures by 6% in accuracy and reduces execution time by 30%, with strong performance on metrics such as accuracy and BLEU for image labelling.
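As a rough illustration of the CNN-plus-LSTM building blocks mentioned above, the sketch below pairs a generic convolutional encoder with an LSTM decoder in PyTorch. It is only a minimal skeleton under stated assumptions: the names CNNEncoder and LSTMDecoder, the use of ResNet-18 as a stand-in 18-layer backbone, and all hyperparameters are illustrative, and none of the IAT-specific components from the paper (invasive augmentation, Oz-Net, segmentation filters) are reproduced here.

```python
# Minimal encoder-decoder captioning sketch (illustrative only, not the IAT model).
import torch
import torch.nn as nn
import torchvision.models as models

class CNNEncoder(nn.Module):
    """Encode an image into a fixed-size feature vector with a CNN backbone."""
    def __init__(self, embed_size):
        super().__init__()
        backbone = models.resnet18(weights=None)  # stand-in 18-layer CNN, no pretrained weights
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the final fc layer
        self.fc = nn.Linear(backbone.fc.in_features, embed_size)

    def forward(self, images):
        x = self.features(images).flatten(1)      # (B, 512) pooled CNN features
        return self.fc(x)                         # (B, embed_size)

class LSTMDecoder(nn.Module):
    """Generate caption-token logits from the image feature and previous tokens."""
    def __init__(self, embed_size, hidden_size, vocab_size):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features, captions):
        # Prepend the image feature as the first step of the input sequence.
        embeddings = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hidden, _ = self.lstm(embeddings)
        return self.fc(hidden)                    # (B, T+1, vocab_size)

# Toy forward pass on random data to show the shapes involved.
encoder, decoder = CNNEncoder(256), LSTMDecoder(256, 512, vocab_size=5000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 5000, (2, 12))
logits = decoder(encoder(images), captions)
print(logits.shape)  # torch.Size([2, 13, 5000])
```

A BLEU score for the generated captions, one of the metrics reported for the IAT model, could then be computed against the reference captions with, for example, nltk.translate.bleu_score.corpus_bleu.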
Keywords
Image labelling, Deep Learning, Convolutional Neural Networks, LSTM, Invasive Augmented Transform.