An Intelligent Image Captioning Generator using Multi-Head Attention Transformer
How to Cite?
Jansi Rani. J, Kirubagari. B, "An Intelligent Image Captioning Generator using Multi-Head Attention Transformer," International Journal of Engineering Trends and Technology, vol. 69, no. 12, pp. 267-279, 2021. Crossref, https://doi.org/10.14445/22315381/IJETT-V69I12P232
Abstract
Recently, advances in artificial intelligence (AI) techniques have gained significant attention among research communities. At the same time, image captioning has become an essential process in scene understanding, involving the automated generation of natural language descriptions based on the content present in an image. Consequently, the applicability of the image captioning process has become increasingly important. With the development of deep learning (DL) and effectively labeled datasets, image captioning approaches have advanced rapidly. In this context, this study designs an Intelligent Image Captioning Generator (IICG) model. The proposed IICG technique encompasses several preprocessing stages applied to the image captions, namely removal of punctuation marks, removal of single-letter characters, removal of numerals, and text vectorization. Besides, the DL-based DenseNet121 model is employed to extract features from the images. Then, the image captioning process is carried out by a Multi-Head Attention Transformer model, which consists of multiple encoders and decoders. The performance of the presented technique is validated on the Flickr8k dataset. A detailed comparative analysis is made, and the experimental outcomes demonstrate the superior performance of the proposed model in terms of the Bilingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Metric for Evaluation of Translation with Explicit ORdering (METEOR), and Consensus-based Image Description Evaluation (CIDEr) scores.
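To make the pipeline described above concrete, the following is a minimal Python/TensorFlow sketch of the caption preprocessing, text vectorization, DenseNet121 feature extraction, and a single multi-head cross-attention step. It is an illustration only, not the authors' implementation: the startseq/endseq markers, 5,000-word vocabulary, sequence length of 30, model width of 256, and 8 attention heads are assumed, illustrative choices rather than values taken from the paper.

import string
import tensorflow as tf

# Caption preprocessing as described: lowercase, remove punctuation,
# drop single-letter words and numerals, then add sequence markers
# ("startseq"/"endseq" are assumed, illustrative tokens).
def preprocess_caption(caption: str) -> str:
    caption = caption.lower().translate(str.maketrans("", "", string.punctuation))
    words = [w for w in caption.split() if len(w) > 1 and w.isalpha()]
    return "startseq " + " ".join(words) + " endseq"

captions = [preprocess_caption(c) for c in
            ["A dog runs across the 2 fields!", "Two children play in a park."]]

# Text vectorization: cleaned captions -> padded integer sequences.
vectorizer = tf.keras.layers.TextVectorization(max_tokens=5000, output_sequence_length=30)
vectorizer.adapt(captions)
caption_ids = vectorizer(captions)                                   # (2, 30)

# Image feature extraction with a pretrained DenseNet121 backbone
# (downloads ImageNet weights on first use); the 7x7x1024 feature map
# is flattened into 49 spatial "tokens" for the transformer.
backbone = tf.keras.applications.DenseNet121(include_top=False, weights="imagenet")
image = tf.random.uniform((1, 224, 224, 3), maxval=255.0)            # stand-in for a Flickr8k image
feature_map = backbone(tf.keras.applications.densenet.preprocess_input(image))  # (1, 7, 7, 1024)
image_tokens = tf.reshape(feature_map, (1, 49, 1024))

# One multi-head cross-attention step of a transformer decoder:
# caption embeddings (queries) attend over the image tokens (keys/values).
d_model = 256
image_proj = tf.keras.layers.Dense(d_model)(image_tokens)
word_emb = tf.keras.layers.Embedding(input_dim=5000, output_dim=d_model)(caption_ids[:1])
mha = tf.keras.layers.MultiHeadAttention(num_heads=8, key_dim=d_model // 8)
attended = mha(query=word_emb, value=image_proj)                     # (1, 30, 256)
print(caption_ids.shape, image_tokens.shape, attended.shape)

In the full model, such attended representations would pass through the remaining encoder and decoder layers and a softmax over the vocabulary to generate the caption word by word, with the network trained using the Adam optimizer noted in the keywords.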
Keywords
Image captioning, Deep learning, Adam optimizer, DenseNet121, Flickr 8k dataset