Transformative Fusion: Vision Transformers and GPT-2 Unleashing New Frontiers in Image Captioning within Image Processing

Indrani Vasireddy; G.HimaBindu; Ratnamala.B

doi:10.55524/ijirem.2023.10.6.8

Abstract

In the ever-evolving digital landscape, this paper presents an Image Caption Generator that combines Vision Transformers (ViT) with GPT-2. Bringing together computer vision and natural language processing (NLP), the system extracts salient image features using ViT and generates contextual, human-like descriptions through GPT-2. The resulting system provides an intuitive interface through which users upload an image and receive a coherent caption. This technology holds considerable potential for the visually impaired community, improving the accessibility of image-based content and the overall user experience.

The primary objective of this work is to develop a system that automates the generation of descriptive and coherent textual captions for images. This involves integrating computer vision and NLP techniques so that the system can analyze the content of an image and produce relevant, meaningful textual descriptions. The broader goal is to improve the accessibility of visual content, enhance image search capabilities, and facilitate applications such as automated content tagging. Furthermore, the work addresses the needs of visually impaired individuals by providing assistive technology that interprets and communicates image content effectively.

This paper exemplifies the symbiotic relationship between computer vision and NLP, illustrating how their integration can pave the way for transformative AI applications. The resulting synergy not only contributes to the development of advanced image captioning systems but also opens avenues for innovative applications across diverse domains. The conference presentation will delve into the technical aspects of our approach, showcasing the significance of this integration and its potential impact on the future of AI applications.
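
The data flow described in the abstract (an image is encoded into features, then a language model emits a caption one token at a time) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the patch size, the toy vocabulary, and the `toy_scores` function are hypothetical stand-ins. A real ViT would apply learned patch embeddings and self-attention to the patches, and a real GPT-2 would supply the next-token scores. Only the ViT-style patch splitting and the greedy autoregressive decoding loop are shown.

```python
from typing import Callable, List, Sequence

Image = List[List[float]]  # grayscale image as rows of pixel values in [0, 1]


def split_into_patches(image: Image, patch: int) -> List[List[float]]:
    """Flatten non-overlapping patch x patch tiles in row-major order.

    This mirrors the first step of a ViT-style encoder; the learned
    embedding and attention layers are deliberately omitted.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[r][c]
                            for r in range(top, top + patch)
                            for c in range(left, left + patch)])
    return patches


def greedy_caption(features: Sequence[Sequence[float]],
                   next_token_scores: Callable[[Sequence[Sequence[float]], List[int]], List[float]],
                   bos: int, eos: int, max_len: int = 20) -> List[int]:
    """Greedy autoregressive decoding: append the highest-scoring token until EOS."""
    tokens = [bos]
    for _ in range(max_len):
        scores = next_token_scores(features, tokens)
        best = max(range(len(scores)), key=scores.__getitem__)
        tokens.append(best)
        if best == eos:
            break
    return tokens


# Hypothetical toy vocabulary and scorer standing in for GPT-2 (assumption).
VOCAB = ["<bos>", "a", "bright", "dark", "image", "<eos>"]


def toy_scores(features, tokens):
    """Score tokens from mean brightness; a placeholder, not a language model."""
    mean = sum(map(sum, features)) / sum(map(len, features))
    plan = ["a", "bright" if mean > 0.5 else "dark", "image", "<eos>"]
    target = plan[min(len(tokens) - 1, len(plan) - 1)]
    return [1.0 if word == target else 0.0 for word in VOCAB]


image = [[0.9] * 4 for _ in range(4)]          # 4x4 uniformly bright image
features = split_into_patches(image, patch=2)  # four patches of four pixels
ids = greedy_caption(features, toy_scores, bos=0, eos=5)
caption = " ".join(VOCAB[i] for i in ids[1:-1])
print(caption)  # prints "a bright image"
```

In a practical system the two hand-written pieces would be replaced by pretrained models, and greedy selection is often swapped for beam search or sampling; the loop structure stays the same.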

Keywords

Image Caption Generator, Vision Transformers (ViT), GPT-2, Computer Vision, Natural Language Processing

Cite this article as

I. Vasireddy, G.HimaBindu, Ratnamala.B, "Transformative Fusion: Vision Transformers and GPT-2 Unleashing New Frontiers in Image Captioning within Image Processing", International Journal of Innovative Research in Engineering & Management (IJIREM), Vol-10, Issue-6, Page No-51-55, 2023. Available from: https://doi.org/10.55524/ijirem.2023.10.6.8