
Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts (2404.08589v1)

Published 12 Apr 2024 in cs.CV and cs.AI

Abstract: Visual question answering (VQA) is known as an AI-complete task, as it requires understanding, reasoning, and inference over both visual and linguistic content. Over the past few years, numerous neural architectures have been proposed for the VQA problem. However, achieving success in zero-shot VQA remains a challenge because it demands advanced generalization and reasoning skills. This study examines the impact of incorporating image captioning as an intermediary step within the VQA pipeline. Specifically, we explore the efficacy of using image captions instead of images and leveraging LLMs to establish a zero-shot setting. Since image captioning is the most crucial step in this process, we compare the impact of state-of-the-art image captioning models on VQA performance across question types that vary in structure and semantics. We propose a straightforward and efficient question-driven image captioning approach within this pipeline to transfer contextual information into the question-answering (QA) model. This method involves extracting keywords from the question, generating a caption for each image-question pair using those keywords, and incorporating the question-driven caption into the LLM prompt. We evaluate the effectiveness of general-purpose and question-driven image captions in the VQA pipeline. Our study highlights the potential of employing image captions and harnessing the capabilities of LLMs to achieve competitive performance on GQA under the zero-shot setting. Our code is available at https://github.com/ovguyo/captions-in-VQA.
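As a rough illustration of the pipeline described in the abstract, the sketch below wires together the three stages: keyword extraction from the question, question-driven captioning conditioned on those keywords, and zero-shot question answering over the caption with an LLM. The specific models (KeyBERT, BLIP-2, FLAN-T5), the keyword-based caption prefix, and the prompt wording are illustrative assumptions, not necessarily the configuration used in the paper; see the linked repository for the authors' implementation.

```python
# Illustrative sketch of a question-driven captioning pipeline for zero-shot VQA.
# Model choices and prompt wording are assumptions, not the authors' exact setup.
from keybert import KeyBERT
from PIL import Image
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Blip2ForConditionalGeneration,
    Blip2Processor,
)

kw_model = KeyBERT()
cap_processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
cap_model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")
qa_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
qa_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")


def answer(image_path: str, question: str) -> str:
    # 1. Extract keywords from the question (KeyBERT returns (keyword, score) pairs).
    keywords = [kw for kw, _ in kw_model.extract_keywords(question, top_n=2)]

    # 2. Generate a question-driven caption by conditioning the captioner on a
    #    keyword-based prefix, so the caption focuses on question-relevant content.
    image = Image.open(image_path).convert("RGB")
    prefix = f"a photo of {' and '.join(keywords)}"
    cap_inputs = cap_processor(images=image, text=prefix, return_tensors="pt")
    caption_ids = cap_model.generate(**cap_inputs, max_new_tokens=30)
    caption = cap_processor.batch_decode(caption_ids, skip_special_tokens=True)[0]

    # 3. Pass the caption (instead of the image) and the question to the LLM
    #    as a zero-shot QA prompt.
    prompt = (
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        "Answer with a single word or short phrase:"
    )
    qa_inputs = qa_tokenizer(prompt, return_tensors="pt")
    answer_ids = qa_model.generate(**qa_inputs, max_new_tokens=10)
    return qa_tokenizer.decode(answer_ids[0], skip_special_tokens=True)
```

For example, calling answer("example.jpg", "What color is the cat on the sofa?") would first pull keywords such as "cat" and "sofa", bias the caption toward those objects, and then let the LLM answer from the caption text alone.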

Authors (2)
  1. Övgü Özdemir (2 papers)
  2. Erdem Akagündüz (20 papers)
Citations (3)