
II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering (2402.11058v3)

Published 16 Feb 2024 in cs.CV and cs.CL

Abstract: Visual Question Answering (VQA) often involves diverse reasoning scenarios across Vision and Language (V&L). Most prior VQA studies, however, have merely focused on assessing the model's overall accuracy without evaluating it on different reasoning cases. Furthermore, some recent works observe that conventional Chain-of-Thought (CoT) prompting fails to generate effective reasoning for VQA, especially for complex scenarios requiring multi-hop reasoning. In this paper, we propose II-MMR, a novel idea to identify and improve multi-modal multi-hop reasoning in VQA. Specifically, II-MMR takes a VQA question with an image and finds a reasoning path to reach its answer using two novel language prompts: (i) an answer prediction-guided CoT prompt, or (ii) a knowledge triplet-guided prompt. II-MMR then analyzes this path to identify different reasoning cases in current VQA benchmarks by estimating how many hops and what types (i.e., visual or beyond-visual) of reasoning are required to answer the question. On popular benchmarks including GQA and A-OKVQA, II-MMR observes that most of their VQA questions are easy to answer, simply demanding "single-hop" reasoning, whereas only a few questions require "multi-hop" reasoning. Moreover, while recent V&L models struggle with such complex multi-hop reasoning questions even when using the traditional CoT method, II-MMR shows its effectiveness across all reasoning cases in both zero-shot and fine-tuning settings.
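
To make the described pipeline concrete, the following is a minimal Python sketch of the two-stage idea in the abstract: elicit a reasoning path from a vision-language model with an answer prediction-guided CoT prompt, split it into hops, and label each hop as visual or beyond-visual. The prompt wording, the line-per-hop segmentation, the keyword heuristic, and the vl_model / toy_model callables are illustrative assumptions, not the paper's implementation.

"""Minimal sketch of the II-MMR idea described in the abstract.

Assumptions (not from the paper): prompt wording, hop segmentation,
and the keyword heuristic for visual vs. beyond-visual hops are
placeholders; `vl_model` stands in for any V&L model callable.
"""
from typing import Callable, List, Tuple

# (i) answer prediction-guided CoT prompt: first predict an answer,
#     then ask the model for the reasoning path that reaches it.
ANSWER_GUIDED_COT = (
    "Question: {question}\n"
    "The answer is {predicted_answer}. "
    "Explain, step by step, the reasoning path from the image to this answer."
)

# (ii) knowledge triplet-guided prompt (alternative prompting, shown for reference).
TRIPLET_PROMPT = (
    "Question: {question}\n"
    "List the knowledge triplets (subject, relation, object) needed to answer it."
)

# Illustrative cue words suggesting a hop relies on beyond-visual knowledge.
BEYOND_VISUAL_CUES = {"know", "because", "usually", "commonsense", "typically"}


def reasoning_path(vl_model: Callable[[str, bytes], str],
                   image: bytes, question: str,
                   predicted_answer: str) -> List[str]:
    """Elicit a reasoning path with the answer-guided CoT prompt and split it into hops."""
    prompt = ANSWER_GUIDED_COT.format(question=question,
                                      predicted_answer=predicted_answer)
    raw = vl_model(prompt, image)
    # One hop per line; the exact segmentation rule is an assumption.
    return [step.strip() for step in raw.split("\n") if step.strip()]


def classify_hops(hops: List[str]) -> Tuple[int, List[str]]:
    """Estimate hop count and label each hop as 'visual' or 'beyond-visual'."""
    labels = []
    for hop in hops:
        text = hop.lower()
        kind = "beyond-visual" if any(cue in text for cue in BEYOND_VISUAL_CUES) else "visual"
        labels.append(kind)
    return len(hops), labels


if __name__ == "__main__":
    # Toy stand-in model returning a canned two-hop path, just to exercise the parsing.
    def toy_model(prompt: str, image: bytes) -> str:
        return ("The man is holding a racket.\n"
                "People usually hold rackets to play tennis, so he is playing tennis.")

    hops = reasoning_path(toy_model, b"", "What sport is the man playing?", "tennis")
    n_hops, labels = classify_hops(hops)
    print(n_hops, labels)  # -> 2 ['visual', 'beyond-visual']

A question whose path reduces to a single visual hop would be bucketed as "single-hop", while paths mixing visual and beyond-visual hops correspond to the harder "multi-hop" cases the abstract highlights.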

References (34)
  1. Don’t just assume; look and answer: Overcoming priors for visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4971–4980.
  2. VQA: Visual question answering. In Proceedings of ICCV.
  3. Investigating prompting techniques for zero- and few-shot visual question answering. arXiv preprint arXiv:2306.09996.
  4. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794.
  5. Measuring and improving chain-of-thought reasoning in vision-language models. arXiv preprint arXiv:2309.04461.
  6. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
  7. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
  8. Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495.
  9. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369.
  10. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of CVPR.
  11. Drew A Hudson and Christopher D Manning. 2019. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709.
  12. Active retrieval augmented generation. In EMNLP.
  13. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213.
  14. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73.
  15. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In ICML.
  16. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975.
  17. Vqa-e: Explaining, elaborating, and enhancing your answers for visual questions. In Proceedings of the European Conference on Computer Vision (ECCV), pages 552–567.
  18. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744.
  19. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3195–3204.
  20. OpenAI. 2023. GPT-4V(ision) technical work and authors.
  21. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Proceedings of the fourth workshop on vision and language, pages 70–80.
  22. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer.
  23. Coarse-to-fine contrastive learning in image-text-graph space for improved vision-language compositionality. In EMNLP.
  24. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.
  25. Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of EMNLP.
  26. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  27. Towards reasoning-aware explainable vqa. In NeurIPS Workshop on Trustworthy and Socially Responsible Machine Learning (TSRML).
  28. Git: A generative image-to-text transformer for vision and language. Transactions on Machine Learning Research (TMLR).
  29. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
  30. Improving vqa and its explanations by comparing competing explanations. arXiv preprint arXiv:2006.15631.
  31. Jialin Wu and Raymond Mooney. 2019. Self-critical reasoning for robust visual question answering. Advances in Neural Information Processing Systems, 32.
  32. Visual clues: Bridging vision and language foundations for image paragraph captioning. Advances in Neural Information Processing Systems, 35:17287–17300.
  33. Star: Self-taught reasoner bootstrapping reasoning with reasoning. In NeurIPS.
  34. Prototype-based embedding network for scene graph generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22783–22792.
Authors (4)
  1. Jihyung Kil (10 papers)
  2. Farideh Tavazoee (2 papers)
  3. Dongyeop Kang (72 papers)
  4. Joo-Kyung Kim (12 papers)