Good Questions Help Zero-Shot Image Reasoning (2312.01598v2)

Published 4 Dec 2023 in cs.CV

Abstract: Aligning recent LLMs with computer vision models yields large vision-language models (LVLMs), which have paved the way for zero-shot image reasoning tasks. However, LVLMs are usually trained only on short, high-level captions that refer to sparse focal regions of an image. Such "tunnel vision" keeps LVLMs from exploring other relevant contexts in complex scenes. To address this challenge, we introduce Question-Driven Visual Exploration (QVix), a novel prompting strategy that enhances the exploratory capabilities of LVLMs in zero-shot reasoning tasks. QVix leverages the strong language prior of LLMs to generate input-exploratory questions that carry more detail than the original query, guiding LVLMs to explore visual content more comprehensively and uncover subtle or peripheral details. QVix thereby enables a wider exploration of visual scenes, improving LVLMs' reasoning accuracy and depth in tasks such as visual question answering and visual entailment. Our evaluations on various challenging zero-shot vision-language benchmarks, including ScienceQA and fine-grained visual classification, demonstrate that QVix significantly outperforms existing methods, highlighting its effectiveness in bridging the gap between complex visual data and LVLMs' exploratory abilities.
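
The abstract describes a two-stage pipeline: an LLM first expands the original query into a set of input-exploratory questions, and an LVLM then answers those questions against the image before producing a final answer. The Python sketch below is an illustrative reconstruction of that flow from the abstract alone, not the authors' implementation; call_llm and call_lvlm are hypothetical stand-ins for whatever text-only LLM and image-conditioned LVLM backends are used.

    """Minimal sketch of Question-Driven Visual Exploration (QVix) prompting.

    Reconstructed from the paper's abstract; `call_llm` and `call_lvlm` are
    hypothetical placeholders, not APIs from the paper's code.
    """

    def call_llm(prompt: str) -> str:
        # Hypothetical text-only LLM call; replace with a real backend.
        return "1. What objects appear in the background?\n2. What colors dominate the scene?"

    def call_lvlm(image_path: str, prompt: str) -> str:
        # Hypothetical LVLM call over (image, text); replace with a real backend.
        return "stub answer"

    def qvix_answer(image_path: str, query: str, n_questions: int = 5) -> str:
        # Stage 1: use the LLM's language prior to expand the query into
        # exploratory questions probing details beyond the query's focus.
        expansion_prompt = (
            f'A model must answer this question about an image: "{query}"\n'
            f"Write {n_questions} short, concrete questions about visual details "
            "(objects, attributes, relations, background) that would help."
        )
        questions = [
            line.split(".", 1)[-1].strip()
            for line in call_llm(expansion_prompt).splitlines()
            if line.strip()
        ]

        # Stage 2: have the LVLM answer each exploratory question, steering it
        # beyond the sparse focus regions its captioning pretraining favors.
        clues = [f"Q: {q}\nA: {call_lvlm(image_path, q)}" for q in questions]

        # Final step: answer the original query conditioned on the gathered clues.
        final_prompt = (
            "Use these visual clues to answer the question.\n\n"
            + "\n\n".join(clues)
            + f"\n\nQuestion: {query}\nAnswer:"
        )
        return call_lvlm(image_path, final_prompt)

    if __name__ == "__main__":
        print(qvix_answer("example.jpg", "Is this flower a species of iris?"))

Note that in this reading the question generator never sees the image: the division of labor mirrors the abstract's claim that a strong language prior alone can guide the LVLM's visual exploration.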

Authors (7)
  1. Kaiwen Yang (6 papers)
  2. Tao Shen (87 papers)
  3. Xinmei Tian (50 papers)
  4. Xiubo Geng (36 papers)
  5. Chongyang Tao (61 papers)
  6. Dacheng Tao (826 papers)
  7. Tianyi Zhou (172 papers)
Citations (5)