How Far Are We from Intelligent Visual Deductive Reasoning? (2403.04732v3)

Published 7 Mar 2024 in cs.AI, cs.CL, and cs.CV

Abstract: Vision-Language Models (VLMs) have recently demonstrated incredible strides on diverse vision-language tasks. We dig into vision-based deductive reasoning, a more sophisticated but less explored realm, and find previously unexposed blindspots in the current SOTA VLMs. Specifically, we leverage Raven's Progressive Matrices (RPMs) to assess VLMs' abilities to perform multi-hop relational and deductive reasoning relying solely on visual clues. We perform comprehensive evaluations of several popular VLMs employing standard strategies such as in-context learning, self-consistency, and Chain-of-Thought (CoT) on three diverse datasets, including the Mensa IQ test, IntelligenceTest, and RAVEN. The results reveal that despite the impressive capabilities of LLMs in text-based reasoning, we are still far from achieving comparable proficiency in visual deductive reasoning. We found that certain standard strategies that are effective when applied to LLMs do not seamlessly translate to the challenges presented by visual reasoning tasks. A detailed analysis reveals that VLMs struggle to solve these tasks mainly because they are unable to perceive and comprehend multiple, confounding abstract patterns in RPM examples.

Evaluating Vision-Language Models on Raven's Progressive Matrices: A Systematic Assessment

Introduction

Recent advances in Vision-Language Models (VLMs) have significantly contributed to the AI field, showcasing impressive capabilities across diverse vision-language tasks. However, visual deductive reasoning, epitomized by Raven's Progressive Matrices (RPMs), remains a challenging frontier. This paper presents a comprehensive evaluation of current state-of-the-art VLMs on RPM problems, revealing significant insights into their capabilities and limitations.

Evaluation Framework

Our evaluation covered several leading VLMs, including GPT-4V and Gemini Pro, across three datasets: the Mensa IQ test, IntelligenceTest, and RAVEN. These datasets were chosen for their complexity and diversity, providing a robust platform to assess the VLMs' abilities in visual deductive reasoning. We employed standard inference-time strategies such as in-context learning, self-consistency, and Chain-of-Thought prompting to probe their potential further.
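
To make the setup concrete, here is a minimal sketch of what such an evaluation loop might look like. The `RPMProblem` structure and the `query_vlm` client are illustrative stand-ins, not the paper's harness; `query_vlm` abstracts over whichever VLM API (GPT-4V, Gemini Pro, etc.) is under test.

```python
# Minimal sketch of an RPM evaluation loop; all names are illustrative.
from dataclasses import dataclass

@dataclass
class RPMProblem:
    image_path: str     # rendered 3x3 matrix with the last cell missing
    choices: list[str]  # candidate answer labels, e.g. ["A", ..., "H"]
    answer: str         # ground-truth label

def query_vlm(prompt: str, image_path: str) -> str:
    """Placeholder for a real VLM call (GPT-4V, Gemini Pro, ...)."""
    raise NotImplementedError("wire up an actual VLM client here")

def evaluate(problems: list[RPMProblem], instruction: str) -> float:
    """Return the VLM's accuracy over a set of RPM problems."""
    correct = 0
    for p in problems:
        prompt = f"{instruction}\nOptions: {', '.join(p.choices)}"
        prediction = query_vlm(prompt, p.image_path).strip()
        correct += prediction == p.answer
    return correct / len(problems)
```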

Insights from the Benchmarks

The results, with accuracies hovering near random guessing, show that despite rapid progress in VLMs, their proficiency in complex visual deductive reasoning still falls well short of their performance on simpler text-based reasoning tasks. It also became evident that in-context learning and self-consistency, strategies that are effective with LLMs, do not translate seamlessly to solving RPMs, indicating a significant opportunity for future research and model improvement in this area.
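
For reference, the self-consistency strategy ported from the text-only setting amounts to sampling several answers and taking a majority vote. The sketch below reuses the hypothetical `query_vlm` client from the earlier sketch and assumes it decodes stochastically (e.g., temperature above zero); it is not the paper's own code.

```python
from collections import Counter

def self_consistent_answer(prompt: str, image_path: str,
                           num_samples: int = 5) -> str:
    """Sample several reasoning paths and return the modal answer."""
    votes = [query_vlm(prompt, image_path).strip()
             for _ in range(num_samples)]
    return Counter(votes).most_common(1)[0][0]
```

That this aggregation helps little on RPMs is consistent with the perception bottleneck discussed next: when the model misreads the matrix, most sampled paths inherit the same error, so the vote converges on a wrong answer.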

Performance Bottlenecks

Our detailed analysis pinpointed perception as the critical bottleneck: VLMs struggle to accurately perceive and describe the abstract patterns within RPMs, and errors at this stage compound and confound the descriptions the models produce. Conversely, when provided with oracle text descriptions, i.e., when reasoning over correct descriptions of the matrix, the models performed markedly better. This suggests that improving perception, and the reasoning that builds on it, could significantly boost effectiveness on visual deductive reasoning tasks.
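
The oracle-description experiment amounts to decoupling the pipeline into a perception stage and a reasoning stage so each can be probed separately. A hedged sketch of that decomposition follows, with illustrative prompts and a hypothetical text-only `query_llm` client alongside the `query_vlm` stub from before.

```python
def query_llm(prompt: str) -> str:
    """Placeholder for a text-only LLM call; illustrative only."""
    raise NotImplementedError("wire up an actual LLM client here")

def describe_patterns(image_path: str) -> str:
    """Stage 1 (perception): have the VLM verbalize the abstract patterns."""
    return query_vlm(
        "List the entities in each cell and the rules that vary "
        "across rows and columns of this 3x3 matrix.",
        image_path,
    )

def answer_from_description(description: str, choices: list[str]) -> str:
    """Stage 2 (reasoning): answer from text alone. Substituting an oracle
    (human-written) description here isolates the reasoning stage."""
    prompt = (f"Patterns observed:\n{description}\n"
              f"Which option completes the matrix? "
              f"Options: {', '.join(choices)}")
    return query_llm(prompt).strip()
```

Comparing accuracy with model-generated versus oracle descriptions is what exposes perception, rather than reasoning, as the dominant failure mode.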

Influence of Prompting Structure

We also scrutinized the impact of prompt structure on model predictions. Altering the order of task instructions and images led to considerable fluctuations in performance. In particular, structuring prompts so that text and images are clearly delineated improved the models' comprehension, underscoring the importance of prompt design in maximizing VLM performance.
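
To illustrate the kind of variation tested, the sketch below builds the same request in two orderings, one of them with explicit markers separating the instruction from the image. The message format and delimiter strings are illustrative assumptions, not a specific API's schema or the paper's exact templates.

```python
INSTRUCTION = "Select the option (A-H) that completes the 3x3 matrix."

def prompt_image_first(image_path: str) -> list[dict]:
    """Image sent first, instruction after, no explicit delimiters."""
    return [{"type": "image", "path": image_path},
            {"type": "text", "text": INSTRUCTION}]

def prompt_delimited(image_path: str) -> list[dict]:
    """Instruction first, with markers that clearly delineate the text
    from the image; the paper reports clearer separation helps."""
    return [{"type": "text", "text": f"### Task\n{INSTRUCTION}\n### Image"},
            {"type": "image", "path": image_path}]
```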

Future Directions

Our findings underscore the necessity for ongoing research to address the identified limitations in VLMs, particularly in improving their perceptual and reasoning capabilities. Further exploration into structured prompting, contrastive learning, and reinforcement learning algorithms could offer pathways to advancing VLMs' proficiency in visual deductive reasoning, bringing us closer to achieving human-like understanding and reasoning in AI systems.

Conclusion

This systematic evaluation reveals substantial gaps in current VLMs' abilities to tackle complex visual deductive reasoning tasks. While the models excel in various vision-language tasks, RPMs pose unique challenges that necessitate further innovation and research. Our paper not only benchmarks current capabilities but also sets a foundation for future advancements in AI's visual reasoning domain.

Authors
  1. Yizhe Zhang
  2. He Bai
  3. Ruixiang Zhang
  4. Jiatao Gu
  5. Shuangfei Zhai
  6. Josh Susskind
  7. Navdeep Jaitly