
IQBench: How "Smart" Are Vision-Language Models? A Study with Human IQ Tests (2505.12000v1)

Published 17 May 2025 in cs.CV

Abstract: Although large Vision-Language Models (VLMs) have demonstrated remarkable performance in a wide range of multimodal tasks, their true reasoning capabilities on human IQ tests remain underexplored. To advance research on the fluid intelligence of VLMs, we introduce IQBench, a new benchmark designed to evaluate VLMs on standardized visual IQ tests. We focus on evaluating the reasoning capabilities of VLMs, which we argue are more important than the accuracy of the final prediction. Our benchmark is visually centric, minimizing the dependence on unnecessary textual content, thus encouraging models to derive answers primarily from image-based information rather than learned textual knowledge. To this end, we manually collected and annotated 500 visual IQ questions to prevent unintentional data leakage during training. Unlike prior work that focuses primarily on the accuracy of the final answer, we evaluate the reasoning ability of the models by assessing their explanations and the patterns used to solve each problem, along with the accuracy of the final prediction and human evaluation. Our experiments show that there are substantial performance disparities between tasks, with models such as o4-mini, gemini-2.5-flash, and claude-3.7-sonnet achieving the highest average accuracies of 0.615, 0.578, and 0.548, respectively. However, all models struggle with 3D spatial and anagram reasoning tasks, highlighting significant limitations in current VLMs' general reasoning abilities. In terms of reasoning scores, o4-mini, gemini-2.5-flash, and claude-3.7-sonnet achieved top averages of 0.696, 0.586, and 0.516, respectively. These results highlight inconsistencies between the reasoning processes of the models and their final answers, emphasizing the importance of evaluating the accuracy of the reasoning in addition to the final predictions.

Summary

Evaluating Vision-Language Models: Insights from the IQBench Study

The paper introduces IQBench, a novel benchmark for evaluating Vision-Language Models (VLMs) on standardized human IQ tests that target fluid intelligence. Its central objective is not just to gauge final-answer accuracy but to examine the reasoning processes these models use, making it a meaningful contribution to the study of reasoning in multimodal AI systems.

IQBench Characterization and Methodology

IQBench comprises 500 manually curated visual IQ questions spanning multiple cognitive domains, including pattern recognition, analogical reasoning, spatial reasoning, and visual arithmetic. The core design principle is to minimize textual dependency so that VLMs must derive their conclusions primarily from the visual input. The evaluation framework uses two metrics:

  • Accuracy Score: Traditional exact match metric for final predictions.
  • Reasoning Score: Assessed via an LLM-as-judge approach, where pre-trained models evaluate the coherence and correctness of VLMs' reasoning against annotated reasoning patterns.

Through these methods, IQBench seeks to provide a more granular understanding of how VLMs process visual information and rationalize their decisions.
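
To make the dual-metric setup concrete, the following is a minimal sketch of how such an evaluation could be wired up. It is not the paper's implementation: the item fields (prediction, explanation, answer, reasoning_pattern), the judge prompt wording, and the judge_fn callable are all illustrative assumptions.

```python
# Minimal sketch of a dual-metric (accuracy + LLM-as-judge) evaluation.
# Field names and the judge prompt are illustrative assumptions, not the
# paper's actual implementation.

def accuracy_score(prediction: str, answer: str) -> float:
    """Exact-match accuracy for the final prediction."""
    return float(prediction.strip().lower() == answer.strip().lower())

JUDGE_PROMPT = (
    "You are grading a model's reasoning on a visual IQ question.\n"
    "Reference reasoning pattern: {reference}\n"
    "Model reasoning: {candidate}\n"
    "Return a single number between 0 and 1 for how coherent and correct "
    "the model's reasoning is relative to the reference pattern."
)

def reasoning_score(candidate: str, reference: str, judge_fn) -> float:
    """LLM-as-judge score; judge_fn wraps a call to the judge model and is
    expected to return text parsable as a float in [0, 1]."""
    reply = judge_fn(JUDGE_PROMPT.format(reference=reference, candidate=candidate))
    return max(0.0, min(1.0, float(reply)))

def evaluate(items, judge_fn):
    """Average both metrics over annotated items, where each item holds the
    model's prediction/explanation and the gold answer/reasoning pattern."""
    acc = sum(accuracy_score(it["prediction"], it["answer"]) for it in items) / len(items)
    rea = sum(reasoning_score(it["explanation"], it["reasoning_pattern"], judge_fn)
              for it in items) / len(items)
    return {"accuracy": acc, "reasoning": rea}
```

In this sketch, judge_fn would wrap whatever judge model is used and return a numeric score; reporting the two averages separately is what allows the accuracy/reasoning comparisons discussed below.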

Experimental Findings

The research tested multiple cutting-edge VLMs, including o4-mini, gemini-2.5-flash, and claude-3.7-sonnet. Across the tasks, o4-mini showed the strongest overall performance, with an average accuracy of 0.615 and a reasoning score of 0.696; gemini-2.5-flash followed at 0.578 and 0.586, and claude-3.7-sonnet at 0.548 and 0.516. Notably, all models showed marked difficulty with 3D spatial and anagram reasoning tasks, indicating significant limitations in their general reasoning abilities.

The two evaluation dimensions did not always agree. Reasoning scores were often higher than accuracy scores, meaning a model could articulate a plausible solution pattern yet still arrive at the wrong final answer; for claude-3.7-sonnet the gap runs the other way (0.516 reasoning versus 0.548 accuracy). Both cases expose a disconnect between the robustness of the reasoning process and the accuracy of the final decision.

Implications and Speculative Development

The findings of IQBench underscore critical areas for improvement in VLM architecture, particularly in tasks that involve detailed spatial manipulation and linguistic reasoning. The paper implies that models might need enhanced architectures that integrate reasoning processes more effectively with visual inputs to overcome present shortcomings. Additionally, this benchmark opens avenues for further exploration into reasoning transparency and cognitive alignment in developing AGI.

IQBench's emphasis on reasoning transparency aligns with future research directions to refine model capabilities in multimodal processing, ultimately advancing AI towards systems that approach human-like reasoning. The dual focus on accuracy and reasoning offers a deeper insight into cognitive dynamics, which can inform the design of models with improved interpretability and stronger alignment between explanation and conclusion.

Conclusion

IQBench sets a notable precedent for evaluating reasoning in Vision-Language Models beyond mere answer accuracy. The paper highlights the limitations of several VLMs, characterizes their reasoning capacities, and identifies areas that demand further academic attention. As AI research progresses, the principles embedded in IQBench's evaluation methodology could guide the development of more sophisticated models, inching closer to AGI with enhanced reasoning capabilities. The benchmark itself paves the way for deeper inquiries into the cognitive performance metrics that matter for real-world applications of VLMs across diverse domains.
