Evaluating Vision-LLMs on Raven's Progressive Matrices: A Systematic Assessment
Introduction
Recent advances in Vision-LLMs (VLMs) have contributed significantly to the AI field, with impressive capabilities across diverse vision-language tasks. Visual deductive reasoning, however, epitomized by Raven's Progressive Matrices (RPMs), remains a challenging frontier. This paper presents a comprehensive evaluation of current state-of-the-art VLMs on RPM problems, revealing significant insights into their capabilities and limitations.
Evaluation Framework
Our evaluation encompassed several leading VLMs, including GPT-4V and Gemini Pro, across three different datasets: Mensa IQ test, IntelligenceTest, and RAVEN. These datasets were chosen for their complexity and diversity, providing a robust platform to assess the VLMs’ abilities in visual deductive reasoning. We employed standard inference-time strategies such as in-context learning and self-consistency to probe their potential further.
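Of the inference-time strategies above, self-consistency is the simplest to state precisely: sample the same query several times at nonzero temperature and take a majority vote over the answers. The snippet below is a minimal sketch of that aggregation step; the sampled answers shown are hypothetical, not output from any particular model.

```python
from collections import Counter

def self_consistency_vote(sampled_answers):
    """Majority vote over answers sampled repeatedly for one query.

    Ties break in favor of the answer seen first, because
    Counter.most_common sorts stably by count.
    """
    return Counter(sampled_answers).most_common(1)[0][0]

# Hypothetical answers from five samples of one multiple-choice RPM item:
samples = ["C", "E", "C", "C", "A"]
print(self_consistency_vote(samples))  # prints C
```

In practice each element of `samples` would come from one VLM call on the same RPM image and prompt.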
Insights from the Benchmarks
The results show accuracies comparable to random guessing, suggesting that despite recent advances, VLMs' proficiency in complex visual deductive reasoning still lags behind their performance on text-based reasoning tasks. Moreover, in-context learning and self-consistency, both effective strategies for LLMs, do not translate seamlessly to solving RPMs, indicating a significant opportunity for future research and model enhancement in this area.
Performance Bottlenecks
Our detailed analysis pinpointed perception as a critical bottleneck: VLMs struggle to accurately perceive and describe the abstract patterns within RPMs. This weakness is exacerbated by compounding and confounding errors, which further degrade the models' descriptions of the patterns. Conversely, when provided with oracle text descriptions, or when tasked with reasoning over correct descriptions, VLMs performed markedly better, suggesting that enhancing perception and reasoning capabilities could significantly boost their effectiveness in visual deductive reasoning tasks.
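The oracle-description experiment amounts to decomposing the pipeline into a perception stage and a reasoning stage and swapping the perception output for ground truth. The sketch below illustrates that harness; the `describe` and `reason` callables and the toy items are hypothetical stand-ins for a VLM's two stages, not the paper's actual setup.

```python
def evaluate(items, describe, reason, use_oracle=False):
    """Accuracy over RPM items; optionally bypass perception by
    feeding the oracle (ground-truth) description to the reasoner."""
    correct = 0
    for item in items:
        desc = item["oracle_desc"] if use_oracle else describe(item["image"])
        if reason(desc) == item["answer"]:
            correct += 1
    return correct / len(items)

# Toy stand-ins: perception fails to recover the rule, reasoning is exact.
items = [
    {"image": "img1", "oracle_desc": "each row adds one dot", "answer": "C"},
    {"image": "img2", "oracle_desc": "shapes rotate 90 degrees", "answer": "F"},
]
describe = lambda img: "unclear pattern"            # flawed perception
reason = {"each row adds one dot": "C",
          "shapes rotate 90 degrees": "F"}.get      # exact reasoning
print(evaluate(items, describe, reason, use_oracle=False))  # 0.0
print(evaluate(items, describe, reason, use_oracle=True))   # 1.0
```

The gap between the two scores isolates how much of the end-to-end error is attributable to perception rather than reasoning.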
Influence of Prompting Structure
We also examined the impact of prompt structure on model predictions. Altering the order of task instructions and images caused considerable fluctuation in performance. In particular, prompts that clearly delineate the text from the images improved the models' comprehension, underscoring the importance of prompt design in maximizing VLM performance.
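The ordering effect can be made concrete by comparing two layouts of the same multimodal prompt. The helper below is a sketch using textual image placeholders (real multimodal APIs interleave actual image payloads; the function name and placeholder format are assumptions for illustration only).

```python
def build_prompt(instruction, image_ids, delineated=True):
    """Assemble a multimodal prompt as an ordered list of parts.

    delineated=True puts the full instruction first, followed by a
    clearly labeled image section; delineated=False interleaves the
    instruction with each image, mixing text and images together.
    """
    images = [f"<image:{i}>" for i in image_ids]
    if delineated:
        return [instruction, "Images, in order:"] + images
    # Interleaved layout: repeat the instruction before each image.
    parts = []
    for img in images:
        parts += [instruction, img]
    return parts

print(build_prompt("Pick the missing panel.", ["context", "choices"]))
```

Holding the content fixed and varying only the layout, as above, is one way to measure how sensitive a given VLM is to prompt structure.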
Future Directions
Our findings underscore the necessity for ongoing research to address the identified limitations in VLMs, particularly in improving their perceptual and reasoning capabilities. Further exploration into structured prompting, contrastive learning, and reinforcement learning algorithms could offer pathways to advancing VLMs' proficiency in visual deductive reasoning, bringing us closer to achieving human-like understanding and reasoning in AI systems.
Conclusion
This systematic evaluation reveals substantial gaps in current VLMs' abilities to tackle complex visual deductive reasoning tasks. While the models excel in various vision-language tasks, RPMs pose unique challenges that necessitate further innovation and research. Our paper not only benchmarks current capabilities but also sets a foundation for future advancements in AI's visual reasoning domain.