Introduction
LLMs have made significant progress in multi-hop reasoning, particularly mathematical reasoning. However, most existing benchmarks are text-only, overlooking tasks that require jointly understanding text and images. Geometry is a natural testbed here: its problems typically pair a textual description with a visual diagram.
Evaluation of Vision-Language Models (VLMs)
To evaluate VLMs systematically, the authors created GeomVerse, a synthetic dataset of geometry problems whose difficulty is controlled along multiple parameters, such as the depth of the reasoning chain. This design enables a fine-grained analysis of model capabilities in geometric reasoning, and because the skills it probes are not specific to geometry, the benchmark is intended to reveal general reasoning abilities that carry over to other text-and-image reasoning challenges.
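To make the depth parameter concrete, below is a minimal sketch, not the authors' actual generation pipeline, of how a depth-k problem can be built by chaining shapes so that a quantity derived from one shape becomes a given for the next. All function and variable names here are hypothetical, and the example uses only squares for simplicity.

```python
import math
import random

def depth_k_problem(depth: int, add_distractor: bool = False) -> dict:
    """Build a chained geometry word problem (illustrative sketch only).
    Each step derives a square's side from an earlier square, so answering
    requires `depth` dependent reasoning steps."""
    area = float(random.randint(2, 10) ** 2)
    text = [f"Square S1 has an area of {area:.0f}."]
    side = math.sqrt(area)  # step 1: recover the side from the given area
    for i in range(2, depth + 1):
        # Each later square's side depends on the previous one, so its area
        # can only be computed after resolving all earlier steps.
        side = side + 2
        text.append(f"The side of square S{i} is 2 longer than the side of S{i-1}.")
    if add_distractor:
        # Irrelevant shape: mentioned in the text but never needed for the answer.
        text.append("A circle C1 has a radius of 3.")
    text.append(f"What is the area of square S{depth}?")
    return {"question": " ".join(text), "answer": side ** 2}

print(depth_k_problem(depth=3, add_distractor=True))
```

In this sketch, the depth argument and the distractor flag correspond to two of the difficulty parameters discussed in the paper; the actual dataset also varies the image representation and pairs each problem with a rendered diagram, as noted in the findings below.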
Key Findings
The empirical evidence suggests that state-of-the-art VLMs are not as adept at geometric reasoning as previous benchmarks have implied. They particularly struggle with higher-depth problems, where long chains of reasoning and many computational steps are required. Although the models exhibit some robustness to variations in image representation, they are vulnerable to distractors, additional but irrelevant information, which cause a significant drop in performance.
Further Research
The release of the GeomVerse dataset is intended to stimulate further work in this domain, with the hope of closing the identified gaps in VLM capabilities. The paper's insights also highlight the importance of training models on complete step-by-step solutions and on out-of-distribution examples, which matters for real-world applications such as building better AI tutors and other educational tools.