Introduction
LLMs have made significant progress in multi-hop reasoning, particularly mathematical reasoning. However, most existing benchmarks are text-only, overlooking tasks that require jointly understanding text and images. Geometry is a natural testbed here: its problems typically pair a textual description with a visual diagram.
Evaluation of Vision-Language Models (VLMs)
To evaluate VLMs systematically, the authors created GeomVerse, a synthetic dataset of geometry problems whose difficulty is controlled along multiple parameters, such as the depth of the reasoning chain. This design enables a fine-grained analysis of model capabilities in geometric reasoning, and because the skills it probes are not specific to geometry, the benchmark is intended to reveal general reasoning abilities that carry over to other text-and-image reasoning challenges.
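To make the depth parameter concrete, below is a minimal sketch, not the authors' actual generation pipeline, of how a depth-k problem can be built by chaining shapes so that a quantity derived from one shape becomes a given for the next. All function and variable names here are hypothetical, and the example uses only squares for simplicity.

```python
import math
import random

def depth_k_problem(depth: int, add_distractor: bool = False) -> dict:
    """Build a chained geometry word problem (illustrative sketch only).
    Each step derives a square's side from an earlier square, so answering
    requires `depth` dependent reasoning steps."""
    area = float(random.randint(2, 10) ** 2)
    text = [f"Square S1 has an area of {area:.0f}."]
    side = math.sqrt(area)  # step 1: recover the side from the given area
    for i in range(2, depth + 1):
        # Each later square's side depends on the previous one, so its area
        # can only be computed after resolving all earlier steps.
        side = side + 2
        text.append(f"The side of square S{i} is 2 longer than the side of S{i-1}.")
    if add_distractor:
        # Irrelevant shape: mentioned in the text but never needed for the answer.
        text.append("A circle C1 has a radius of 3.")
    text.append(f"What is the area of square S{depth}?")
    return {"question": " ".join(text), "answer": side ** 2}

print(depth_k_problem(depth=3, add_distractor=True))
```

In this sketch, the depth argument and the distractor flag correspond to two of the difficulty parameters discussed in the paper; the actual dataset also varies the image representation and pairs each problem with a rendered diagram, as noted in the findings below.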
Key Findings
The empirical evidence suggests that state-of-the-art VLMs are not as adept at geometric reasoning as previous benchmarks have implied. They particularly struggle with higher-depth problems, where long chains of reasoning and many computational steps are required. Although the models exhibit some robustness to variations in image representation, they are vulnerable to distractors, additional but irrelevant information, which cause a significant drop in performance.
Further Research
The release of the GeomVerse dataset is intended to stimulate further work in this domain, with the hope of closing the identified gaps in VLM capabilities. The paper's insights also highlight the importance of training models on complete step-by-step solutions and on out-of-distribution examples, which matters for real-world applications such as building better AI tutors and other educational tools.