- The paper introduces a method to decouple the visual understanding of geometry from other cognitive abilities in Vision Language Models (VLMs) by adapting cognitive science stimuli.
- Findings show VLMs underperform both American adults and the Munduruku, an Amazonian indigenous group, in recognizing simple geometric concepts, struggling especially with tasks requiring mental rotation.
- The research highlights the need for VLM frameworks to better incorporate human-like geometric understanding and spatial manipulation capabilities, potentially by integrating diverse learning experiences.
Decoupling Components of Geometric Understanding in Vision LLMs
The paper "Decoupling the components of geometric understanding in Vision LLMs" presents an analytical paper aimed at disentangling the visual comprehension of geometric concepts from other cognitive abilities within Vision LLMs (VLMs). This research reaches into the cognitive science paradigm to isolate visual understanding from intertwined capacities such as reasoning and world knowledge. The authors compare the performance of VLMs, human adults from the United States, and data from prior studies on the Munduruku, an Amazonian indigenous group without formal education, in understanding basic geometric concepts.
The central question is whether existing VLMs can understand simple geometric concepts visually, independently of formal education or reasoning ability. The innovation lies in adapting stimuli from cognitive science, specifically those used by Dehaene et al. (2006) in their studies of the Munduruku. The experimental design is an odd-one-out task: participants, human or model, are shown a set of geometric stimuli and asked to identify the one that deviates from the concept shared by the rest, thereby probing recognition of the concept itself. A minimal sketch of what such an evaluation loop might look like for a VLM follows.
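To make the protocol concrete, here is a minimal sketch of how an odd-one-out trial might be posed to a model. Everything in it is an illustrative assumption: `query_vlm`, the stimulus file names, and the prompt wording are hypothetical stand-ins, not the paper's materials, and the placeholder model simply answers at chance.

```python
# Sketch of an odd-one-out evaluation loop in the style of Dehaene et al.
# (2006). All names here are illustrative assumptions, not the paper's code.
import random

PROMPT = (
    "Five of these six images share a geometric property; one does not. "
    "Answer with only the number (1-6) of the odd one out."
)

def query_vlm(image_paths, prompt):
    """Placeholder: wrap your VLM API of choice here, e.g. send the six
    panels plus the text prompt and return the model's text reply."""
    return str(random.randint(1, 6))  # a chance baseline stands in for a real model

def run_trial(image_paths, odd_index):
    """One trial: show six stimuli, check whether the model names the deviant."""
    reply = query_vlm(image_paths, PROMPT)
    try:
        return int(reply.strip()) == odd_index
    except ValueError:
        return False  # unparseable answers are scored as errors

# Hypothetical trial for the concept "right angle": five right angles drawn
# at varied sizes and orientations, plus one oblique-angle deviant.
images = [f"stimuli/right_angle_{i}.png" for i in range(1, 6)] + ["stimuli/oblique_angle.png"]
accuracy = sum(run_trial(images, odd_index=6) for _ in range(100)) / 100
print(f"right angle: {accuracy:.0%} correct (chance is ~17%)")
```

Because every answer is a single number scored against a known deviant, the same loop applies unchanged to humans and models, which is what lets the paper compare the two directly.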
Findings and Results
The findings reveal that VLMs underperform both American adults and the Munduruku in recognizing geometric concepts. The gap is widest on trials requiring mental rotation: both human groups remained relatively robust on these items, while VLM accuracy dropped sharply. This deficiency suggests an inherent gap between the visual processing of current artificial systems and human cognitive mechanisms, specifically in manipulating and understanding spatial forms across orientations. An illustrative construction of such an item is sketched below.
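To illustrate why mental-rotation items are hard, the sketch below builds one in a common style: five panels are rotations of a single asymmetric shape, and the deviant is its mirror image, so no fixed template matches all the correct panels. This is an assumed construction, not the paper's actual stimuli; the shoelace-sign check at the end is just a programmatic stand-in for the human act of rotating shapes into alignment.

```python
import numpy as np

# An asymmetric "L"-like polygon; asymmetry matters, since a symmetric shape
# would equal its own mirror image and the trial would have no deviant.
base = np.array([(0, 0), (2, 0), (2, 1), (1, 1), (1, 3)], dtype=float)

def rotate(points, degrees):
    """Rotate 2-D points about the origin by the given angle."""
    t = np.radians(degrees)
    r = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
    return points @ r.T

def mirror(points):
    """Reflect points across the y-axis, producing the geometric deviant."""
    return points * np.array([-1.0, 1.0])

# Five panels are rotations of the same shape; the sixth is a rotated mirror image.
panels = [rotate(base, a) for a in (0, 45, 90, 160, 250)] + [mirror(rotate(base, 30))]

def signed_area(points):
    """Shoelace formula: its sign survives rotation but flips under reflection."""
    x, y = points[:, 0], points[:, 1]
    return 0.5 * np.sum(x * np.roll(y, -1) - y * np.roll(x, -1))

print([int(np.sign(signed_area(p))) for p in panels])
# -> [1, 1, 1, 1, 1, -1]: only the mirrored panel differs, and spotting it
#    visually requires rotating the shapes into a common orientation.
```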
Implications and Future Directions
Practically, the research underscores the need to refine VLM frameworks so that their geometric understanding more closely resembles human cognition, in particular by incorporating mechanisms for mental rotation and spatial manipulation. Theoretically, it invites a rethinking of how artificial models assimilate visual data, potentially requiring them to bridge learning from formal educational content with interactive, real-world experience. It also motivates exploring the integration of physical-interaction data into VLM training to strengthen geometric reasoning.
This paper opens avenues for exploring how different origins of learning (printed material versus interactive exposure to the world) shape geometric understanding in both humans and machines. Future inquiries could assess varied VLM architectures, examine disparities attributable to training protocols, and study how formal educational content blends with experiential learning in human subjects. Additional research could also test modifications to VLM algorithms that emulate human-like schemas for geometric cognition and mental rotation.
In summary, this paper establishes benchmarks for mapping the scope and limits of geometric understanding in contemporary VLMs, laying a foundation for advances that bring AI closer to human strengths in spatial and geometric perception.