- The paper introduces the GIQ benchmark for evaluating 3D geometric reasoning in vision models using 224 simulated and real polyhedra.
- The paper demonstrates that models struggle with tasks like monocular 3D reconstruction and mental rotation, while DINOv2 reaches up to 93% accuracy in symmetry detection.
- The paper highlights the need for improved geometry-aware learning to enhance spatial reasoning in applications such as robotics and 3D modeling.
GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra
The paper "GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra" introduces a novel benchmark—GIQ—for assessing the geometric reasoning capabilities of vision and vision-language foundation models. The benchmark is constructed using a diverse set of 224 polyhedra rendered both in synthetic and real-world environments. This includes shapes like Platonic solids, Archimedean solids, Johnson solids, and others, allowing for comprehensive evaluation of geometric intelligence.
Evaluation and Findings
The paper systematically evaluates contemporary models using four experimental frameworks: Monocular 3D Reconstruction, 3D Symmetry Detection, Mental Rotation Tests, and Zero-Shot Shape Classification.
- Monocular 3D Reconstruction: The experiments reveal substantial deficiencies in current models, including Shap-E, Stable Fast 3D, and OpenLRM, at accurately reconstructing even basic geometric forms from single images. Notably, state-of-the-art models trained on extensive datasets still struggle when faced with photographs of real polyhedra (a common evaluation metric for this task is sketched after this list).
- 3D Symmetry Detection: Linearly probed embeddings vary widely in how well they recognize symmetry elements. DINOv2 stands out, achieving up to 93% accuracy in detecting 4-fold rotational symmetry from real-world images, which suggests that foundation models implicitly encode fundamental 3D structural properties (the probing protocol is sketched after this list).
- Mental Rotation Tests: Models are notably weak at judging whether two views, especially one synthetic and one real, depict the same polyhedron under rotation. Performance approaches chance levels, implying that human-like spatial reasoning remains a significant challenge for visual models (a simple pairwise baseline is sketched after this list).
- Zero-Shot Shape Classification: Evaluations of vision-language models such as ChatGPT o3 and Gemini 2.5 Pro reveal systematic errors, especially on complex polyhedral classes, indicating critical gaps in these models' geometric understanding.
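This summary does not specify which reconstruction metric the paper uses, so the sketch below applies the symmetric Chamfer distance, a standard way to score a reconstructed shape against ground truth. The random point clouds are hypothetical stand-ins for points sampled from the predicted and reference meshes.

```python
# Minimal sketch: symmetric Chamfer distance between two point clouds,
# a common metric for scoring monocular 3D reconstructions against
# ground-truth geometry. Point sampling from meshes is assumed to have
# happened upstream (e.g., via a mesh library's surface sampler).
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(pred_pts: np.ndarray, gt_pts: np.ndarray) -> float:
    """Average nearest-neighbor distance, summed over both directions."""
    d_pred_to_gt, _ = cKDTree(gt_pts).query(pred_pts)   # each predicted point -> nearest GT point
    d_gt_to_pred, _ = cKDTree(pred_pts).query(gt_pts)   # each GT point -> nearest predicted point
    return float(d_pred_to_gt.mean() + d_gt_to_pred.mean())

# Hypothetical usage with random stand-in clouds:
rng = np.random.default_rng(0)
pred = rng.normal(size=(2048, 3))
gt = rng.normal(size=(2048, 3))
print(f"Chamfer distance: {chamfer_distance(pred, gt):.4f}")
```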
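The symmetry results come from linear probing, and the sketch below shows a minimal version of that protocol: freeze a DINOv2 backbone, extract one embedding per image, and fit a logistic-regression probe on binary symmetry labels. The hub entry point is real; the image paths and the label loader are hypothetical stand-ins for the benchmark's renders and annotations.

```python
# Minimal linear-probing sketch: frozen DINOv2 embeddings plus a
# logistic-regression probe predicting a binary symmetry label
# (e.g., "has a 4-fold rotation axis").
import torch
from PIL import Image
from torchvision import transforms
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),  # 224 = 16 x 14, matching the ViT-B/14 patch size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(paths):
    """Return one frozen embedding per image path, shape (N, 768) for ViT-B/14."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    return model(batch).numpy()

# Hypothetical inputs standing in for GIQ's renders and symmetry annotations:
image_paths = [f"renders/solid_{i:03d}.png" for i in range(224)]  # hypothetical paths
has_c4_axis = load_symmetry_labels()  # hypothetical loader returning 0/1 labels

X = embed(image_paths)
X_tr, X_te, y_tr, y_te = train_test_split(X, has_c4_axis, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2%}")
```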
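For the mental rotation test, one simple baseline, assumed here rather than taken from the paper, embeds both views with a frozen backbone (for instance, the embed helper above) and thresholds their cosine similarity; near-chance accuracy for such a probe would mirror the reported finding.

```python
# Minimal mental-rotation baseline (an assumption, not the paper's method):
# declare two views "the same solid" when the cosine similarity of their
# frozen embeddings clears a threshold tuned on validation pairs.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_solid(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 0.85) -> bool:
    # threshold=0.85 is a hypothetical value; in practice it would be
    # tuned on a validation split of same/different view pairs.
    return cosine_similarity(emb_a, emb_b) >= threshold

# Hypothetical usage with embeddings from the embed() helper above:
# same_solid(embed(["synthetic_view.png"])[0], embed(["real_view.png"])[0])
```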
Implications and Future Directions
The research underscores the limitations of existing vision models on 3D geometric reasoning tasks and emphasizes the need for advances in geometry-aware representation learning. GIQ provides a structured platform for measuring future progress in geometric intelligence, particularly for machine vision applications.
From a practical standpoint, improving these capabilities can significantly benefit fields such as robotics and 3D modeling, where precise spatial perception is crucial. Theoretically, the findings invite deeper exploration into how models encode and utilize geometric principles, potentially guiding the development of more robust architectures that integrate explicit 3D reasoning.
Overall, GIQ serves as a diagnostic and evaluative tool that spotlights concrete gaps and motivates methods better aligned with human-level geometric understanding. The paper is a foundational step toward broader evaluation of vision models, driving progress in both AI research and applied contexts.