- The paper introduces the GIQ benchmark for evaluating 3D geometric reasoning in vision models using 224 simulated and real polyhedra.
- The paper demonstrates that models struggle with tasks like monocular 3D reconstruction and mental rotation, while DINOv2 reaches up to 93% accuracy in symmetry detection.
- The paper highlights the need for improved geometry-aware learning to enhance spatial reasoning in applications such as robotics and 3D modeling.
GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra
The paper "GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra" introduces a novel benchmark—GIQ—for assessing the geometric reasoning capabilities of vision and vision-language foundation models. The benchmark is constructed using a diverse set of 224 polyhedra rendered both in synthetic and real-world environments. This includes shapes like Platonic solids, Archimedean solids, Johnson solids, and others, allowing for comprehensive evaluation of geometric intelligence.
Evaluation and Findings
The paper systematically evaluates contemporary models using four experimental frameworks: Monocular 3D Reconstruction, 3D Symmetry Detection, Mental Rotation Tests, and Zero-Shot Shape Classification.
- Monocular 3D Reconstruction: The experiments reveal substantial deficiencies in current models, including Shap-E, Stable Fast 3D, and OpenLRM, at accurately reconstructing even basic geometric forms from single images. Notably, state-of-the-art models trained on extensive datasets still struggle when faced with photographs of real polyhedra (a common evaluation metric for this task is sketched after this list).
- 3D Symmetry Detection: Linearly probed embeddings vary widely in how well they recognize symmetry elements. DINOv2 stands out, achieving up to 93% accuracy in detecting 4-fold rotational symmetry from real-world images, which suggests that foundation models implicitly encode fundamental 3D structural properties (the probing protocol is sketched after this list).
- Mental Rotation Tests: Models are notably weak at judging whether two views, especially one synthetic and one real, depict the same polyhedron under rotation. Performance approaches chance levels, implying that human-like spatial reasoning remains a significant challenge for visual models (a simple pairwise baseline is sketched after this list).
- Zero-Shot Shape Classification: Evaluations of vision-language models such as ChatGPT o3 and Gemini 2.5 Pro reveal systematic errors, especially on complex polyhedral classes, indicating critical gaps in these models' geometric understanding.
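This summary does not specify which reconstruction metric the paper uses, so the sketch below applies the symmetric Chamfer distance, a standard way to score a reconstructed shape against ground truth. The random point clouds are hypothetical stand-ins for points sampled from the predicted and reference meshes.

```python
# Minimal sketch: symmetric Chamfer distance between two point clouds,
# a common metric for scoring monocular 3D reconstructions against
# ground-truth geometry. Point sampling from meshes is assumed to have
# happened upstream (e.g., via a mesh library's surface sampler).
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(pred_pts: np.ndarray, gt_pts: np.ndarray) -> float:
    """Average nearest-neighbor distance, summed over both directions."""
    d_pred_to_gt, _ = cKDTree(gt_pts).query(pred_pts)   # each predicted point -> nearest GT point
    d_gt_to_pred, _ = cKDTree(pred_pts).query(gt_pts)   # each GT point -> nearest predicted point
    return float(d_pred_to_gt.mean() + d_gt_to_pred.mean())

# Hypothetical usage with random stand-in clouds:
rng = np.random.default_rng(0)
pred = rng.normal(size=(2048, 3))
gt = rng.normal(size=(2048, 3))
print(f"Chamfer distance: {chamfer_distance(pred, gt):.4f}")
```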
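The symmetry results come from linear probing, and the sketch below shows a minimal version of that protocol: freeze a DINOv2 backbone, extract one embedding per image, and fit a logistic-regression probe on binary symmetry labels. The hub entry point is real; the image paths and the label loader are hypothetical stand-ins for the benchmark's renders and annotations.

```python
# Minimal linear-probing sketch: frozen DINOv2 embeddings plus a
# logistic-regression probe predicting a binary symmetry label
# (e.g., "has a 4-fold rotation axis").
import torch
from PIL import Image
from torchvision import transforms
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),  # 224 = 16 x 14, matching the ViT-B/14 patch size
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(paths):
    """Return one frozen embedding per image path, shape (N, 768) for ViT-B/14."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    return model(batch).numpy()

# Hypothetical inputs standing in for GIQ's renders and symmetry annotations:
image_paths = [f"renders/solid_{i:03d}.png" for i in range(224)]  # hypothetical paths
has_c4_axis = load_symmetry_labels()  # hypothetical loader returning 0/1 labels

X = embed(image_paths)
X_tr, X_te, y_tr, y_te = train_test_split(X, has_c4_axis, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print(f"probe accuracy: {probe.score(X_te, y_te):.2%}")
```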
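For the mental rotation test, one simple baseline, assumed here rather than taken from the paper, embeds both views with a frozen backbone (for instance, the embed helper above) and thresholds their cosine similarity; near-chance accuracy for such a probe would mirror the reported finding.

```python
# Minimal mental-rotation baseline (an assumption, not the paper's method):
# declare two views "the same solid" when the cosine similarity of their
# frozen embeddings clears a threshold tuned on validation pairs.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def same_solid(emb_a: np.ndarray, emb_b: np.ndarray, threshold: float = 0.85) -> bool:
    # threshold=0.85 is a hypothetical value; in practice it would be
    # tuned on a validation split of same/different view pairs.
    return cosine_similarity(emb_a, emb_b) >= threshold

# Hypothetical usage with embeddings from the embed() helper above:
# same_solid(embed(["synthetic_view.png"])[0], embed(["real_view.png"])[0])
```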
Implications and Future Directions
The research underscores the limitations of existing vision models on 3D geometric reasoning tasks and emphasizes the need for advances in geometry-aware representation learning. GIQ provides a structured platform for measuring future progress in geometric intelligence, particularly for machine vision applications.
From a practical standpoint, improving these capabilities can significantly benefit fields such as robotics and 3D modeling, where precise spatial perception is crucial. Theoretically, the findings invite deeper exploration into how models encode and utilize geometric principles, potentially guiding the development of more robust architectures that integrate explicit 3D reasoning.
Overall, GIQ serves as a diagnostic and evaluative tool that spotlights concrete gaps and motivates methods better aligned with human-level geometric understanding. The paper is a foundational step toward broader evaluation of vision models, driving progress in both AI research and applied contexts.