
GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra

Published 9 Jun 2025 in cs.CV (arXiv:2506.08194v2)

Abstract: Monocular 3D reconstruction methods and vision-language models (VLMs) demonstrate impressive results on standard benchmarks, yet their true understanding of geometric properties remains unclear. We introduce GIQ, a comprehensive benchmark specifically designed to evaluate the geometric reasoning capabilities of vision and vision-language foundation models. GIQ comprises synthetic and real-world images of 224 diverse polyhedra - including Platonic, Archimedean, Johnson, and Catalan solids, as well as stellations and compound shapes - covering varying levels of complexity and symmetry. Through systematic experiments involving monocular 3D reconstruction, 3D symmetry detection, mental rotation tests, and zero-shot shape classification tasks, we reveal significant shortcomings in current models. State-of-the-art reconstruction algorithms trained on extensive 3D datasets struggle to reconstruct even basic geometric forms accurately. While foundation models effectively detect specific 3D symmetry elements via linear probing, they falter significantly in tasks requiring detailed geometric differentiation, such as mental rotation. Moreover, advanced vision-language assistants exhibit remarkably low accuracy on complex polyhedra, systematically misinterpreting basic properties like face geometry, convexity, and compound structures. GIQ is publicly available, providing a structured platform to highlight and address critical gaps in geometric intelligence, facilitating future progress in robust, geometry-aware representation learning.

Summary

  • The paper introduces the GIQ benchmark for evaluating 3D geometric reasoning in vision models using 224 simulated and real polyhedra.
  • The paper demonstrates that models struggle with tasks like monocular 3D reconstruction and mental rotation, while DINOv2 reaches up to 93% accuracy in symmetry detection.
  • The paper highlights the need for improved geometry-aware learning to enhance spatial reasoning in applications such as robotics and 3D modeling.

GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra

The paper "GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra" introduces a novel benchmark—GIQ—for assessing the geometric reasoning capabilities of vision and vision-language foundation models. The benchmark is constructed using a diverse set of 224 polyhedra rendered both in synthetic and real-world environments. This includes shapes like Platonic solids, Archimedean solids, Johnson solids, and others, allowing for comprehensive evaluation of geometric intelligence.

Evaluation and Findings

The study systematically evaluates contemporary models using four experimental frameworks: Monocular 3D Reconstruction, 3D Symmetry Detection, Mental Rotation Tests, and Zero-Shot Shape Classification.

  1. Monocular 3D Reconstruction: The experiments reveal substantial deficiencies in current models, including Shap-E, Stable Fast 3D, and OpenLRM, at accurately reconstructing even basic geometric forms from single images. Notably, state-of-the-art models trained on extensive datasets still struggle when given images of real polyhedra.
  2. 3D Symmetry Detection: Linearly probed embeddings vary in how well they capture symmetry elements. DINOv2 stands out, achieving up to 93% accuracy in detecting 4-fold rotational symmetry from real-world images, which suggests that foundation models implicitly encode fundamental 3D structural properties.
  3. Mental Rotation Tests: The ability of models to discern identical polyhedral shapes under rotation, especially between synthetic and real images, is notably weak. Performance approaches chance levels, implying significant challenges in achieving human-like spatial reasoning in visual models.
  4. Zero-Shot Shape Classification: In evaluating vision-language models (VLMs) such as ChatGPT o3 and Gemini 2.5 Pro, the authors identified systematic errors, especially on complex polyhedral classes, indicating critical gaps in current models' geometric understanding.

Implications and Future Directions

The research underscores the limitations of existing vision models in handling 3D geometric reasoning tasks, emphasizing the need for advancements in geometry-aware representation learning. The benchmark laid out by GIQ presents a structured platform that can facilitate future progress in enhancing geometric intelligence, particularly in machine vision applications.

From a practical standpoint, improving these capabilities can significantly benefit fields such as robotics and 3D modeling, where precise spatial perception is crucial. Theoretically, the findings invite deeper exploration into how models encode and utilize geometric principles, potentially guiding the development of more robust architectures that integrate explicit 3D reasoning.

Overall, this benchmark serves as a critical diagnostic and evaluative tool, spotlighting gaps and prompting the development of sophisticated methods better aligned with human-level geometric understanding. The paper provides a foundational step toward expanding the scope of evaluation for vision models, driving progress in both AI research and application contexts.


Explain it Like I'm 14

Overview

This paper introduces GIQ, a new test set (called a benchmark) that checks how well today’s AI vision systems understand 3D shapes. The authors use a special group of 3D objects called polyhedra—things like cubes, pyramids, and more complex “star-like” solids—to see whether AI can truly reason about geometry, not just recognize patterns in flat images. They include both computer-made pictures and photos of real paper models to make the tests fair and realistic.

Key Questions

The paper asks simple but important questions:

  • Can AI rebuild a 3D shape from just one photo?
  • Can AI spot true 3D symmetry (like whether a shape looks the same after a 90° or 72° rotation)?
  • Can AI tell if two pictures show the same object from different angles (a “mental rotation” skill humans use)?
  • Can AI name the type of polyhedron it sees without being trained specifically for that task?

How They Did It

The authors built a carefully designed dataset with 224 different polyhedra:

  • Synthetic images: computer-rendered pictures from many viewpoints under consistent lighting.
  • Real-world images: photos of hand-built paper models, taken indoors and outdoors in different conditions.
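
To make the synthetic side concrete, here is a minimal sketch of what such a rendering setup can look like in Mitsuba 3, wired up with the settings reported in the paper and quoted in the glossary (direct integrator, constant environment emitter at radiance 0.3, 1024 low-discrepancy samples per pixel, near/far clipping planes of 10^-3 and 10^8). The mesh filename, camera placement, and image resolution are placeholders, not the paper's exact script.

```python
import mitsuba as mi

mi.set_variant('scalar_rgb')

scene = mi.load_dict({
    'type': 'scene',
    # Direct illumination only: global illumination is disabled, as in the paper.
    'integrator': {'type': 'direct'},
    'sensor': {
        'type': 'perspective',
        'near_clip': 1e-3, 'far_clip': 1e8,
        # Placeholder camera pose looking at the origin.
        'to_world': mi.ScalarTransform4f.look_at(origin=[0, 0, 4],
                                                 target=[0, 0, 0],
                                                 up=[0, 1, 0]),
        'film': {'type': 'hdrfilm', 'width': 512, 'height': 512},
        # 1024 low-discrepancy samples per pixel.
        'sampler': {'type': 'ldsampler', 'sample_count': 1024},
    },
    # Constant environment emitter with radiance (0.3, 0.3, 0.3).
    'light': {'type': 'constant',
              'radiance': {'type': 'rgb', 'value': [0.3, 0.3, 0.3]}},
    # Placeholder mesh path; two-sided diffuse material as described.
    'object': {'type': 'ply', 'filename': 'polyhedron.ply',
               'bsdf': {'type': 'twosided',
                        'material': {'type': 'diffuse'}}},
})

mi.util.write_bitmap('render.png', mi.render(scene))
```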

Then they ran four types of tests:

Monocular 3D Reconstruction

This means trying to rebuild a full 3D model from just one image: think of making a 3D toy from a single snapshot.

They tested three popular systems:

  • Shap-E
  • Stable Fast 3D
  • OpenLRM

3D Symmetry Detection

They asked: does the shape have certain kinds of symmetry? For example:

  • Central point reflection: the shape looks the same if you flip it through its center.
  • 4-fold rotation: the shape looks the same every 90° turn.
  • 5-fold rotation: the shape looks the same every 72° turn.
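
These symmetry properties are exact statements about a shape's geometry, so they can be checked numerically on the ground-truth meshes. A minimal numpy/scipy sketch (the cube example, axis choice, and tolerance are illustrative, not from the paper):

```python
import numpy as np
from scipy.spatial import cKDTree

def has_n_fold_symmetry(vertices, axis, n, tol=1e-6):
    """Check whether a vertex set maps onto itself under a 2*pi/n rotation about `axis`."""
    axis = np.asarray(axis, dtype=float)
    axis /= np.linalg.norm(axis)
    theta = 2.0 * np.pi / n
    # Rodrigues' formula: rotation matrix for angle theta about `axis`.
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
    rotated = vertices @ R.T
    # Every rotated vertex must coincide with some original vertex.
    dists, _ = cKDTree(vertices).query(rotated)
    return bool(np.all(dists < tol))

# A cube centred at the origin has a 4-fold axis through its faces, but no 5-fold axis.
cube = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)], float)
print(has_n_fold_symmetry(cube, [0, 0, 1], 4))   # True
print(has_n_fold_symmetry(cube, [0, 0, 1], 5))   # False
```

Central point reflection can be tested the same way by comparing `vertices` against `-vertices` with the same nearest-neighbour query.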

To do this, they used “linear probing.” Imagine peeking inside a trained AI and adding a simple checker that asks its internal features, “Is there 4-fold symmetry here?” If even a simple checker can answer, the model likely learned useful geometric signals.
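
A minimal sketch of this setup, assuming frozen DINOv2 features loaded from torch.hub and substituting scikit-learn's class-balanced logistic regression for the paper's weighted binary cross-entropy probe; the stand-in tensors only exist so the sketch runs:

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

# Frozen DINOv2 featurizer (ViT-B/14) from torch.hub; no fine-tuning.
dinov2 = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14').eval()

@torch.no_grad()
def embed(images):
    """images: (N, 3, 224, 224) tensors, normalized as DINOv2 expects."""
    return dinov2(images).cpu().numpy()   # (N, 768) global embeddings

# Stand-in data; in practice: GIQ synthetic renders for training, wild
# photos for testing, labelled e.g. "has a 4-fold rotation axis" (0/1).
train_x, train_y = torch.randn(8, 3, 224, 224), np.array([0, 1] * 4)
wild_x, wild_y = torch.randn(8, 3, 224, 224), np.array([1, 0] * 4)

# The "probe" is a single linear classifier on the frozen features.
probe = LogisticRegression(max_iter=1000, class_weight='balanced')
probe.fit(embed(train_x), train_y)
print(balanced_accuracy_score(wild_y, probe.predict(embed(wild_x))))
```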

Mental Rotation Test

This checks if an AI can tell whether two images (one synthetic, one real photo) show the same object, just rotated. Humans do this naturally—like matching a Lego model you built with a rotated picture of it.
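
For models, the pairwise version of this test is often posed as an embedding comparison. A minimal sketch, assuming a cosine-similarity decision rule with an arbitrary threshold (the paper's exact protocol may differ; the random vectors only make the sketch executable):

```python
import numpy as np

def same_object(emb_a, emb_b, threshold=0.8):
    """Decide 'same shape, different rotation' by cosine similarity of embeddings."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    return float(a @ b) >= threshold

# emb_synthetic / emb_real would come from the same frozen featurizer as above.
rng = np.random.default_rng(0)
emb_synthetic, emb_real = rng.normal(size=768), rng.normal(size=768)
print(same_object(emb_synthetic, emb_real))
```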

Zero-Shot Classification

They showed pictures to advanced vision-language assistants (AIs that can look at pictures and talk about them) and asked: “What is the name of this polyhedron?” No training on this dataset—just general knowledge.

Models tested included:

  • ChatGPT o3
  • ChatGPT o4-mini-high
  • Gemini 2.5 Pro
  • Claude 3.7 Sonnet
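
A hedged sketch of what such a zero-shot query can look like with the OpenAI Python client; the prompt wording, helper function, and model name are illustrative, not the paper's protocol:

```python
import base64
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

def name_polyhedron(image_path, model="gpt-4o"):  # model name is a placeholder
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "What is the name of the polyhedron in this image? "
                         "Answer with the name only."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

# print(name_polyhedron("snub_cube.png"))  # hypothetical image file
```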

Main Findings and Why They Matter

Here are the most important results:

  • Rebuilding 3D shapes from one image is still very hard. Even with millions of training examples, popular reconstruction tools often failed on anything beyond simple shapes or neat synthetic photos. They struggled especially with real-world photos of complex solids.
  • Some vision models quietly learn about symmetry. Using linear probing, DINOv2’s features were surprisingly good at recognizing certain 3D symmetries (like 4-fold rotation), even though it wasn’t trained specifically for 3D geometry. This suggests these models store useful geometric clues inside.
  • Mental rotation across synthetic and real photos is a big challenge. When asked to match a computer rendering to a real photo of the same object, most models performed near random guessing on hard cases. They had trouble telling apart very similar shapes viewed from different angles.
  • Naming complex polyhedra is tough for vision-language assistants. They could name simple, famous shapes (like the Platonic solids: cube, tetrahedron, etc.), but got confused by more complex types (like Johnson and Catalan solids, compounds, or non-convex “star” shapes). They often misread basic properties such as:
    • What kinds of faces the shape has (triangles, squares, pentagons, etc.)
    • Whether the shape is convex (bulging outward) or non-convex (has inward dents); a quick convexity check is sketched after this list
    • Whether the object is a compound (two or more shapes combined)
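
The convexity property in particular has a crisp geometric test. A minimal sketch on the vertex set, with the caveat noted in the comments (vertex positions alone give only a necessary condition):

```python
import numpy as np
from scipy.spatial import ConvexHull

def vertices_in_convex_position(vertices):
    """Necessary check for convexity: every vertex must be extreme, i.e. lie
    on the convex hull of the vertex set. Not sufficient on its own: the
    Schonhardt polyhedron is nonconvex yet has all vertices in convex
    position, so face data is needed for a complete test."""
    pts = np.asarray(vertices, dtype=float)
    return len(ConvexHull(pts).vertices) == len(pts)

cube = np.array([[x, y, z] for x in (0, 1) for y in (0, 1) for z in (0, 1)], float)
dented = np.vstack([cube, [[0.5, 0.5, 0.5]]])   # extra vertex pushed inside
print(vertices_in_convex_position(cube), vertices_in_convex_position(dented))
# -> True False
```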

These findings matter because they reveal a gap: today’s impressive AI models don’t consistently “think in 3D” the way humans do. They can do well on many benchmarks, but still struggle with core geometric understanding, especially when images look different from their training data.

Implications and Impact

  • Better geometry-aware AI is needed. If we want reliable robots, AR/VR systems, or scientific tools that deal with shapes in the real world, models must understand 3D structure—not just 2D patterns.
  • GIQ is a practical tool for progress. By offering clear, ground-truth tests on symmetry, complexity, and shape categories across both synthetic and real photos, GIQ helps researchers pinpoint what their models can and can’t do—and measure improvements over time.
  • Hidden strengths can be tapped. The success of simple symmetry checks using DINOv2’s features suggests that some geometric understanding already exists inside popular models. With the right training and evaluation, we can build on those strengths.

In short, GIQ shines a light on where AI falls short in geometric reasoning and offers a roadmap to build models that truly “get” 3D.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated list of concrete, actionable gaps that the paper leaves unresolved and that future work could address:

  • Quantitative reconstruction metrics are missing: the monocular 3D reconstruction evaluation is largely qualitative (front/side renders), with no Chamfer/L2/IoU/normal consistency or mesh-topology metrics against ground truth (which could be done for synthetic shapes; wild models would need scans). A minimal Chamfer-distance sketch appears after this list.
  • No ground-truth 3D for real objects: the paper does not provide registered 3D meshes or poses for the paper models, precluding quantitative evaluation of reconstruction and rotation estimation on wild images.
  • Limited annotation scope for symmetry: symmetry detection considers only central inversion, 4-fold, and 5-fold rotations; it omits plane reflections, 2-, 3-, 6-fold rotations, multiple axes, dihedral/icosahedral symmetry groups, chirality, and symmetry-group structure.
  • Johnson solids excluded due to “viewpoint ambiguity”: symmetry detection excludes Johnson solids rather than proposing viewpoint-invariant annotation protocols (e.g., 3D labels tied to object frames) or evaluation strategies robust to ambiguous views.
  • Ambiguity in dataset composition counts: the table suggests overlapping categories (e.g., stellations vs Kepler–Poinsot and compounds) and does not clearly disambiguate unique object counts per group; a precise accounting and a non-overlapping taxonomy are needed.
  • Synthetic rendering realism constraints: renderings disable global illumination and use a single diffuse material and constant environment lighting; the sensitivity of results to realistic materials, indirect light, specularities, textures, and shadows is not studied.
  • Color-coding bias in wild images: many paper models use intentional color coding (e.g., convex vs concave faces), which may cue shortcuts; robustness to recoloring or grayscale conversion is not evaluated.
  • Domain-gap decomposition is not quantified: poor synthetic-to-real generalization (e.g., in MRT) is shown but not dissected—effects of background, lighting, texture, noise, resolution, and cropping are not isolated with controlled ablations.
  • Background removal for reconstruction may bias results: wild images are center-cropped and background-removed prior to reconstruction; the impact of this preprocessing on fairness and real-world applicability is not assessed.
  • No human baseline: the paper lacks human performance on the hard MRT (synthetic-wild) and symmetry tasks to calibrate difficulty and quantify the gap between humans and models.
  • MRT task scope is narrow: only pairwise “same/different under rotation” is tested; rotation angle/axis estimation, rotation-invariant embedding construction, and equivariance tests (e.g., SE(3)-equivariant encoders) are not evaluated.
  • Probe-only symmetry detection: only linear probing is tested; whether non-linear probes, fine-tuning, contrastive training with symmetry labels, or group-equivariant architectures improve performance is unexplored.
  • Multi-view and temporal inputs are not leveraged: all evaluations use single images; multi-view/video protocols and their gains (for reconstruction, MRT, symmetry) are not explored.
  • VLM zero-shot protocol is not controlled: models appear to have heterogeneous tool access (e.g., web search by Gemini), making fairness unclear; tool-free, standardized prompts and confidence/top-k reporting are needed.
  • Naming-only classification lacks structure: zero-shot classification asks for a name with no ontology, multiple-choice, or property-structured prompts (faces/edges/vertices, convexity, uniformity), limiting diagnostic power and comparability.
  • Error analysis is anecdotal: VLM failures are illustrated qualitatively but lack systematic confusion matrices, error taxonomies, and correlations with shape properties (face counts, symmetry group, nonconvexity, compound structure).
  • Hard split for MRT is manually curated: criteria for selecting “visually/geometrically similar” pairs are subjective and not reproducible; algorithmic, measurable similarity criteria and dataset statistics are needed.
  • Viewpoint sampling and occlusion robustness are untested: effects of extreme viewpoints, partial occlusion, clutter, truncation, and self-shadowing on symmetry/MRT/classification are not studied.
  • Generalization beyond polyhedra is open: it is unclear whether conclusions transfer to other geometric families (curved/smooth solids, CAD parts, articulated objects); extending the benchmark to these domains is an open direction.
  • Lack of pose annotations for wild images: without camera/object pose, it is impossible to study viewpoint invariance/equivariance or to align synthetic and real views for controlled tests.
  • Compound structure detection is not formalized: there is no task to decompose compound polyhedra into constituent solids or to reason about interpenetration/arrangement correctness.
  • Convexity/concavity detection is not benchmarked: despite highlighting concave forms, there is no formal convex vs nonconvex classification task nor self-intersection detection/penalty in reconstruction.
  • Probe training uses only synthetic data: symmetry probes are trained only on synthetic embeddings; the benefits of mixed or wild training and domain adaptation methods (e.g., CORAL, adversarial DA) are not evaluated.
  • Statistical rigor is limited: confidence intervals, repeated runs, probe-size/regularization ablations, and significance testing are not reported for the symmetry/MRT results.
  • Reconstruction target representation is unclear for nonconvex/compound shapes: evaluation does not specify expectations for self-intersections, manifoldness, or how compounds should be represented/reconstructed.
  • Single-viewpoint per classification: VLMs are shown single images; performance with multi-view or 3D-aware prompting (e.g., set of views, pose-annotated panels) is not tested.
  • Missing tasks on explicit geometric properties: there are no dedicated tasks for predicting face types, per-vertex valence, edge-length ratios, Euler characteristic, dual identification, or symmetry-group assignment from images.
  • Dataset scale may be insufficient for training: with 224 unique shapes, the benchmark is suitable for evaluation but small for training/fine-tuning; the effect of scaling via procedural generation or augmentation is an open question.
  • Reproducibility details are partial: the main text defers dataset splits and additional results to the supplementary; standardized code, splits, metrics, and versioned releases are needed for fully reproducible benchmarking.
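
As a reference point for the first gap above, here is one common convention for the Chamfer distance between sampled point clouds (conventions differ on squaring and normalization; the stand-in data only makes the sketch run):

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean squared nearest-neighbour distance
    from A to B plus the same from B to A."""
    d_ab, _ = cKDTree(points_b).query(points_a)  # each point in A -> nearest in B
    d_ba, _ = cKDTree(points_a).query(points_b)  # each point in B -> nearest in A
    return float(np.mean(d_ab ** 2) + np.mean(d_ba ** 2))

# Stand-in clouds; in practice: points sampled from the ground-truth mesh
# and from the reconstructed mesh surface.
rng = np.random.default_rng(0)
gt = rng.uniform(-1, 1, size=(2048, 3))
pred = gt + rng.normal(scale=0.01, size=gt.shape)
print(chamfer_distance(gt, pred))
```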

Practical Applications

Immediate Applications

The following applications can be deployed now, leveraging the GIQ dataset, the paper’s evaluation protocols, and the observed strengths (e.g., linear-probe symmetry detection) and weaknesses (e.g., reconstruction and mental rotation failures) of current models.

  • Geometry-aware model QA and CI gating
    • Sectors: software, robotics, AR/VR, 3D scanning, e-commerce visualization
    • What to do: Integrate GIQ test suites into model evaluation pipelines to gate releases. Track per-category performance (e.g., convex vs. nonconvex, symmetry groups, synthetic vs. real) and fail builds that regress on geometric reasoning.
    • Tools/workflows: Batch inference on GIQ; automatic reports for monocular 3D reconstruction fidelity, linear-probe symmetry detection, and mental-rotation accuracy; CI dashboards showing synthetic-to-real generalization.
    • Assumptions/dependencies: Benchmarks on polyhedra correlate with downstream spatial tasks; dataset licensing/attribution; reproducible preprocessing (cropping/background removal).
  • Plug-and-play symmetry detection module via linear probing
    • Sectors: robotics (manipulation, bin picking), manufacturing (quality inspection), CAD/graphics, warehousing
    • What to do: Use DINOv2 embeddings + a linear probe (trained on GIQ synthetic images) to detect 3D symmetry elements (central inversion, 4-fold, 5-fold) from single views. Use predicted symmetry to reduce pose ambiguity, plan grasps, or validate design constraints.
    • Tools/workflows: Embed incoming frames; feed to trained linear layer; if symmetry is detected (e.g., 4-fold), restrict pose hypotheses/ICP initializations accordingly (a sketch of this step appears after this list); flag mismatches in QC.
    • Assumptions/dependencies: Current probes cover only a subset of symmetry types; trained on synthetic data but shown to generalize reasonably to real (wild) images—validate on in-domain parts.
  • Data curation and targeted pretraining augmentation
    • Sectors: ML platform teams, foundation model developers
    • What to do: Add GIQ-like synthetic polyhedra and auxiliary tasks (e.g., symmetry prediction, convexity detection) during pretraining/fine-tuning to improve 3D awareness; adopt curricula from simple (Platonic) to complex (stellations/compounds).
    • Tools/workflows: Differentiable or PBR rendering; multi-task losses; stratified sampling by complexity and symmetry class.
    • Assumptions/dependencies: Gains transfer beyond polyhedra; compute/bandwidth for synthetic data generation.
  • Risk guidance for VLM use in spatial/engineering contexts
    • Sectors: enterprise AI governance, policy teams in regulated industries (healthcare devices, industrial automation)
    • What to do: Issue internal usage policies noting VLMs’ systematic errors on geometry (convexity, face types, compounds). Restrict or require human verification for geometry-critical decisions (e.g., design review, tolerance checks).
    • Tools/workflows: Model cards with GIQ metrics; red-team checklists; automatic routing to human-in-the-loop when geometric uncertainty is high.
    • Assumptions/dependencies: Organizational adoption of AI governance; clear thresholds tied to GIQ scores.
  • Semi-automatic labeling of symmetry at scale
    • Sectors: dataset creation (e.g., Objaverse-like corpora), academic labs, model vendors
    • What to do: Use DINOv2+linear probes to pre-label symmetry attributes on large 2D/3D repositories; prioritize human review on borderline cases.
    • Tools/workflows: Batch embedding and inference; uncertainty-based sampling for human validation.
    • Assumptions/dependencies: Probe’s precision/recall is adequate for bootstrapping; efficient human QA loop.
  • Graphics/CAD quality checks from 2D views
    • Sectors: CAD/PLM software, content marketplaces, 3D asset pipelines
    • What to do: Render canonical views of meshes and run symmetry/convexity checks to detect modeling errors (e.g., unintended nonconvexities, missing symmetries).
    • Tools/workflows: Offline renderer; image featurizer + linear probes; QC reports integrated into DCC toolchains.
    • Assumptions/dependencies: Consistent view/render parameters; potential false positives for textured/occluded assets—validate with mesh-level checks when available.
  • Education and assessment in spatial reasoning
    • Sectors: education (K–12, higher ed), training (architecture, engineering), cognitive psychology
    • What to do: Deploy GIQ-based mental-rotation tasks and symmetry-identification exercises in curricula and studies; compare human vs. model performance.
    • Tools/workflows: Web-based quizzes/apps; item banks stratified by geometric complexity and symmetry.
    • Assumptions/dependencies: IRB/ethics for human studies; accessibility and fairness considerations.
  • Domain gap analysis for perception stacks
    • Sectors: autonomy/robotics R&D, applied CV teams
    • What to do: Use synthetic vs. wild GIQ splits to quantify synthetic-to-real generalization gaps for monocular reconstruction and embedding similarity; guide data augmentation and camera/lighting design choices.
    • Tools/workflows: Paired evaluations; ablation studies on photometric augmentation; camera placement/lens selection experiments.
    • Assumptions/dependencies: Polyhedra-induced gaps mirror gaps in target environments (e.g., reflective parts, clutter).
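
As a concrete instance of the pose-hypothesis idea flagged above, a minimal sketch of expanding one pose estimate into the rotations that a detected n-fold axis leaves indistinguishable; the axis and count would come from a symmetry detector such as the linear probe above, and all names here are illustrative:

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def symmetry_pose_hypotheses(base_rotation, axis, n):
    """Expand one pose estimate into the n poses that an n-fold symmetry
    axis makes visually indistinguishable; use them to seed ICP/refinement."""
    axis = np.asarray(axis, dtype=float)
    axis /= np.linalg.norm(axis)
    return [R.from_rotvec(axis * (2 * np.pi * k / n)) * base_rotation
            for k in range(n)]

# A detected 4-fold axis collapses the orientation search to four
# 90-degree-separated hypotheses about that axis.
hyps = symmetry_pose_hypotheses(R.identity(), axis=[0, 0, 1], n=4)
print([h.as_euler('xyz', degrees=True).round(1) for h in hyps])
```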

Long-Term Applications

These applications require further research, scaling, or engineering development to meet performance, robustness, or certification needs revealed by GIQ.

  • Geometry-aware foundation models (3D-consistent VLMs and vision encoders)
    • Sectors: software, robotics, AR/VR, edge AI
    • What to build: Models that explicitly encode 3D properties (symmetry groups, convexity, face/edge/vertex structure), with SO(3)-equivariance or object-centric 3D priors; VLMs that reliably reason about complex polyhedra and fine shape differences.
    • Dependencies: Larger curated 3D corpora with geometric annotations; new training objectives (equivariance, geometric consistency, multi-view constraints); efficient inference.
  • Reliable single-image CAD reconstruction of polyhedral parts
    • Sectors: reverse engineering, digital catalogs, e-commerce, manufacturing
    • What to build: Pipelines that convert photos into clean, watertight polyhedral meshes with correct face types and symmetries, suitable for downstream CAD/CAM.
    • Dependencies: Advances in single-view 3D reconstruction (GIQ shows current failures), robust material/lighting handling, geometric post-processing (planarization, snapping to symmetry).
  • Certified AI for spatially critical applications
    • Sectors: healthcare (surgical planning/robots), industrial robotics, public safety
    • What to build: Standardized certification suites (GIQ-like tasks plus domain-specific assets) that quantify geometric reasoning competency for procurement and compliance.
    • Dependencies: Standards bodies and regulators; demonstrated correlation between benchmark scores and safety outcomes; test coverage beyond polyhedra.
  • AR/VR scene understanding with symmetry-aware alignment
    • Sectors: AR authoring, industrial AR, gaming
    • What to build: Real-time recognition of object symmetries to improve pose tracking, alignment/snapping, occlusion handling, and content anchoring on symmetric objects.
    • Dependencies: On-device efficient symmetry detectors; robust performance under motion blur, occlusion, and clutter.
  • Materials science and crystallography assistance
    • Sectors: materials, chemistry, geology
    • What to build: Tools that infer 3D symmetry/space groups from 2D images (micrographs) or limited views; accelerate structure identification and hypothesis testing.
    • Dependencies: Domain-specific datasets and labels; physics-informed constraints; validation against experimental pipelines.
  • Geometry-aware robotic manipulation and assembly
    • Sectors: manufacturing, logistics, electronics assembly
    • What to build: Perception-action stacks that exploit symmetry and convexity to resolve pose ambiguities, plan grasps for symmetric parts, and verify correct assembly of compounds/subassemblies.
    • Dependencies: Tight integration with tactile/force sensing and pose estimators; formal guarantees under occlusions; real-time constraints.
  • Constrained generative 3D design tools
    • Sectors: product design, architecture, 3D printing
    • What to build: Generative models that respect explicit geometric constraints (symmetry groups, convexity, face regularity) for ideation, optimization, and manufacturability checks.
    • Dependencies: Differentiable constraint enforcement; user-in-the-loop design workflows; metrics linking constraints to aesthetics/functional performance.
  • 3D media forensics and integrity checks
    • Sectors: security, media platforms, legal
    • What to build: Detectors that flag impossible or inconsistent geometry (e.g., violated symmetries, nonphysical structures) in 3D content and rendered imagery.
    • Dependencies: Generalization beyond polyhedra; datasets of manipulations; low false-positive rates in the wild.
  • Adaptive spatial reasoning tutors
    • Sectors: education, professional training (pilots, surgeons)
    • What to build: Personalized training systems that adjust difficulty based on performance on mental rotation and symmetry tasks, with transfer to real-world skills.
    • Dependencies: Longitudinal efficacy studies; domain transfer validation; inclusive design across populations.
  • SLAM and mapping priors from symmetry
    • Sectors: drones, mobile robotics, indoor mapping
    • What to build: Incorporate symmetry cues into loop-closure and pose-graph optimization, particularly in texture-poor or repetitive environments.
    • Dependencies: Algorithmic integration with geometric SLAM back-ends; robustness under partial views and dynamic scenes.

Cross-cutting assumptions and dependencies

  • Polyhedra as a proxy: While polyhedra offer clean, unambiguous ground truth and a controlled complexity ladder, some domains involve texture, deformability, or irregular topology not represented in GIQ. Validate transfer to your domain.
  • Synthetic-to-real gap: Linear probes trained on synthetic data showed promising transfer for specific symmetries, but generalization can degrade (as seen in mental rotation). Domain adaptation and data augmentation may be required.
  • Task coverage: Current probe tasks target central inversion and 4- and 5-fold rotations; broader symmetry groups and richer geometric attributes (e.g., edge/face counts, compound detection) will need expanded labels and models.
  • Compute and tooling: Rendering, pretraining, and CI-scale evaluation require reliable pipelines, storage, and inference acceleration; ensure reproducible preprocessing (cropping, background removal) to match benchmark conditions.

Glossary

  • Adversarial perturbations: Small, often imperceptible input changes crafted to cause model errors. "adversarial perturbations"
  • Archimedean solids: Thirteen convex polyhedra with regular faces of multiple types arranged identically around each vertex. "Archimedean solids"
  • Balanced accuracy: The average of the true positive rate and true negative rate, (TPR + TNR)/2, which handles class imbalance. "Balanced accuracy, computed as:"
  • Bidirectional Reflectance Distribution Function (BRDF): Function describing how light is reflected at an opaque surface. "a two-sided diffuse BRDF"
  • Catalan solids: Thirteen face-transitive (but not vertex-transitive) duals of the Archimedean solids. "Catalan solids"
  • Central point reflection: Symmetry under inversion through a center point. "central point reflection (invariance under inversion through a central point)"
  • CLIP: A vision-language model aligning image and text embeddings via contrastive learning. "CLIP~\cite{radford2021learning}"
  • Compound polyhedra: Structures formed by the symmetric combination of multiple polyhedra. "compound polyhedra—structures formed by the symmetric combination of multiple polyhedra"
  • ConvNext: A convolutional architecture competitive with vision transformers. "ConvNext~\cite{liu2022convnet}"
  • Convex (polyhedron): A polyhedron whose every internal connecting segment lies entirely within the shape. "convex polyhedra"
  • DeiT III: A data-efficient image transformer variant trained with strong augmentation and distillation. "DeiT III~\cite{touvron2022deit}"
  • Diffuse shading: Rendering that models surfaces as reflecting light uniformly in all directions. "the object is rendered using diffuse shading."
  • Diffusion-based generative model: A model that learns to generate data via iterative denoising. "a diffusion-based generative model"
  • Direct integrator: A renderer integrator that accounts only for direct lighting, excluding global illumination. "by using a direct integrator"
  • DINO: A self-supervised vision transformer trained with knowledge distillation. "DINO~\cite{caron2021emerging}"
  • DINOv2: A strong self-supervised image featurizer capturing robust visual structure. "DINOv2 consistently delivered superior performance across symmetry categories"
  • DreamSim: An embedding model for perceptual similarity between images. "DreamSim~\cite{fu2023dreamsim}"
  • Dual polyhedra: Paired polyhedra where vertices of one correspond to faces of the other. "Dual polyhedra are pairs of polyhedra where vertices and faces are interchanged"
  • Edge-transitive: Symmetry property where any edge can be mapped to any other edge. "edge-transitive"
  • Environment emitter: A uniform light source illuminating a scene from all directions. "a constant environment emitter"
  • Euler characteristic: Topological invariant of polyhedra defined as χ = V − E + F. "Euler characteristic formula"
  • Face-transitive: Symmetry property where any face can be mapped to any other face. "face-transitive"
  • Far clipping plane: The maximum renderable distance from the camera. "a far clipping plane of $10^8$"
  • Featurizer: A model producing embeddings used for downstream tasks. "featurizers"
  • Foundation models: Large pretrained models that generalize across tasks and domains. "foundation models effectively detect specific 3D symmetry elements via linear probing"
  • Global illumination: Rendering that accounts for indirect light bounces and interreflections. "We disable global illumination"
  • Google Scanned Objects (GSO): A dataset of scanned real-world 3D objects. "Google Scanned Objects (GSO)~\cite{downs2022google}"
  • Implicit neural representations: Continuous functions parameterized by neural networks to represent geometry or signals. "implicit neural representations"
  • Johnson solids: Ninety-two strictly convex polyhedra with regular faces but non-uniform vertex configurations. "Johnson solids"
  • Kepler–Poinsot solids: Four nonconvex regular star polyhedra. "Kepler-Poinsot solids"
  • Large Reconstruction Model (LRM): A transformer that predicts a 3D radiance field from a single image. "Large Reconstruction Model (LRM)"
  • Linear probe: A single linear classifier trained on fixed embeddings to test encoded information. "effectively performing a linear probe"
  • Low-discrepancy samples: Stratified sampling sequences that reduce variance in Monte Carlo rendering. "1024 low-discrepancy samples per pixel"
  • Masked AutoEncoder (MAE): A self-supervised method reconstructing masked image patches to learn representations. "Masked AutoEncoder (MAE)~\cite{he2022masked}"
  • Mental Rotation Test (MRT): Assessment of spatial reasoning by recognizing rotated objects as identical. "The Mental Rotation Test (MRT), first proposed by Shepard and Metzler"
  • Mitsuba physically-based renderer: A research-oriented renderer for accurate light transport simulation. "Mitsuba physically-based renderer"
  • Monocular 3D reconstruction: Recovering 3D shape from a single image. "Monocular 3D Reconstruction"
  • MVImgNet: A multi-view image dataset for training reconstruction models. "MVImgNet"
  • Near clipping plane: The minimum renderable distance from the camera. "a near clipping plane of $10^{-3}$"
  • Neural radiance field (NeRF): A continuous volumetric representation modeling view-dependent color and density. "predict a neural radiance field from single images"
  • Nonconvex: Shapes containing indentations where some connecting segments lie outside the surface. "nonconvex"
  • NAVI: A dataset of 3D objects for analyzing spatial reasoning. "NAVI~\cite{jampani2023navi}"
  • Objaverse: A large-scale collection of diverse 3D assets for vision research. "Objaverse~\cite{deitke2023objaversee}"
  • Objaverse XL: An expanded successor to Objaverse with more assets. "Objaverse XL~\cite{deitke2023objaverse}"
  • OmniObject3D: A dataset of synthetic and real objects for 3D analysis. "OmniObject3D~\cite{wu2023omniobject3d}"
  • OpenLRM: An open implementation and training of LRM. "OpenLRM"
  • Out-of-distribution data: Inputs that differ significantly from a model’s training distribution. "out-of-distribution data"
  • Perspective camera: A camera model that projects 3D points with perspective foreshortening. "We use a perspective camera"
  • Platonic solids: The five regular convex polyhedra with congruent faces and identical vertices. "Platonic solids"
  • Polar reciprocation: A geometric construction producing duals by reciprocation with respect to a sphere. "polar reciprocation"
  • Radiance: Power per unit area per unit solid angle emitted by a source. "a radiance of (0.3, 0.3, 0.3)"
  • SAM: Sharpness-Aware Minimization used in training vision models. "SAM~\cite{foret2020sharpness} "
  • Scene registration: Estimating spatial alignment between scenes or objects. "scene registration"
  • Self-supervised: Training without labels by leveraging intrinsic data structure. "self-supervised transformer-based methods"
  • Shap-E: A diffusion-based 3D generative model with implicit representations. "Shap-E \cite{jun2023shap}"
  • SigLip: A vision-language model trained with a sigmoid-based contrastive loss. "SigLip~\cite{zhai2023sigmoid}"
  • Stellations: Forms created by extending faces or edges of a polyhedron until they intersect. "stellations"
  • Stable Fast 3D: A transformer-based single-image 3D reconstruction model. "Stable Fast 3D"
  • Symmetry groups: Sets of transformations (e.g., rotations/reflections) leaving an object invariant. "symmetry groups (e.g., tetrahedral, octahedral, icosahedral)"
  • Transformer-based architecture: Neural network design using self-attention mechanisms. "transformer-based architecture"
  • Triplane-based shape representations: 3D encoding via features on three orthogonal planes. "triplane-based shape representations"
  • Uniform non-convex: Nonconvex polyhedra with regular faces and vertex transitivity. "Uniform non-convex"
  • Uniform polyhedron: Polyhedron with regular polygonal faces and identical vertex configurations. "A uniform polyhedron has polygonal faces that are regular polygons and is vertex-transitive"
  • Vertex-transitive: Symmetry property mapping any vertex to any other via a symmetry operation. "vertex-transitive"
  • Vision-language models (VLMs): Models jointly processing images and text for multimodal tasks. "vision-language models (VLMs)"
  • Weighted binary cross-entropy loss: Binary cross-entropy with the positive term scaled by a per-class weight to counter class imbalance. "we employed a weighted binary cross-entropy loss."
  • Wild images: Real-world photographs with uncontrolled conditions. "wild images"
  • Zero-shot classification: Recognizing classes without task-specific training examples. "zero-shot polyhedron classification"

