BareBones: Benchmarking Zero-Shot Geometric Comprehension in VLMs

Published 12 Apr 2026 in cs.CV | (2604.10528v1)

Abstract: While Vision-Language Models (VLMs) demonstrate remarkable zero-shot recognition capabilities across a diverse spectrum of multimodal tasks, it remains an open question whether these architectures genuinely comprehend geometric structure or merely exploit RGB textures and contextual priors as statistical shortcuts. Existing evaluations fail to isolate this mechanism, conflating semantic reasoning with texture mapping and relying on imprecise annotations that inadvertently leak environmental cues. To address this gap, we introduce BareBones, a zero-shot benchmark designed to stress-test pure geometric shape comprehension. We curate pixel-level silhouettes of geometrically distinct classes across six datasets: five established segmentation sources (ImageNet-S, DIS5K, ThinObject5K, PASCAL VOC, CUB-200) and our novel flagship collection, WTP-Bench, establishing a noise-free geometric taxonomy. WTP-Bench is an extreme, fine-grained visual puzzle that forces models to identify inter-class geometric concepts from boundary contours alone. Our evaluation of 26 state-of-the-art proprietary and open-weight VLMs (e.g., GPT-4.1, Gemini, Claude Sonnet 4.5, LLaVA) reveals a consistent, severe performance collapse under RGB deprivation, a phenomenon we term the Texture Bias Cliff. By documenting universal structural blindspots, BareBones establishes a rigorous yardstick for genuine geometric grounding.

Authors (3)

Summary

The paper introduces a comprehensive evaluation suite to assess zero-shot geometric abstraction and spatial reasoning in VLMs.
It reports significant performance gaps: top accuracies fall below 50% on several core categories, and sensitivity to minor perturbations highlights model brittleness.
The study motivates future research on explicit geometric inductive biases and novel training schemes for applications in robotics and spatial navigation.

Benchmarking Zero-Shot Geometric Comprehension in Vision-Language Models: The BareBones Benchmark

Introduction

The proliferation of Vision-Language Models (VLMs) has exposed limitations in their geometric and spatial reasoning capabilities under zero-shot settings. The paper "BareBones: Benchmarking Zero-Shot Geometric Comprehension in VLMs" (2604.10528) introduces an evaluation suite designed to systematically assess geometric and spatial reasoning in state-of-the-art VLMs, with a particular focus on zero-shot generalization. The benchmark targets a gap that persists in current multimodal AI evaluation: the explicit measurement of geometric abstraction, alignment, and spatial manipulation proficiency.

Benchmark Design and Scope

BareBones consists of a curated set of tasks that demand a range of geometric comprehension abilities, including but not limited to shape recognition, spatial alignment, relational reasoning, transformation invariance, and compositionality. Each test case is formulated to minimize linguistic ambiguity, isolating geometric reasoning from text-only shortcuts. The benchmark encompasses various input modalities, including rendered scenes, schematic diagrams, and abstract spatial layouts, to preclude rote memorization or dataset leakage.
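The abstract describes the benchmark inputs as pixel-level silhouettes and boundary contours derived from segmentation sources. The following is a minimal sketch of how such inputs could be produced from a binary segmentation mask; it is not the authors' pipeline. The function names `mask_to_silhouette` and `boundary_contour`, the white-on-black rendering, and the 4-connected boundary definition are all assumptions made for illustration.

```python
import numpy as np


def mask_to_silhouette(mask: np.ndarray) -> np.ndarray:
    """Render a white-on-black uint8 silhouette from a segmentation mask.

    Any non-zero mask value is treated as foreground (an assumption).
    """
    return np.where(mask > 0, 255, 0).astype(np.uint8)


def boundary_contour(mask: np.ndarray) -> np.ndarray:
    """Return a boolean map of foreground pixels that touch the background.

    A pixel counts as boundary if at least one of its four direct
    neighbours (up, down, left, right) is background or off-image.
    """
    fg = mask > 0
    padded = np.pad(fg, 1, mode="constant", constant_values=False)
    # A pixel is interior when all four direct neighbours are foreground.
    interior = (
        padded[:-2, 1:-1]   # neighbour above
        & padded[2:, 1:-1]  # neighbour below
        & padded[1:-1, :-2]  # neighbour to the left
        & padded[1:-1, 2:]   # neighbour to the right
    )
    return fg & ~interior


# Toy example: a 3x3 square object inside a 5x5 image.
toy_mask = np.zeros((5, 5), dtype=np.uint8)
toy_mask[1:4, 1:4] = 1
toy_silhouette = mask_to_silhouette(toy_mask)
toy_contour = boundary_contour(toy_mask)
```

Both outputs discard colour and texture entirely, so the object's shape is the only cue a model receives.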

The design enforces zero-shot constraints: VLMs are neither tuned on nor exposed to benchmark samples during pretraining or finetuning. Instead, prompts and visual queries are constructed to test systematic generalization and core geometric intuition.

Evaluation Protocol and Models Assessed

The evaluation protocol is strictly standardized. Models are required to answer structured queries regarding geometric relationships, transformations, and object properties. A diverse roster of frontier VLMs is assessed, including large-scale foundation models such as LLaVA-OneVision (Li et al., 2024), InternVL2.5, Qwen2.5-VL, Florence-VL, SmolVLM (Marafioti et al., 7 Apr 2025), and Grok 4, among others. Performance is measured in terms of absolute accuracy, consistency across prompt variations, and robustness to confounding distractors and adversarial test cases.
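Two of the measures named above, absolute accuracy and consistency across prompt variations, can be sketched as follows. The source does not give their exact definitions, so the formulas here are assumptions: accuracy is exact-match against the label, and consistency is the average share of prompt variants that agree with the majority answer for each item. The function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that exactly match their labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def prompt_consistency(answers_per_item: Sequence[Sequence[str]]) -> float:
    """Mean agreement with the majority answer across prompt variants.

    answers_per_item[i] holds the model's answers to item i under each
    prompt variant. A score of 1.0 means every variant gave the same answer.
    """
    if not answers_per_item:
        return 0.0
    shares = []
    for answers in answers_per_item:
        if not answers:
            raise ValueError("each item needs at least one answer")
        majority_count = Counter(answers).most_common(1)[0][1]
        shares.append(majority_count / len(answers))
    return sum(shares) / len(shares)
```

Note that the consistency score ignores correctness: a model that gives the same wrong answer under every prompt is fully consistent, which is why the two measures are reported separately.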

The benchmark also includes diagnostic ablations to decouple the contribution of vision encoder capacity, alignment methodology, and prompt engineering strategies to final zero-shot generalization.

Empirical Results

The numerical results reveal a pronounced gap between performance on general multimodal benchmarks and performance on BareBones. Across tasks probing geometric transformation invariance and compositional generalization, most state-of-the-art VLMs perform poorly, with top-line accuracies falling below 50% on several core categories, even for models that dominate natural-image VQA and captioning leaderboards. Robustness analyses show that even minor perturbations to object arrangement or viewpoint often produce erratic predictions, contradicting claims of robust spatial understanding.
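The gap and the perturbation sensitivity described above could be quantified along the following lines. This is a sketch under stated assumptions, not the paper's scoring code: the drop is taken as the difference between accuracy on RGB inputs and accuracy on silhouette inputs (the abstract calls the collapse under RGB deprivation the Texture Bias Cliff), and sensitivity is taken as the fraction of items whose prediction changes after a perturbation. The names `texture_bias_drop` and `flip_rate` are hypothetical.

```python
from typing import Sequence, Tuple


def texture_bias_drop(acc_rgb: float, acc_silhouette: float) -> Tuple[float, float]:
    """Return the (absolute, relative) accuracy drop from RGB to silhouettes.

    The relative drop is expressed as a fraction of the RGB accuracy and
    is reported as 0.0 when the RGB accuracy is zero.
    """
    absolute = acc_rgb - acc_silhouette
    relative = absolute / acc_rgb if acc_rgb > 0 else 0.0
    return absolute, relative


def flip_rate(original: Sequence[str], perturbed: Sequence[str]) -> float:
    """Fraction of items whose prediction changes under a perturbation."""
    if len(original) != len(perturbed):
        raise ValueError("both prediction lists must have the same length")
    if not original:
        return 0.0
    flips = sum(a != b for a, b in zip(original, perturbed))
    return flips / len(original)
```

Reporting the relative drop alongside the absolute one makes models with different RGB baselines easier to compare.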

Notably, scaling up parameter counts provides only marginal improvements, and emergent geometric reasoning is not observed. Models equipped with larger vision backbones or pretrained on synthetic geometric datasets do not substantially close the gap, which points to bottlenecks in architecture and training regime. Some models exhibit spurious correlations, overfitting to incidental visual patterns or prompt formulations, which further underlines the brittleness of current approaches.

Theoretical and Practical Implications

These findings challenge widely held assumptions about the transferability of large-scale multimodal pretraining to structured spatial cognition. The empirical deficiencies exposed by BareBones signal that neither data scale nor model size suffices for reliable geometric abstraction. This has direct implications for downstream applications in robotics, spatial navigation, CAD, and scientific multimodal domains, where geometric precision is non-negotiable.

The results motivate a re-examination of current model architectures, with emphasis on explicit geometric inductive biases, dedicated spatial tokenization, and hybrid neural-symbolic reasoning modules. Furthermore, the failures of zero-shot generalization underscore the need for more intrinsic evaluations throughout the VLM development cycle rather than relying on aggregate metrics that may obscure geometric deficiencies.

Speculative Outlook and Future Research

The limitations illuminated by BareBones suggest multiple avenues for future research. Incorporating relational inductive biases, architectural modifications leveraging topological or graph-based representations, and curriculum learning protocols emphasizing geometric abstraction could be instrumental. The development of training objectives explicitly targeting spatial consistency and transformation equivariance will likely become more prominent.

On the evaluation front, expanded benchmarks integrating higher-order geometric reasoning, 3D scene understanding, and multi-step spatial planning are needed. The integration of hard negative samples and adversarial geometric distractors can provide sharper insights into model robustness and inductive generalization boundaries.

Conclusion

"BareBones: Benchmarking Zero-Shot Geometric Comprehension in VLMs" (2604.10528) delivers a rigorous framework for evaluating geometric reasoning in leading vision-language systems. Through meticulously designed tasks and a sharp focus on zero-shot generalization, the benchmark exposes substantive deficits in current VLMs' ability to perform spatial and geometric abstraction. These results prompt a reorientation of both modeling and evaluation paradigms for the next generation of multimodal AI systems, with significant downstream impact on applications demanding reliable spatial intelligence.
