Spatial Reasoning Benchmarks
- Spatial reasoning benchmarks are systematic evaluations that assess computational agents’ abilities to perceive, represent, and manipulate spatial structures and transformations in multi-dimensional settings.
- They employ rigorous methodologies like procedural generation, expert annotation, and simulation to classify tasks across perceptual, geometric, and planning dimensions.
- Insights from these benchmarks reveal key model limitations, guiding advances in embodied AI, robotics, and scene understanding.
Spatial reasoning benchmarks provide systematic, multi-task evaluations of computational agents’ abilities to perceive, represent, and manipulate spatial relations, structure, transformations, and trajectories in 2D, 3D, and 4D (spatiotemporal) settings. These benchmarks probe distinct facets of spatial intelligence, ranging from low-level perceptual grounding to high-level causal inference and planning, using controlled datasets, diverse task taxonomies, and rigorously defined performance metrics. They are vital for revealing persistent limitations in contemporary large language models (LLMs), vision-language models (VLMs), and multimodal LLMs (MLLMs), and they serve as a basis for progress in embodied AI, robotics, scene understanding, and agentic systems.
1. Taxonomies of Spatial Reasoning Abilities
Benchmarks have formalized several taxonomic frameworks that partition spatial cognition either hierarchically or into quadrants.
- Hierarchical Cognitive Levels
  - Observation: Object enumeration, attribute extraction (Xu et al., 26 Nov 2025).
  - Topology & Relations: Adjacency, containment, relative position, and temporal ordering (Xu et al., 26 Nov 2025, Stogiannidis et al., 25 Mar 2025).
  - Symbolic Reasoning: Mapping spatial cues to abstract rules, multi-hop inference (Xu et al., 26 Nov 2025).
  - Causality: Predicting outcomes under hypothetical movement or interaction (Xu et al., 26 Nov 2025).
  - Planning: Synthesizing action sequences to achieve spatial goals (Xu et al., 26 Nov 2025, Gao et al., 8 Jun 2026).
- DISE Quadrants (Spatial-DISE)
  - Intrinsic-Static: Reasoning over internal object structure (e.g., which face is painted blue).
  - Intrinsic-Dynamic: Predicting effects of intra-object transformations (folding, rotating).
  - Extrinsic-Static: External relations among objects (projection, view-based correspondence).
  - Extrinsic-Dynamic: Relations among multiple objects that change under transformation (assembly, multi-step manipulation) (Huang et al., 15 Oct 2025).
Benchmarks such as SpatialBench (Xu et al., 26 Nov 2025), Spatial-DISE (Huang et al., 15 Oct 2025), and GamiBench (Spencer et al., 22 Dec 2025) explicitly operationalize these taxonomies in their dataset construction and task design, facilitating both fine-grained skill attribution and unified capability metrics.
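To make the idea of fine-grained skill attribution concrete, the following is a minimal sketch of how items tagged along the two DISE axes could be grouped for per-quadrant scoring. The `TaskTag` class, the task names, and the toy results are hypothetical placeholders for illustration; they are not the schema or the results of any released benchmark.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    INTRINSIC = "intrinsic"   # reasoning about a single object's own structure
    EXTRINSIC = "extrinsic"   # reasoning about relations among objects or views


class Motion(Enum):
    STATIC = "static"         # no transformation is applied
    DYNAMIC = "dynamic"       # a transformation (fold, rotate, assemble) is applied


@dataclass(frozen=True)
class TaskTag:
    """Hypothetical per-item tag; not the schema of any released dataset."""
    name: str
    scope: Scope
    motion: Motion

    @property
    def quadrant(self) -> str:
        return f"{self.scope.value}-{self.motion.value}"


def accuracy_by_quadrant(results):
    """results: iterable of (TaskTag, bool) pairs -> {quadrant: accuracy}."""
    totals, hits = {}, {}
    for tag, correct in results:
        q = tag.quadrant
        totals[q] = totals.get(q, 0) + 1
        hits[q] = hits.get(q, 0) + int(correct)
    return {q: hits[q] / totals[q] for q in totals}


# Toy, made-up outcomes used only to exercise the grouping logic.
results = [
    (TaskTag("painted_face", Scope.INTRINSIC, Motion.STATIC), True),
    (TaskTag("paper_folding", Scope.INTRINSIC, Motion.DYNAMIC), False),
    (TaskTag("view_matching", Scope.EXTRINSIC, Motion.STATIC), True),
    (TaskTag("assembly", Scope.EXTRINSIC, Motion.DYNAMIC), False),
    (TaskTag("mental_rotation", Scope.INTRINSIC, Motion.DYNAMIC), True),
]
per_quadrant = accuracy_by_quadrant(results)
```

Keeping the two axes as separate enums, rather than a single four-valued label, lets the same records be aggregated by scope alone or by motion alone if that is the comparison of interest.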
2. Benchmark Construction Methodologies
Spatial reasoning benchmarks employ a range of rigorous methodologies optimized for reproducibility, coverage, and diagnostic power.
- Automated Procedural Generation: Synthetic scene generation with controlled object placement, geometry, and distractor crafting, using engines like Blender (Spatial457 (Wang et al., 12 Feb 2025), Spatial-DISE (Huang et al., 15 Oct 2025), SIRI-Bench (Song et al., 17 Jun 2025)).
- Expert-Annotated Reasoning Chains: Canonical solution steps annotated for logical dependency tracking, supporting process-level evaluation (DynaSolidGeo (Wu et al., 25 Oct 2025)).
- Human-Validated Naturalistic Data: Real images or videos from datasets such as LSUN, COCO, GQA, AI2-THOR, ScanNet, and field data, annotated for reference frames and hierarchical phenomena (SpatialText (Jiang et al., 3 Mar 2026), SpatialWorld (Gao et al., 8 Jun 2026), CityCube (Xu et al., 20 Jan 2026)).
- Simulation Environments: Integration of multiple backends and agent interfaces to probe interactive, sequential, or agentic tasks under partial observability (SpatialWorld (Gao et al., 8 Jun 2026), EvoEmpirBench (Zhao et al., 16 Sep 2025)).
These methods enable large-scale, balanced, and robust datasets supporting both passive and interactive task formats. Controlled distractor generation and multi-view or multi-step setups are standard to preclude superficial pattern matching.
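As a rough illustration of procedural generation with controlled distractors, the sketch below samples a toy 2D scene, derives the ground-truth relation from coordinates, and shuffles the answer options so that option position carries no signal. It is a deliberately simplified stand-in under stated assumptions (a 10×10 grid, three hypothetical object names, a dominant-axis relation rule); the benchmarks cited above render 3D scenes with engines such as Blender and use their own generation pipelines.

```python
import random

RELATIONS = ("left of", "right of", "above", "below")


def relation(a, b):
    """Dominant-axis relation of point a with respect to point b.

    Image convention: x grows rightward, y grows downward.
    """
    dx, dy = a[0] - b[0], a[1] - b[1]
    if abs(dx) >= abs(dy):
        return "left of" if dx < 0 else "right of"
    return "above" if dy < 0 else "below"


def generate_item(rng, grid=10, names=("cube", "sphere", "cone")):
    """Sample a toy scene and one multiple-choice relation question."""
    all_cells = [(x, y) for x in range(grid) for y in range(grid)]
    cells = rng.sample(all_cells, len(names))   # distinct cells, no overlap
    scene = dict(zip(names, cells))
    subject, reference = rng.sample(list(names), 2)
    answer = relation(scene[subject], scene[reference])
    options = list(RELATIONS)   # the three wrong relations act as distractors
    rng.shuffle(options)        # shuffle so answer position carries no signal
    return {
        "scene": scene,
        "subject": subject,
        "reference": reference,
        "question": f"Where is the {subject} relative to the {reference}?",
        "options": options,
        "answer_index": options.index(answer),
    }


rng = random.Random(0)  # a fixed seed makes the generated set reproducible
dataset = [generate_item(rng) for _ in range(200)]
```

Because the label is computed from the sampled coordinates rather than annotated by hand, every item is correct by construction, and regenerating with the same seed yields the same set.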
3. Task Typologies and Diagnostic Sub-abilities
Benchmarks span a broad set of sub-skills, each isolating one component of spatial competence:
- Perceptual and Relational Tasks: Primitive object identification, 2D/3D localization, spatial relation extraction (GSR-Bench (Rajabi et al., 2024), Spatial457 (Wang et al., 12 Feb 2025), Spatial Reasoning in Foundation Models (Mirjalili et al., 26 Sep 2025)).
- Geometric Manipulation: Mental rotation, spatial visualization, cross-sectional inference, origami folding, and shape reconstruction (SpatialViz-Bench (Wang et al., 10 Jul 2025), SpinBench (Zhang et al., 29 Sep 2025), GamiBench (Spencer et al., 22 Dec 2025)).
- Physical Constraints and Dynamics: Reasoning under extrinsic and intrinsic-dynamic transformations; enforcing occlusion, support, and contact constraints (SSI-Bench (Yang et al., 8 Feb 2026), DynaSolidGeo (Wu et al., 25 Oct 2025)).
- Sequential and Agentic Planning: Multi-step pathfinding, navigation, and active exploration under partial observability (SpatialWorld (Gao et al., 8 Jun 2026), GRASP (Tang et al., 2024), EvoEmpirBench (Zhao et al., 16 Sep 2025)).
- 4D Spatiotemporal Cognition: Memory, action recognition, state-change detection, and prediction in video (Spatial4D-Bench (Wang et al., 31 Dec 2025)).
- Pure-Text Spatial Reasoning: Mental modeling from text alone, which separates spatial reasoning from visual pattern matching (SpatialText (Jiang et al., 3 Mar 2026)).
Several benchmarks define specialized metrics beyond raw accuracy, e.g., Viewpoint Consistency (VC) and Impossible Fold Selection Rate (IFSR) in GamiBench (Spencer et al., 22 Dec 2025), Relative Performance Dropping Rate (RPDR) in Spatial457 (Wang et al., 12 Feb 2025), and process-qualified accuracy in DynaSolidGeo (Wu et al., 25 Oct 2025).
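The sketch below shows the general shape that metrics of this kind can take. The three definitions are assumptions chosen for illustration (a relative accuracy drop between difficulty levels, agreement of answers across viewpoints, and accuracy gated on a reasoning-chain score with a hypothetical threshold of 0.75). They are not the formulas published for RPDR, VC, or process-qualified accuracy, so the original papers should be consulted for the exact definitions.

```python
def relative_drop(acc_before, acc_after):
    """Fraction of accuracy lost when moving to a harder level (assumed form)."""
    if acc_before <= 0:
        raise ValueError("acc_before must be positive")
    return (acc_before - acc_after) / acc_before


def view_consistency(answers_by_item):
    """Share of items whose answers agree across every rendered viewpoint.

    answers_by_item maps an item id to the list of answers, one per view.
    """
    if not answers_by_item:
        return 0.0
    agree = sum(1 for answers in answers_by_item.values() if len(set(answers)) == 1)
    return agree / len(answers_by_item)


def process_qualified_accuracy(records, min_process_score=0.75):
    """Count an item only if the answer is right AND the reasoning chain passes.

    records: list of (answer_correct: bool, process_score: float in [0, 1]).
    The 0.75 threshold is a hypothetical default, not a published value.
    """
    if not records:
        return 0.0
    passed = sum(1 for ok, score in records if ok and score >= min_process_score)
    return passed / len(records)


# Made-up inputs used only to exercise the functions.
drop = relative_drop(0.80, 0.60)
vc = view_consistency({"fold_01": ["B", "B", "B"], "fold_02": ["A", "C", "A"]})
pqa = process_qualified_accuracy([(True, 0.9), (True, 0.4), (False, 0.8), (True, 0.75)])
```

All three penalize behavior that raw accuracy hides: a steep drop between levels, answers that flip with the camera, and correct answers reached through an invalid chain.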
4. Empirical Findings and Performance Stratification
Systematic evaluation across dozens of state-of-the-art open-source and proprietary models reveals consistent trends and bottlenecks:
| Benchmark | Human Best | Top Model | Open Model | Random | Notable Failure Modes |
|---|---|---|---|---|---|
| SpatialBench | 96.4% overall | Gemini-2.5-p | Qwen3-VL-235B | n/a | Symbolic (L3), Causal (L4), Planning (L5) |
| GSR-Bench | >90% Subset A | LLaVA-NeXT | Qwen1.5-110B | ~25% | Behind/in-front relation, small objects |
| CityCube | 88.3% | Doubao-1.6 | GLM-4.1V-9B | 22.8% | Cross-view, scale, egocentric rotation |
| SpinBench | 91.2% | InternVL3-38 | InternVL-14B | n/a | Mental/persp. rotation, viewpoint change |
| SpatialViz-Bench | ~95% | Gemini-2.5-p | Llama-4-Scout | 25–27% | 3D folding, animation, formulaic bias |
| DynaSolidGeo | n/a | GPT-5 | Qwen3-VL-30B | n/a | Visual perception, logic, hallucination |
| Spatial457 | n/a | GPT-4o | InternVL2 8B | n/a | 3D pose, depth, collision (6D) |
| Spatial-DISE | 76.8% | Doubao1.5VL | InternVL-3 | 25% | Multi-step, multi-view reasoning |
| EarthSpatialBench | F1≈ 0.91 (Within) | Gemini-2.5-p | Qwen3-VL-T-30 | n/a | Visual grounding, composite geometry |
| SIRI-Bench | ~70% (<60% error) | Doubao-1.5-p | Qwen2.5-VL-72 | n/a | Parameter extraction from video |
| SpatialWorld | n/a | GPT-5 (17.4%) | Qwen-3.5 (14.1%) | n/a | Partial observability, long-horizon plan |
| EvoEmpirBench | n/a | - | - | n/a | Local memory, dynamic state update |
Across all benchmarks, performance decays markedly as tasks move from static, single-image perception (object detection, simple relations) to high-dimensional, dynamic, multi-step, and cross-perspective reasoning (mental rotation, planning, causal inference, spatiotemporal prediction). Even top proprietary models routinely trail human accuracy by 20–50 percentage points on composite tasks (Xu et al., 26 Nov 2025, Spencer et al., 22 Dec 2025, Xu et al., 20 Jan 2026, Zhao et al., 16 Sep 2025).
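When reading such gaps, it can help to account for the random baseline, since a score on a four-way multiple-choice task is not directly comparable to one on a task with a different chance level. The sketch below applies the standard chance-correction rescaling and a percentage-point gap. The human and random figures are taken from the CityCube row of the table above; the model accuracy of 0.50 is a hypothetical placeholder, not a reported result.

```python
def chance_corrected(accuracy, chance):
    """Rescale accuracy so that chance maps to 0 and a perfect score maps to 1."""
    if not 0.0 <= chance < 1.0:
        raise ValueError("chance must lie in [0, 1)")
    return (accuracy - chance) / (1.0 - chance)


def gap_to_human(model_acc, human_acc):
    """Human-minus-model gap in percentage points."""
    return (human_acc - model_acc) * 100.0


HUMAN_ACC = 0.883    # CityCube human accuracy, from the table above
RANDOM_ACC = 0.228   # CityCube random baseline, from the table above
MODEL_ACC = 0.50     # hypothetical model accuracy, for illustration only

gap_pp = gap_to_human(MODEL_ACC, HUMAN_ACC)
human_norm = chance_corrected(HUMAN_ACC, RANDOM_ACC)
model_norm = chance_corrected(MODEL_ACC, RANDOM_ACC)
```

A chance-corrected score near zero indicates near-random behavior regardless of how many answer options the task offers, which makes scores from benchmarks with different option counts easier to compare.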
5. Failure Modes and Diagnostic Insights
Benchmarks systematically expose persistent model deficiencies, which cluster into:
- Egocentric/Reference Frame Bias: Overcommitment to observer-centric frames; failures under allocentric or perspective-shifted queries (Zhang et al., 29 Sep 2025, Jiang et al., 3 Mar 2026).
- Rotational and Mental Simulation Gaps: Chance-level performance on dynamic/mental rotation, folding, cross-sectional reasoning, animation, and long-range transformations (Zhang et al., 29 Sep 2025, Spencer et al., 22 Dec 2025, Wang et al., 10 Jul 2025).
- Visual Plausibility Bias & Shortcutting: Preference for coherent yet incorrect visual patterns; overreliance on surface cues and object priors (Spencer et al., 22 Dec 2025, Xu et al., 20 Jan 2026).
- Degraded Planning under Partial Observability: Inefficient exploration and lack of recovery from errors in agentic tasks; trial-and-error search rather than strategic execution (Gao et al., 8 Jun 2026, Tang et al., 2024, Zhao et al., 16 Sep 2025).
- Combinatorial and Dynamic Complexity Collapse: Near-random performance on multi-view, multi-step, and sim-to-real generalization tasks involving several spatial factors or real-world grounded transformations (Wang et al., 12 Feb 2025, Huang et al., 15 Oct 2025, Wu et al., 25 Oct 2025, Xu et al., 20 Jan 2026).
- Hallucination and Logical Inconsistency: In process-evaluation setups, frequent invalid inferences, unjustified leaps, or contradictions in reasoning chains (Wu et al., 25 Oct 2025, Jiang et al., 3 Mar 2026).
Scaling model size, instruction tuning, and chain-of-thought prompting yield only modest improvements for high-complexity tasks; numeric gains are predominantly in perception and simple relational categories (Xu et al., 26 Nov 2025, Spencer et al., 22 Dec 2025, Stogiannidis et al., 25 Mar 2025).
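The reference-frame failure mode listed first can be made concrete with a small geometric example: the same scene yields opposite egocentric answers when the observer turns around, while the allocentric (world-frame) relation is unchanged. The function names and the coordinate convention (x east, y north, heading measured counterclockwise from east) are assumptions of this sketch, not part of any cited benchmark.

```python
import math


def egocentric_side(observer_xy, heading_deg, target_xy):
    """Return whether the target lies to the observer's left or right.

    World frame: x points east, y points north; heading is measured
    counterclockwise from the +x axis.
    """
    hx = math.cos(math.radians(heading_deg))
    hy = math.sin(math.radians(heading_deg))
    tx = target_xy[0] - observer_xy[0]
    ty = target_xy[1] - observer_xy[1]
    cross = hx * ty - hy * tx   # > 0: target is counterclockwise of heading (left)
    if abs(cross) < 1e-9:
        return "ahead/behind"
    return "left" if cross > 0 else "right"


def allocentric_relation(a_xy, b_xy):
    """Compass relation of a with respect to b; independent of any observer."""
    dx, dy = a_xy[0] - b_xy[0], a_xy[1] - b_xy[1]
    if abs(dx) >= abs(dy):
        return "east of" if dx > 0 else "west of"
    return "north of" if dy > 0 else "south of"


observer = (0.0, 0.0)
lamp = (0.0, 1.0)  # one unit north of the observer

facing_east = egocentric_side(observer, 0.0, lamp)
facing_west = egocentric_side(observer, 180.0, lamp)
world_frame = allocentric_relation(lamp, observer)
```

A model that always answers from a single default viewpoint would get one of the two egocentric queries wrong, which is the kind of error that perspective-shifted test items are designed to surface.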
6. Directions for Benchmark and Model Innovation
Identified gaps motivate several directions:
- Explicit Geometric and Physics Modules: Integration of 3D pose/orientation estimators, physics engines, and spatial graph architectures (Wang et al., 12 Feb 2025, Wu et al., 25 Oct 2025, Yang et al., 8 Feb 2026).
- Programmatic, Multi-modal Testing: Composition of numeric/geometric input (bboxes, polylines) with raster and natural-language contexts for precise, multifaceted evaluation (Xu et al., 17 Feb 2026, Xu et al., 26 Nov 2025).
- Process-oriented Metrics: Judgement of reasoning chains, process-tracing, and actionable sequence synthesis rather than only answer accuracy (Wu et al., 25 Oct 2025, Anand et al., 23 Dec 2025).
- Interactive, Task-driven Evaluation: Shift from passive VQA paradigms to agentic, partially observable, iterative-action tasks coupled to real-world simulators (Gao et al., 8 Jun 2026, Zhao et al., 16 Sep 2025).
- Sim-to-Real and Multi-environment Generalization: Extension of task variety, environment complexity, and observability structure to stress sim-to-real robustness (Gao et al., 8 Jun 2026, Huang et al., 15 Oct 2025).
- Cognitive and Processual Grounding: Taxonomies and constructions explicitly grounded in human cognitive science and psychometric frameworks (Xu et al., 26 Nov 2025, Wang et al., 10 Jul 2025, Jiang et al., 3 Mar 2026).
Collectively, these directions aim for models that can “think in space”: models that do not merely recognize spatial features but robustly simulate, plan, and reason about geometric, topological, and physical constraints across modalities, views, and horizons.
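To illustrate what interactive, partially observable evaluation involves in practice, the sketch below runs two scripted policies in a toy grid world and reports a task success rate and a path-efficiency score. Everything here is an assumption made for illustration: the environment, the observation format, the efficiency definition (optimal steps divided by steps taken), and both policies are hypothetical and do not reproduce the interface or metrics of SpatialWorld or EvoEmpirBench.

```python
import random

MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}


def run_episode(policy, size=5, max_steps=40):
    """Roll out one episode on an empty size x size grid.

    The agent starts at (0, 0) and the goal sits at the opposite corner. Each
    observation exposes only the agent's position and which moves are blocked
    by walls, never the goal location (a toy form of partial observability).
    Returns (success, steps_taken).
    """
    pos, goal = (0, 0), (size - 1, size - 1)
    for step in range(1, max_steps + 1):
        blocked = [m for m, (dx, dy) in MOVES.items()
                   if not (0 <= pos[0] + dx < size and 0 <= pos[1] + dy < size)]
        action = policy({"position": pos, "blocked": blocked})
        dx, dy = MOVES[action]
        nxt = (pos[0] + dx, pos[1] + dy)
        if 0 <= nxt[0] < size and 0 <= nxt[1] < size:
            pos = nxt  # a move into a wall leaves the agent in place
        if pos == goal:
            return True, step
    return False, max_steps


def evaluate(policy, episodes=50, size=5, max_steps=40):
    """Return (task success rate, mean path efficiency over successful runs)."""
    optimal = 2 * (size - 1)  # Manhattan distance between opposite corners
    outcomes = [run_episode(policy, size, max_steps) for _ in range(episodes)]
    successes = [steps for ok, steps in outcomes if ok]
    tsr = len(successes) / episodes
    efficiency = sum(optimal / s for s in successes) / len(successes) if successes else 0.0
    return tsr, efficiency


def wall_following_policy(obs):
    """Strategic baseline: go right until blocked, then go down."""
    return "down" if "right" in obs["blocked"] else "right"


_rng = random.Random(0)


def random_policy(obs):
    """Trial-and-error baseline: pick any move uniformly at random."""
    return _rng.choice(list(MOVES))


strategic = evaluate(wall_following_policy)
trial_and_error = evaluate(random_policy)
```

Reporting efficiency alongside success separates agents that reach the goal by systematic exploration from those that only reach it through long trial-and-error trajectories, a distinction that answer accuracy alone cannot express.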
7. Representative Benchmarks: Summary Table
| Benchmark | Modalities | Categories/Taxonomy | Key Diagnostic Features | Reference |
|---|---|---|---|---|
| SpatialBench | Multi-modal | 5-level cognitive | Unified metric, 15 tasks, L₅ planning | (Xu et al., 26 Nov 2025) |
| GamiBench | Visual | 2D→3D planning | Origami: cross-view, physical feasibility | (Spencer et al., 22 Dec 2025) |
| GSR-Bench | Visual | Relations, grounding | CircularEval, mask/depth, scaling laws | (Rajabi et al., 2024) |
| Spatial457 | Visual | 6D spatial | Level-by-level, unbiased attribute, RPDR | (Wang et al., 12 Feb 2025) |
| SpinBench | Visual | Perspective/rotation | Egocentric/allocentric, 51 subtypes | (Zhang et al., 29 Sep 2025) |
| SpatialViz-Bench | Visual | 4 visualization skills | 12 tasks: rotation, folding, animation | (Wang et al., 10 Jul 2025) |
| DynaSolidGeo | Multimodal | Solid geometry | Dynamic instance gen, process evaluation | (Wu et al., 25 Oct 2025) |
| EarthSpatialBench | Geo-visual | Distance, topology | Polygons, polylines, quantitative tasks | (Xu et al., 17 Feb 2026) |
| CityCube | Visual | Cross-view | Urban, rotation/orbit, 5 cognitive dims | (Xu et al., 20 Jan 2026) |
| SIRI-Bench | Video | 3D math, perception | Video-based, multi-step, automatic gen | (Song et al., 17 Jun 2025) |
| Spatial-DISE | Visual | DISE quadrants | Multi-view, multi-step reasoning | (Huang et al., 15 Oct 2025) |
| SpatialWorld | Interactive | POMDP, planning | Text-action, 8 env backends, TSR/effic. | (Gao et al., 8 Jun 2026) |
| EvoEmpirBench | Agent | Dynamic, experience | Long-horizon, local obs, experience mem | (Zhao et al., 16 Sep 2025) |
| SpatialText | Text-only | 5-level, dual source | Human+synthetic, mental modeling | (Jiang et al., 3 Mar 2026) |
| SSI-Bench | Visual | Constrained manifold | Real 3D structures, ranking, physics | (Yang et al., 8 Feb 2026) |
| Spatial4D-Bench | Video+img | 6 cognitive domains | ~40k QA, spatiotemporal, physical law | (Wang et al., 31 Dec 2025) |
These benchmarks collectively represent the state of the art in evaluating and dissecting spatial reasoning in AI systems, and they inform the specification of next-generation, spatially aware models and agentic architectures.