Spatial Reasoning Benchmarks

Updated 26 June 2026

Spatial reasoning benchmarks are systematic evaluations that assess computational agents’ abilities to perceive, represent, and manipulate spatial structures and transformations in multi-dimensional settings.
They employ rigorous methodologies like procedural generation, expert annotation, and simulation to classify tasks across perceptual, geometric, and planning dimensions.
Insights from these benchmarks reveal key model limitations, guiding advances in embodied AI, robotics, and scene understanding.

Spatial reasoning benchmarks provide systematic, multi-task evaluations of computational agents’ abilities to perceive, represent, and manipulate spatial relations, structure, transformations, and trajectories in 2D, 3D, and 4D settings. They probe distinct facets of spatial intelligence, from low-level perceptual grounding to high-level causal inference and planning, using controlled datasets, diverse task taxonomies, and rigorously defined performance metrics. They reveal persistent limitations in contemporary large language models (LLMs), vision-language models (VLMs), and multimodal LLMs (MLLMs), and they serve as a basis for progress in embodied AI, robotics, scene understanding, and agentic systems.

1. Taxonomies of Spatial Reasoning Abilities

Benchmarks have formalized several taxonomies that partition spatial cognition either into hierarchical levels or into quadrants.

Hierarchical Cognitive Levels
- Observation: Object enumeration, attribute extraction (Xu et al., 26 Nov 2025).
- Topology & Relations: Adjacency, containment, relative position, and temporal ordering (Xu et al., 26 Nov 2025, Stogiannidis et al., 25 Mar 2025).
- Symbolic Reasoning: Mapping spatial cues to abstract rules, multi-hop inference (Xu et al., 26 Nov 2025).
- Causality: Predicting outcomes under hypothetical movement or interaction (Xu et al., 26 Nov 2025).
- Planning: Synthesizing sequences to achieve spatial goals (Xu et al., 26 Nov 2025, Gao et al., 8 Jun 2026).

DISE Quadrants (Spatial-DISE)
- Intrinsic-Static: Reasoning over internal object structure (e.g., which face painted blue).
- Intrinsic-Dynamic: Predicting effects of intra-object transformations (folding, rotating).
- Extrinsic-Static: External relations among objects (projection, view-based correspondence).
- Extrinsic-Dynamic: Multi-object, transformation-changing relations (assembly, multi-step manipulation) (Huang et al., 15 Oct 2025).

Benchmarks such as SpatialBench (Xu et al., 26 Nov 2025), Spatial-DISE (Huang et al., 15 Oct 2025), and GamiBench (Spencer et al., 22 Dec 2025) explicitly operationalize these taxonomies in their dataset construction and task design, facilitating both fine-grained skill attribution and unified capability metrics.
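
As a minimal sketch of how a taxonomy supports fine-grained skill attribution, the snippet below tags each graded item with one of the five hierarchical levels and reports accuracy per level. The `Level` enum, the `accuracy_by_level` helper, and the sample records are illustrative assumptions, not the data format of SpatialBench or of any other benchmark cited here.

```python
from collections import defaultdict
from enum import Enum


class Level(Enum):
    """The five hierarchical levels listed above (names follow the text)."""
    OBSERVATION = 1
    TOPOLOGY_RELATIONS = 2
    SYMBOLIC_REASONING = 3
    CAUSALITY = 4
    PLANNING = 5


def accuracy_by_level(records):
    """Map an iterable of (Level, is_correct) pairs to {Level: accuracy}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for level, is_correct in records:
        totals[level] += 1
        hits[level] += int(is_correct)
    return {level: hits[level] / totals[level] for level in totals}


# Hypothetical graded items, for illustration only.
records = [
    (Level.OBSERVATION, True),
    (Level.OBSERVATION, True),
    (Level.CAUSALITY, True),
    (Level.CAUSALITY, False),
    (Level.PLANNING, False),
]
per_level = accuracy_by_level(records)
```

Reporting one number per level, rather than a single pooled accuracy, is what lets a failure be attributed to a specific ability such as causality or planning.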

2. Benchmark Construction Methodologies

Spatial reasoning benchmarks employ a range of rigorous methodologies optimized for reproducibility, coverage, and diagnostic power.

Automated Procedural Generation: Synthetic scene generation with controlled object placement, geometry, and distractor crafting, using engines like Blender (Spatial457 (Wang et al., 12 Feb 2025), Spatial-DISE (Huang et al., 15 Oct 2025), SIRI-Bench (Song et al., 17 Jun 2025)).
Expert-Annotated Reasoning Chains: Canonical solution steps annotated for logical dependency tracking, supporting process-level evaluation (DynaSolidGeo (Wu et al., 25 Oct 2025)).
Human-Validated Naturalistic Data: Real images or videos from datasets such as LSUN, COCO, GQA, AI2-THOR, ScanNet, and field data, annotated for reference frames and hierarchical phenomena (SpatialText (Jiang et al., 3 Mar 2026), SpatialWorld (Gao et al., 8 Jun 2026), CityCube (Xu et al., 20 Jan 2026)).
Simulation Environments: Integration of multiple backends and agent interfaces to probe interactive, sequential, or agentic tasks under partial observability (SpatialWorld (Gao et al., 8 Jun 2026), EvoEmpirBench (Zhao et al., 16 Sep 2025)).

These methods enable large-scale, balanced, and robust datasets supporting both passive and interactive task formats. Controlled distractor generation and multi-view or multi-step setups are standard to preclude superficial pattern matching.
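
The idea of procedural generation with controlled distractors can be illustrated with a toy example. The sketch below builds a reproducible multiple-choice item from a seeded random scene; every wrong option is a real object in the same scene, so the item cannot be solved by spotting an out-of-place choice. The function `make_item`, the shape list, and the one-dimensional "leftmost" question are assumptions made for illustration. The cited benchmarks render full 3D scenes (for example with Blender), which this sketch does not attempt.

```python
import random

SHAPES = ["cube", "sphere", "cylinder", "cone", "torus"]


def make_item(seed, n_choices=4):
    """Build one 'which object is leftmost?' multiple-choice item."""
    rng = random.Random(seed)  # seeded so the same item can be regenerated
    names = rng.sample(SHAPES, n_choices)   # distinct objects, random order
    xs = rng.sample(range(100), n_choices)  # distinct x-coordinates
    scene = dict(zip(names, xs))
    answer = min(scene, key=scene.get)      # object with the smallest x
    distractors = [name for name in names if name != answer]
    return {
        "scene": scene,
        "question": "Which object is leftmost?",
        "choices": names,
        "answer": answer,
        "distractors": distractors,
    }


item = make_item(seed=0)
```

Because the ground truth is computed from the scene parameters rather than annotated by hand, arbitrarily many balanced items can be generated, and the seed makes each one reproducible.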

3. Task Typologies and Diagnostic Sub-abilities

Benchmarks cover the following sub-skills, each isolating a different aspect of spatial competence:

Perceptual and Relational Tasks: Primitive object identification, 2D/3D localization, spatial relation extraction (GSR-Bench (Rajabi et al., 2024), Spatial457 (Wang et al., 12 Feb 2025), Spatial Reasoning in Foundation Models (Mirjalili et al., 26 Sep 2025)).
Geometric Manipulation: Mental rotation, spatial visualization, cross-sectional inference, origami folding, and shape reconstruction (SpatialViz-Bench (Wang et al., 10 Jul 2025), SpinBench (Zhang et al., 29 Sep 2025), GamiBench (Spencer et al., 22 Dec 2025)).
Physical Constraints and Dynamics: Reasoning under extrinsic and intrinsic-dynamic transformations; enforcing occlusion, support, and contact constraints (SSI-Bench (Yang et al., 8 Feb 2026), DynaSolidGeo (Wu et al., 25 Oct 2025)).
Sequential and Agentic Planning: Multi-step pathfinding, navigation, and active exploration under partial observability (SpatialWorld (Gao et al., 8 Jun 2026), GRASP (Tang et al., 2024), EvoEmpirBench (Zhao et al., 16 Sep 2025)).
4D Spatiotemporal Cognition: Memory, action recognition, state-change detection, and prediction in video (Spatial4D-Bench (Wang et al., 31 Dec 2025)).
Pure-Text Spatial Reasoning: Mental modeling from text alone, which separates spatial reasoning from visual pattern matching (SpatialText (Jiang et al., 3 Mar 2026)).

Several benchmarks define specialized metrics beyond raw accuracy, e.g., Viewpoint Consistency (VC) and Impossible Fold Selection Rate (IFSR) in GamiBench (Spencer et al., 22 Dec 2025), Relative Performance Dropping Rate (RPDR) in Spatial457 (Wang et al., 12 Feb 2025), and process-qualified accuracy in DynaSolidGeo (Wu et al., 25 Oct 2025).
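
The exact formulas for these metrics are not given here. As a hedged illustration of what a drop-rate style metric can look like, the sketch below computes the fraction of accuracy lost when moving from one difficulty level to the next. The `relative_drop` function and the accuracy values are assumptions for illustration and should not be read as the RPDR definition used by Spatial457.

```python
def relative_drop(acc_prev, acc_curr):
    """Fraction of the previous level's accuracy that is lost at the current level."""
    if acc_prev <= 0:
        raise ValueError("acc_prev must be positive")
    return (acc_prev - acc_curr) / acc_prev


# Hypothetical accuracies at three increasingly difficult levels.
level_acc = [0.80, 0.60, 0.45]
drops = [relative_drop(a, b) for a, b in zip(level_acc, level_acc[1:])]
```

Normalizing by the previous level's accuracy makes drops comparable between models that start from different baselines.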

4. Empirical Findings and Performance Stratification

Systematic evaluation across dozens of state-of-the-art open-source and proprietary models reveals consistent trends and bottlenecks:

Benchmark	Human Best	Top Model	Open Model	Random	Notable Failure Modes
SpatialBench	96.4% overall	Gemini-2.5-p	Qwen3-VL-235B	n/a	Symbolic (L3), Causal (L4), Planning (L5)
GSR-Bench	>90% Subset A	LLaVA-NeXT	Qwen1.5-110B	~25%	Behind/in-front relation, small objects
CityCube	88.3%	Doubao-1.6	GLM-4.1V-9B	22.8%	Cross-view, scale, egocentric rotation
SpinBench	91.2%	InternVL3-38	InternVL-14B	n/a	Mental/persp. rotation, viewpoint change
SpatialViz-Bench	~95%	Gemini-2.5-p	Llama-4-Scout	25–27%	3D folding, animation, formulaic bias
DynaSolidGeo	n/a	GPT-5	Qwen3-VL-30B	n/a	Visual perception, logic, hallucination
Spatial457	n/a	GPT-4o	InternVL2 8B	n/a	3D pose, depth, collision (6D)
Spatial-DISE	76.8%	Doubao1.5VL	InternVL-3	25%	Multi-step, multi-view reasoning
EarthSpatialBench	F1≈ 0.91 (Within)	Gemini-2.5-p	Qwen3-VL-T-30	n/a	Visual grounding, composite geometry
SIRI-Bench	~70% (<60% error)	Doubao-1.5-p	Qwen2.5-VL-72	n/a	Parameter extraction from video
SpatialWorld	n/a	GPT-5 (17.4%)	Qwen-3.5 (14.1%)	n/a	Partial observability, long-horizon plan
EvoEmpirBench	n/a	-	-	n/a	Local memory, dynamic state update

Across all benchmarks, performance decays markedly as tasks move from static, single-image perception (object detection, simple relations) to high-dimensional, dynamic, multi-step, and cross-perspective reasoning (mental rotation, planning, causal inference, spatiotemporal prediction). Even top proprietary models routinely trail human accuracy by 20–50 percentage points on composite tasks (Xu et al., 26 Nov 2025, Spencer et al., 22 Dec 2025, Xu et al., 20 Jan 2026, Zhao et al., 16 Sep 2025).
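
Two simple quantities make such comparisons concrete: the gap to human accuracy in percentage points, and accuracy rescaled so that chance maps to 0 and human performance to 1. The sketch below uses the human and random-baseline figures from the CityCube row of the table above. The model accuracy is a hypothetical placeholder, not a reported result, and both helper functions are illustrative assumptions.

```python
def gap_pp(human_acc, model_acc):
    """Gap to human accuracy, in percentage points."""
    return human_acc - model_acc


def chance_normalized(model_acc, chance_acc, human_acc):
    """Rescale accuracy so that chance maps to 0.0 and human level to 1.0."""
    return (model_acc - chance_acc) / (human_acc - chance_acc)


human, chance = 88.3, 22.8  # CityCube row: human best and random baseline
model = 50.0                # hypothetical model accuracy, for illustration only

gap = gap_pp(human, model)
score = chance_normalized(model, chance, human)
```

Chance normalization matters when benchmarks differ in the number of answer options, since a raw 50% means something different against a 25% baseline than against a near-zero one.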

5. Failure Modes and Diagnostic Insights

Benchmarks systematically expose persistent model deficiencies, which cluster into the following categories:

Egocentric/Reference Frame Bias: Overcommitment to observer-centric frames; failures under allocentric or perspective-shifted queries (Zhang et al., 29 Sep 2025, Jiang et al., 3 Mar 2026).
Rotational and Mental Simulation Gaps: Chance-level performance on dynamic/mental rotation, folding, cross-sectional reasoning, animation, and long-range transformations (Zhang et al., 29 Sep 2025, Spencer et al., 22 Dec 2025, Wang et al., 10 Jul 2025).
Visual Plausibility Bias & Shortcutting: Preference for coherent yet incorrect visual patterns; overreliance on surface cues and object priors (Spencer et al., 22 Dec 2025, Xu et al., 20 Jan 2026).
Degraded Planning under Partial Observability: Inefficient exploration and lack of recovery from errors in agentic tasks; trial-and-error search rather than strategic execution (Gao et al., 8 Jun 2026, Tang et al., 2024, Zhao et al., 16 Sep 2025).
Combinatorial and Dynamic Complexity Collapse: Near-random performance on multi-view, multi-step, and sim-to-real generalization tasks involving several spatial factors or real-world grounded transformations (Wang et al., 12 Feb 2025, Huang et al., 15 Oct 2025, Wu et al., 25 Oct 2025, Xu et al., 20 Jan 2026).
Hallucination and Logical Inconsistency: In process-evaluation setups, frequent invalid inferences, unjustified leaps, or contradictions in reasoning chains (Wu et al., 25 Oct 2025, Jiang et al., 3 Mar 2026).

Scaling model size, instruction tuning, and chain-of-thought prompting yield only modest improvements on high-complexity tasks; the gains are concentrated in perception and simple relational categories (Xu et al., 26 Nov 2025, Spencer et al., 22 Dec 2025, Stogiannidis et al., 25 Mar 2025).

6. Directions for Benchmark and Model Innovation

Identified gaps motivate several directions:

Explicit Geometric and Physics Modules: Integration of 3D pose/orientation estimators, physics engines, and spatial graph architectures (Wang et al., 12 Feb 2025, Wu et al., 25 Oct 2025, Yang et al., 8 Feb 2026).
Programmatic, Multi-modal Testing: Composition of numeric/geometric input (bboxes, polylines) with raster and natural-language contexts for precise, multifaceted evaluation (Xu et al., 17 Feb 2026, Xu et al., 26 Nov 2025).
Process-oriented Metrics: Judgement of reasoning chains, process-tracing, and actionable sequence synthesis rather than only answer accuracy (Wu et al., 25 Oct 2025, Anand et al., 23 Dec 2025).
Interactive, Task-driven Evaluation: Shift from passive VQA paradigms to agentic, partially observable, iterative-action tasks coupled to real-world simulators (Gao et al., 8 Jun 2026, Zhao et al., 16 Sep 2025).
Sim-to-Real and Multi-environment Generalization: Extension of task variety, environment complexity, and observability structure to stress sim-to-real robustness (Gao et al., 8 Jun 2026, Huang et al., 15 Oct 2025).
Cognitive and Processual Grounding: Taxonomies and constructions explicitly grounded in human cognitive science and psychometric frameworks (Xu et al., 26 Nov 2025, Wang et al., 10 Jul 2025, Jiang et al., 3 Mar 2026).
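
The shift from passive question answering to interactive evaluation can be sketched with a toy environment. Below, an agent moves along a one-dimensional corridor and observes only its own position, never the goal, and the evaluation reports a success rate together with a step-efficiency score. The environment, the `always_right` policy, and the efficiency definition (optimal steps divided by steps taken, averaged over successful episodes) are assumptions for illustration and do not reproduce the interface of SpatialWorld or EvoEmpirBench.

```python
def run_episode(policy, goal, start=3, size=7, max_steps=10):
    """Run one episode; the policy sees only the agent's current position."""
    pos, steps = start, 0
    while steps < max_steps and pos != goal:
        move = policy(pos)                       # -1 = left, +1 = right
        pos = min(max(pos + move, 0), size - 1)  # walls clamp the position
        steps += 1
    return pos == goal, steps


def evaluate(policy, goals, start=3):
    """Return (success rate, mean step efficiency over successful episodes)."""
    successes, efficiencies = 0, []
    for goal in goals:
        ok, steps = run_episode(policy, goal, start=start)
        if ok:
            successes += 1
            optimal = abs(goal - start)
            efficiencies.append(1.0 if steps == 0 else optimal / steps)
    tsr = successes / len(goals)
    mean_eff = sum(efficiencies) / len(efficiencies) if efficiencies else 0.0
    return tsr, mean_eff


def always_right(pos):
    """A naive policy that never explores to the left."""
    return 1


tsr, efficiency = evaluate(always_right, goals=[5, 1, 6])
```

The naive policy reaches the two goals on its right with optimal paths but never recovers when the goal lies to the left, which is the kind of failure that a success-rate metric exposes and a single-answer accuracy score would hide.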

Collectively, these directions aim for models that can “think in space”: models that not only recognize spatial features but also robustly simulate, plan, and reason about geometric, topological, and physical constraints across modalities, views, and horizons.

7. Representative Benchmarks: Summary Table

Benchmark	Modalities	Categories/Taxonomy	Key Diagnostic Features	Reference
SpatialBench	Multi-modal	5-level cognitive	Unified metric, 15 tasks, L₅ planning	(Xu et al., 26 Nov 2025)
GamiBench	Visual	2D→3D planning	Origami: cross-view, physical feasibility	(Spencer et al., 22 Dec 2025)
GSR-Bench	Visual	Relations, grounding	CircularEval, mask/depth, scaling laws	(Rajabi et al., 2024)
Spatial457	Visual	6D spatial	Level-by-level, unbiased attribute, RPDR	(Wang et al., 12 Feb 2025)
SpinBench	Visual	Perspective/rotation	Egocentric/allocentric, 51 subtypes	(Zhang et al., 29 Sep 2025)
SpatialViz-Bench	Visual	4 visualization skills	12 tasks: rotation, folding, animation	(Wang et al., 10 Jul 2025)
DynaSolidGeo	Multimodal	Solid geometry	Dynamic instance gen, process evaluation	(Wu et al., 25 Oct 2025)
EarthSpatialBench	Geo-visual	Distance, topology	Polygons, polylines, quantitative tasks	(Xu et al., 17 Feb 2026)
CityCube	Visual	Cross-view	Urban, rotation/orbit, 5 cognitive dims	(Xu et al., 20 Jan 2026)
SIRI-Bench	Video	3D math, perception	Video-based, multi-step, automatic gen	(Song et al., 17 Jun 2025)
Spatial-DISE	Visual	DISE quadrants	Multi-view, multi-step reasoning	(Huang et al., 15 Oct 2025)
SpatialWorld	Interactive	POMDP, planning	Text-action, 8 env backends, TSR/effic.	(Gao et al., 8 Jun 2026)
EvoEmpirBench	Agent	Dynamic, experience	Long-horizon, local obs, experience mem	(Zhao et al., 16 Sep 2025)
SpatialText	Text-only	5-level, dual source	Human+synthetic, mental modeling	(Jiang et al., 3 Mar 2026)
SSI-Bench	Visual	Constrained manifold	Real 3D structures, ranking, physics	(Yang et al., 8 Feb 2026)
Spatial4D-Bench	Video+img	6 cognitive domains	~40k QA, spatiotemporal, physical law	(Wang et al., 31 Dec 2025)
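
The table lists CircularEval among the diagnostic features of GSR-Bench. As the strategy is commonly described, a multiple-choice question counts as solved only if the model answers correctly under every circular shift of the answer options, which penalizes position bias. The sketch below assumes that description; the function name, the `predict` callback signature, and the two toy predictors are illustrative, and whether GSR-Bench uses exactly this variant is not stated here.

```python
def circular_eval(choices, answer_index, predict):
    """Return True only if `predict` is correct under every rotation of `choices`.

    `predict` takes a list of option strings and returns the chosen index.
    """
    n = len(choices)
    for shift in range(n):
        rotated = choices[shift:] + choices[:shift]
        correct_pos = (answer_index - shift) % n  # where the answer moved to
        if predict(rotated) != correct_pos:
            return False
    return True


choices = ["left of the table", "right of the table", "behind the table", "on the table"]


def position_biased(options):
    """Toy predictor that always picks the first option."""
    return 0


def content_aware(options):
    """Toy predictor that finds the right option wherever it appears."""
    return options.index("left of the table")


biased_passes = circular_eval(choices, 0, position_biased)
aware_passes = circular_eval(choices, 0, content_aware)
```

A predictor that always picks the first option is correct on the unshifted question but fails as soon as the options rotate, so it does not pass.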

Together, these benchmarks represent the state of the art in evaluating and dissecting spatial reasoning in AI systems, and they inform the specification of next-generation, spatially aware models and agentic architectures.