Spatial Reasoning Benchmarks

Updated 26 June 2026
  • Spatial reasoning benchmarks are systematic evaluations that assess computational agents’ abilities to perceive, represent, and manipulate spatial structures and transformations in multi-dimensional settings.
  • They employ rigorous methodologies like procedural generation, expert annotation, and simulation to classify tasks across perceptual, geometric, and planning dimensions.
  • Insights from these benchmarks reveal key model limitations, guiding advances in embodied AI, robotics, and scene understanding.

Spatial reasoning benchmarks provide systematic, multi-task evaluations of computational agents’ abilities to perceive, represent, and manipulate spatial relations, structure, transformations, and trajectories in 2D, 3D, and 4D settings. These benchmarks are designed to probe distinct facets of spatial intelligence, ranging from low-level perceptual grounding to high-level causal inference and planning, using controlled datasets, diverse task taxonomies, and rigorously defined performance metrics. They are vital for revealing persistent limitations in contemporary large language models (LLMs), vision-language models (VLMs), and multimodal LLMs (MLLMs), and serve as the basis for progress in embodied AI, robotics, scene understanding, and agentic systems.

1. Taxonomies of Spatial Reasoning Abilities

Benchmarks have formalized a range of taxonomic frameworks that partition spatial cognition either into hierarchical levels or into quadrants.

Benchmarks such as SpatialBench (Xu et al., 26 Nov 2025), Spatial-DISE (Huang et al., 15 Oct 2025), and GamiBench (Spencer et al., 22 Dec 2025) explicitly operationalize these taxonomies in their dataset construction and task design, facilitating both fine-grained skill attribution and unified capability metrics.

2. Benchmark Construction Methodologies

Spatial reasoning benchmarks employ a range of rigorous methodologies optimized for reproducibility, coverage, and diagnostic power.

These methods enable large-scale, balanced, and robust datasets supporting both passive and interactive task formats. Controlled distractor generation and multi-view or multi-step setups are standard to preclude superficial pattern matching.
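As a rough illustration of procedural generation with controlled distractors, the sketch below builds four-way multiple-choice items about a 2D relation between two points. Everything in it is hypothetical (the function names `relation`, `make_item`, `make_dataset`, the 5x5 grid, and the convention that y increases upward) and it is not taken from any of the benchmarks named here.

```python
import random

# Hypothetical relation vocabulary; the three wrong labels act as distractors.
RELATIONS = ["left of", "right of", "above", "below"]


def relation(a, b):
    """Dominant 2D relation of point a with respect to point b.

    Assumes (x, y) coordinates with y increasing upward and a != b.
    """
    dx, dy = a[0] - b[0], a[1] - b[1]
    if abs(dx) >= abs(dy):
        return "left of" if dx < 0 else "right of"
    return "below" if dy < 0 else "above"


def make_item(rng):
    """Sample one scene and derive its question, options, and ground truth."""
    grid = [(x, y) for x in range(5) for y in range(5)]
    a, b = rng.sample(grid, 2)  # two distinct positions
    answer = relation(a, b)
    options = RELATIONS[:]
    rng.shuffle(options)  # answer position carries no signal
    return {
        "scene": {"A": a, "B": b},
        "question": "Where is A relative to B?",
        "options": options,
        "answer_index": options.index(answer),
    }


def make_dataset(n, seed=0):
    """Seeded generation so the same dataset can be regenerated exactly."""
    rng = random.Random(seed)
    return [make_item(rng) for _ in range(n)]


if __name__ == "__main__":
    for item in make_dataset(3, seed=42):
        print(item)
```

Because the ground truth is computed from the scene rather than annotated, every item is verifiable by construction, and a fixed seed keeps the dataset reproducible.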

3. Task Typologies and Diagnostic Sub-abilities

Benchmarks comprehensively span sub-skills to dissect spatial competence:

Benchmarks have instantiated specialized metrics beyond raw accuracy, e.g., Viewpoint Consistency (VC) and Impossible Fold Selection Rate (IFSR) in GamiBench (Spencer et al., 22 Dec 2025), Relative Performance Dropping Rate (RPDR) in Spatial457 (Wang et al., 12 Feb 2025), and process-qualified accuracy in DynaSolidGeo (Wu et al., 25 Oct 2025).
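The text above does not define these metrics. As a loose illustration of what a "relative performance drop" between difficulty levels could look like, the sketch below computes the fractional loss in accuracy from one level to the next. This is only one plausible reading and may differ from the RPDR definition used in Spatial457; the function names and the accuracy values are hypothetical.

```python
def relative_drop(acc_before, acc_after):
    """Fraction of accuracy lost when moving from one level to the next.

    Assumed form: (acc_before - acc_after) / acc_before.
    """
    if acc_before <= 0:
        raise ValueError("acc_before must be positive")
    return (acc_before - acc_after) / acc_before


def drops_by_level(acc_by_level):
    """Relative drop between each pair of consecutive difficulty levels."""
    return [relative_drop(a, b) for a, b in zip(acc_by_level, acc_by_level[1:])]


if __name__ == "__main__":
    # Hypothetical accuracies for three increasingly hard levels.
    levels = [0.80, 0.60, 0.30]
    print([round(d, 3) for d in drops_by_level(levels)])
```

A relative measure of this kind highlights which level transition costs a model the most, independent of how high its starting accuracy was.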

4. Empirical Findings and Performance Stratification

Systematic evaluation across dozens of state-of-the-art open-source and proprietary models reveals consistent trends and bottlenecks:

Benchmark | Human Best | Top Model | Open Model | Random | Notable Failure Modes
SpatialBench | 96.4% overall | Gemini-2.5-p | Qwen3-VL-235B | n/a | Symbolic (L3), Causal (L4), Planning (L5)
GSR-Bench | >90% Subset A | LLaVA-NeXT | Qwen1.5-110B | ~25% | Behind/in-front relation, small objects
CityCube | 88.3% | Doubao-1.6 | GLM-4.1V-9B | 22.8% | Cross-view, scale, egocentric rotation
SpinBench | 91.2% | InternVL3-38 | InternVL-14B | n/a | Mental/persp. rotation, viewpoint change
SpatialViz-Bench | ~95% | Gemini-2.5-p | Llama-4-Scout | 25–27% | 3D folding, animation, formulaic bias
DynaSolidGeo | n/a | GPT-5 | Qwen3-VL-30B | n/a | Visual perception, logic, hallucination
Spatial457 | n/a | GPT-4o | InternVL2 8B | n/a | 3D pose, depth, collision (6D)
Spatial-DISE | 76.8% | Doubao1.5VL | InternVL-3 | 25% | Multi-step, multi-view reasoning
EarthSpatialBench | F1≈ 0.91 (Within) | Gemini-2.5-p | Qwen3-VL-T-30 | n/a | Visual grounding, composite geometry
SIRI-Bench | ~70% (<60% error) | Doubao-1.5-p | Qwen2.5-VL-72 | n/a | Parameter extraction from video
SpatialWorld | n/a | GPT-5 (17.4%) | Qwen-3.5 (14.1%) | n/a | Partial observability, long-horizon plan
EvoEmpirBench | n/a | - | - | n/a | Local memory, dynamic state update

Across all benchmarks, performance decays markedly as tasks move from static, single-image perception (object detection, simple relations) to high-dimensional, dynamic, multi-step, and cross-perspective reasoning (mental rotation, planning, causal inference, spatiotemporal prediction). Even top proprietary models routinely trail human accuracy by 20–50 percentage points on composite tasks (Xu et al., 26 Nov 2025, Spencer et al., 22 Dec 2025, Xu et al., 20 Jan 2026, Zhao et al., 16 Sep 2025).
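To make a gap in "percentage points" concrete, and to account for the random baselines listed for some benchmarks, the following sketch computes the human-model gap and a chance-normalized accuracy. The CityCube human (88.3%) and random (22.8%) figures are those reported above; the model accuracy of 50.0% is a hypothetical placeholder, and both function names are illustrative.

```python
def gap_points(human_pct, model_pct):
    """Human-model gap in percentage points."""
    return human_pct - model_pct


def chance_normalized(acc_pct, chance_pct):
    """Rescale accuracy so that chance maps to 0.0 and a perfect score to 1.0."""
    if not 0.0 <= chance_pct < 100.0:
        raise ValueError("chance_pct must be in [0, 100)")
    return (acc_pct - chance_pct) / (100.0 - chance_pct)


if __name__ == "__main__":
    human, chance = 88.3, 22.8   # CityCube values reported above
    model = 50.0                 # hypothetical model accuracy, for illustration only
    print(f"gap: {gap_points(human, model):.1f} points")
    print(f"model, chance-normalized: {chance_normalized(model, chance):.3f}")
    print(f"human, chance-normalized: {chance_normalized(human, chance):.3f}")
```

Normalizing against chance matters when comparing benchmarks with different answer formats, since a raw score near 25% on a four-way task carries no more information than guessing.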

5. Failure Modes and Diagnostic Insights

Benchmarks systematically expose persistent model deficiencies, which cluster into:

Scaling model size, instruction tuning, and chain-of-thought prompting yield only modest improvements for high-complexity tasks; numeric gains are predominantly in perception and simple relational categories (Xu et al., 26 Nov 2025, Spencer et al., 22 Dec 2025, Stogiannidis et al., 25 Mar 2025).

6. Directions for Benchmark and Model Innovation

Identified gaps motivate several directions:

Collectively, these future-oriented approaches aim for models that can “think in space”—not just recognize spatial features, but robustly simulate, plan, and reason about geometric, topological, and physical constraints across modalities, views, and horizons.

7. Representative Benchmarks: Summary Table

Benchmark | Modalities | Categories/Taxonomy | Key Diagnostic Features | Citation
SpatialBench | Multi-modal | 5-level cognitive | Unified metric, 15 tasks, L₅ planning | (Xu et al., 26 Nov 2025)
GamiBench | Visual | 2D→3D planning | Origami: cross-view, physical feasibility | (Spencer et al., 22 Dec 2025)
GSR-Bench | Visual | Relations, grounding | CircularEval, mask/depth, scaling laws | (Rajabi et al., 2024)
Spatial457 | Visual | 6D spatial | Level-by-level, unbiased attribute, RPDR | (Wang et al., 12 Feb 2025)
SpinBench | Visual | Perspective/rotation | Egocentric/allocentric, 51 subtypes | (Zhang et al., 29 Sep 2025)
SpatialViz-Bench | Visual | 4 visualization skills | 12 tasks: rotation, folding, animation | (Wang et al., 10 Jul 2025)
DynaSolidGeo | Multimodal | Solid geometry | Dynamic instance gen, process evaluation | (Wu et al., 25 Oct 2025)
EarthSpatialBench | Geo-visual | Distance, topology | Polygons, polylines, quantitative tasks | (Xu et al., 17 Feb 2026)
CityCube | Visual | Cross-view | Urban, rotation/orbit, 5 cognitive dims | (Xu et al., 20 Jan 2026)
SIRI-Bench | Video | 3D math, perception | Video-based, multi-step, automatic gen | (Song et al., 17 Jun 2025)
Spatial-DISE | Visual | DISE quadrants | Multi-view, multi-step reasoning | (Huang et al., 15 Oct 2025)
SpatialWorld | Interactive | POMDP, planning | Text-action, 8 env backends, TSR/effic. | (Gao et al., 8 Jun 2026)
EvoEmpirBench | Agent | Dynamic, experience | Long-horizon, local obs, experience mem | (Zhao et al., 16 Sep 2025)
SpatialText | Text-only | 5-level, dual source | Human+synthetic, mental modeling | (Jiang et al., 3 Mar 2026)
SSI-Bench | Visual | Constrained manifold | Real 3D structures, ranking, physics | (Yang et al., 8 Feb 2026)
Spatial4D-Bench | Video+img | 6 cognitive domains | ~40k QA, spatiotemporal, physical law | (Wang et al., 31 Dec 2025)
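The table lists CircularEval among GSR-Bench's diagnostic features without describing it. As the technique is commonly described, a multiple-choice question is asked once per circular shift of its options, and it counts as solved only if the model is correct on every shift. The sketch below follows that general description; GSR-Bench's exact protocol is not given here, and the `predict` callable and helper names are hypothetical.

```python
def circular_variants(options, answer_index):
    """Yield every circular shift of the options with the shifted answer index."""
    n = len(options)
    variants = []
    for shift in range(n):
        rotated = options[shift:] + options[:shift]
        # An option at original index i lands at (i - shift) mod n.
        variants.append((rotated, (answer_index - shift) % n))
    return variants


def circular_eval(predict, question, options, answer_index):
    """True only if `predict` picks the right option under every rotation.

    `predict(question, options)` is an assumed interface returning an index.
    """
    return all(
        predict(question, opts) == idx
        for opts, idx in circular_variants(options, answer_index)
    )


if __name__ == "__main__":
    opts = ["left of", "right of", "above", "below"]
    oracle = lambda q, o: o.index("above")   # always finds the right text
    always_first = lambda q, o: 0            # purely position-biased guesser
    print(circular_eval(oracle, "Where is A relative to B?", opts, 2))
    print(circular_eval(always_first, "Where is A relative to B?", opts, 0))
```

The point of the design is that a model relying on option position rather than content passes at most one rotation, so position bias no longer inflates the score.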

These benchmarks collectively constitute the state-of-the-art in evaluating and dissecting spatial reasoning in AI systems and inform the specification of next-generation, spatially-aware models and agentic architectures.
