SpatialBench: Evaluating Spatial Cognition
- SpatialBench is a comprehensive benchmark framework assessing spatial cognition in multimodal models via compositional and hierarchical taxonomies.
- It utilizes diverse datasets, multi-view modalities, and rigorous metrics to evaluate tasks like object localization, sequential planning, and causal inference.
- Experimental findings reveal significant performance gaps between models and human benchmarks, prompting advances in neuro-symbolic and allocentric representations.
SpatialBench, as a benchmark term and methodology, has emerged as a central paradigm for evaluating the spatial reasoning and spatial cognition abilities of multimodal LLMs (MLLMs) and vision-LLMs (VLMs). Several eponymous and related benchmarks have been published, each targeting distinct facets of spatial intelligence: compositional reasoning in 3D scenes, hierarchical cognition in real-world environments, multi-viewpoint localization, object-centric arrangement, dynamic agent planning, embodied spatial relations, geospatial phenomena, and spatial visualization. This article synthesizes their definitions, taxonomies, methodologies, key experimental findings, and implications for future research.
1. Taxonomic Foundations of SpatialBench Benchmarks
SpatialBench encompasses both compositional and hierarchical taxonomies of spatial reasoning, each tailored to reveal the structure and limits of spatial intelligence in autonomous models:
- Compositional Framework (SpaCE-10): The benchmark formalizes spatial intelligence as the integration of ten atomic capabilities: object recognition, spatial localization, spatial relationship judgement, size comparison, counting, function knowledge, multi-view fusion, forward thinking, reverse reasoning, and situated observation. Eight compositional tasks are operationalized as explicit compositions of these atomic skills (e.g., entity quantification, object-object relation reasoning, spatial planning), enabling fine-grained diagnosis and cross-model comparison (Gong et al., 9 Jun 2025).
- Hierarchical Cognition (SpatialBench; Xu et al., 26 Nov 2025): A five-level taxonomy is adopted, progressing from basic observation (object detection, absolute distances), through topological relations (adjacency, containment), symbolic reasoning (multi-hop logic, affordance), and causality (spatio-temporal simulation), up to multi-step planning. Fifteen discrete tasks explicitly cover this hierarchy, supported by complexity-weighted metrics that stratify performance and isolate bottlenecks such as symbolic abstraction and causal inference.
Other benchmarks instantiate alternative spatial reasoning architectures: multi-view allocentric/egocentric localization (ViewSpatial-Bench; Li et al., 27 May 2025), object-centric arrangements and downstream scene retrieval (object-centric SpatialBench; Mirjalili et al., 26 Sep 2025), dynamic memory-based planning (EvoEmpirBench; Zhao et al., 16 Sep 2025), and embodied spatial understanding (EmbSpatial-Bench; Du et al., 9 Jun 2024).
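To make the compositional view above concrete, the following minimal Python sketch treats each composite task as a set of atomic capabilities and derives a per-capability diagnostic score from composite-task accuracies; the task-to-capability mapping and the averaging scheme are illustrative assumptions, not SpaCE-10's official scoring procedure.

```python
# Illustrative sketch (not the official SpaCE-10 scoring code): composite tasks
# are modeled as sets of atomic capabilities, and a per-capability diagnostic
# score is obtained by averaging the accuracies of every composite task that
# exercises that capability. The task-to-capability mapping below is invented
# for illustration only.
from collections import defaultdict

# Hypothetical composition of three of the eight composite tasks.
TASK_COMPOSITION = {
    "entity_quantification": {"object_recognition", "counting"},
    "object_object_relation": {"object_recognition", "relationship_judgement"},
    "spatial_planning": {"spatial_localization", "forward_thinking", "multi_view_fusion"},
}

def capability_profile(task_accuracy: dict[str, float]) -> dict[str, float]:
    """Average composite-task accuracy over every task that uses a capability."""
    sums, counts = defaultdict(float), defaultdict(int)
    for task, acc in task_accuracy.items():
        for cap in TASK_COMPOSITION.get(task, ()):
            sums[cap] += acc
            counts[cap] += 1
    return {cap: sums[cap] / counts[cap] for cap in counts}

if __name__ == "__main__":
    accs = {"entity_quantification": 0.39, "object_object_relation": 0.55, "spatial_planning": 0.42}
    print(capability_profile(accs))
```

Under this kind of decomposition, a weak score on, say, counting shows up in every composite task that includes it, which is what enables the fine-grained diagnosis described above.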
2. Dataset Composition, Scene Diversity, and Annotation Pipelines
SpatialBench benchmarks emphasize large, controlled datasets that span real and synthetic scenes, diverse modalities, and high annotation quality:
| Benchmark | #Scenes/Instances | Modality | Annotation Method | Key Scene Sources |
|---|---|---|---|---|
| SpaCE-10 | 811 scenes, ~6k QAs | 3D point clouds + 2D | Hierarchical pipeline (LLMs + human verification) | ScanNet, ScanNet++, 3RScan, ARKitScenes |
| SpatialBench (Xu et al., 26 Nov 2025) | 50 egocentric videos, 1,347 QAs | 2D RGB + 3D LiDAR | Pairwise annotation + model-aided checking | Custom LiDAR + RGB corpus |
| ViewSpatial-Bench | >1,000 scenes, 5,712 QAs | 2D RGB, 3D pose | Automated 3D labeling + manual curation | ScanNet, MS-COCO |
| Object-centric SpatialBench | 11,079 images (synthetic) | Rendered 2D images | Pipeline w/ FLUX-Kontext backgrounds | Custom 3D assets |
| EmbSpatial-Bench | 2,181 images, 3,640 QAs | RGB-D, simulators | Simulator projection, automated, human QA | Matterport3D, AI2-THOR, ScanNet |
| OBSR | 66K-684 hexagons/city, 7 datasets | Multi-modal geospatial vectors | Standardized H3 preprocessing, open-source | Airbnb, crime, taxi, etc. |
| SpatialViz-Bench | 1,180 MC problems | Synthetic visualizations | Python/FreeCAD procedural generation | Benchmark-specific |
Annotation pipelines are typically hierarchical, combining automated object selection and scene captioning via LLMs, structured relation parsing, template and model-generated QAs, and multi-stage human vetting (SpaCE-10). For video and sensor modalities, spatial measurements are extracted from LiDAR or depth maps, while 3D viewpoint transformations and camera-pose simulations are supported (ViewSpatial-Bench, EmbSpatial-Bench).
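As a hedged illustration of the template-based QA-generation stage in such pipelines, the sketch below assumes the upstream LLM captioning and relation parsing have already produced structured relation triples; the record fields and templates are invented for this example and do not reproduce any benchmark's actual schema.

```python
# Minimal sketch of a template-based QA-generation stage. Upstream stages
# (LLM captioning, relation parsing) are assumed to have produced
# (subject, relation, anchor) triples; multi-stage human vetting happens
# downstream. Field names and templates are illustrative only.
from dataclasses import dataclass

@dataclass
class SpatialQA:
    scene_id: str
    question: str
    answer: str
    needs_human_review: bool = True  # flagged for the human curation stage

RELATION_TEMPLATES = {
    "left_of": "Which object is to the left of the {anchor}?",
    "inside":  "Which object is inside the {anchor}?",
    "closer":  "Which object is closer to the camera, the {a} or the {anchor}?",
}

def qas_from_relations(scene_id: str, triples: list[tuple[str, str, str]]) -> list[SpatialQA]:
    """Turn (subject, relation, anchor) triples into templated QA records."""
    qas = []
    for subj, rel, anchor in triples:
        template = RELATION_TEMPLATES.get(rel)
        if template is None:
            continue  # unsupported relation type; skip rather than invent a question
        qas.append(SpatialQA(scene_id, template.format(a=subj, anchor=anchor), answer=subj))
    return qas

if __name__ == "__main__":
    triples = [("mug", "left_of", "laptop"), ("keys", "inside", "bowl")]
    for qa in qas_from_relations("scene_0042", triples):
        print(qa)
```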
3. Task Categories, Metrics, and Evaluation Protocols
SpatialBench encompasses a spectrum of spatial tasks, each associated with precise quantitative metrics:
- Atomic tasks: Object recognition (top-1 acc.), localization (mIoU, AP@t), size comparison, counting (numerical accuracy), function inference.
- Compositional tasks: Multi-step planning, reverse reasoning, multi-view fusion (composite accuracy, weighted by task structure).
- Hierarchical tasks (Xu et al., 26 Nov 2025):
- Observation: numerical estimation (mean relative accuracy).
- Relation/topology: MC accuracy.
- Symbolic: logic-chain answer selection (accuracy).
- Causal: event prediction (MC accuracy).
- Planning: sequence instruction execution (accuracy).
Advanced metrics include adaptive, capability-oriented overall scores that balance task complexity and response variance via monotonic weighting and stratified sampling. Instance retrieval tasks use top-k precision and hit rates based on VL-CLIP embeddings. Box and grid localization use IoU, macro-F1, and MCC. Embodied spatial reasoning applies a two-stage (generation/likelihood) evaluation and per-relation accuracy.
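The sketch below illustrates two of these metric ideas in Python: mean relative accuracy for numeric estimates and a complexity-weighted overall score. The log-based weighting function is an assumption chosen only to show a monotone up-weighting of harder levels, not the formula used by any of the benchmarks.

```python
# Hedged sketch of two metric ideas referenced above:
# (1) mean relative accuracy for numeric estimates (e.g., distances), and
# (2) a complexity-weighted overall score. The weighting function is an
# illustrative assumption, not the benchmarks' actual scheme.
import math

def mean_relative_accuracy(preds, targets, eps=1e-9):
    """Average of 1 - |pred - target| / |target|, clipped to [0, 1]."""
    scores = []
    for p, t in zip(preds, targets):
        rel_err = abs(p - t) / max(abs(t), eps)
        scores.append(max(0.0, 1.0 - rel_err))
    return sum(scores) / len(scores)

def weighted_overall(task_scores: dict[str, float], complexity: dict[str, int]) -> float:
    """Monotonically up-weight harder tasks (here: weight = log2(1 + level))."""
    weights = {t: math.log2(1 + complexity[t]) for t in task_scores}
    total = sum(weights.values())
    return sum(task_scores[t] * w for t, w in weights.items()) / total

if __name__ == "__main__":
    print(mean_relative_accuracy([2.1, 4.8], [2.0, 5.0]))  # ~0.955
    print(weighted_overall({"observation": 0.9, "planning": 0.4},
                           {"observation": 1, "planning": 5}))
```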
4. Comparative Experimental Findings and Model Performance Stratification
Benchmarks reveal persistent and substantial gaps between state-of-the-art models and human-level spatial cognition:
| Benchmark/Task | Best Model (%) | Human Baseline (%) | Human–Model Gap |
|---|---|---|---|
| SpaCE-10 Overall Accuracy | 46.7 (LLaVA-OneVision) | 72.0 | −25.3 pp |
| SpaCE-10 Counting (C₅) | 38.8 | 72.0 | −33.2 pp |
| SpatialBench Overall (Gemini-2.5-pro) | 75.79 | 96.40 | −20.6 pp |
| Hierarchical Levels L₃–L₅ (Gemini) | ~74–85 | ~100 | −15–26 pp |
| ViewSpatial-Bench Zero-Shot Overall | 40.3 (best) | — | — |
| ViewSpatial-Bench MVSM Fine-Tuned | 82.09 | — | — |
| Object-centric Localization (mIoU) | 0.77 (detector) | — | — |
| EmbSpatial-Bench (best open, zero-shot → fine-tuned) | 49.1–78.1 | 90.3 | −12 to −41 pp |
| SpatialViz-Bench Visualization | 44.7 (Gemini) | >80 | −35+ pp |
Key findings:
- Counting and multi-object composition remain bottleneck skills across both SpaCE-10 and hierarchical SpatialBench.
- High-level reasoning tasks (multi-hop logic, causal event prediction, planning) display the sharpest performance cliffs, with models often falling below 40% except for the largest proprietary MLLMs.
- 3D-centric models trail behind 2D counterparts on real scene QA, suggesting unresolved 3D-to-language alignment difficulties.
- In multi-view spatial localization (ViewSpatial-Bench), egocentric tasks are solved reasonably, but allocentric/person-perspective tasks expose marked deficits, mitigated only by large-scale fine-tuning (+46.2% accuracy improvement).
- Numeric depth estimation (when trained with paired RGB-D) approaches perfect accuracy (>99%), but comparative reasoning (size, reach) lags (<60%).
5. Methodological Innovations and Architectural Implications
SpatialBench benchmarks have catalyzed methodological advances:
- Hierarchical annotation pipelines integrating LLM-based captioning, CLIP embedding selection, automated QA generation, human Gradio-based curation, and cross-capability integration (Gong et al., 9 Jun 2025).
- Complexity-aware adaptive scoring systems linking per-task statistics to overall capability metrics (Xu et al., 26 Nov 2025).
- Automated 3D annotation frameworks for multi-view and viewpoint-dependent spatial labeling, leveraging pose estimation and geometric computation (Li et al., 27 May 2025); a minimal coordinate-transform sketch follows this list.
- Dynamic agent-centric memory via subjective experience abstraction and truth induction (Agent-ExpVer), supporting continual policy refinement and episodic strategy consolidation (Zhao et al., 16 Sep 2025).
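The sketch below shows the kind of viewpoint transformation such automated labeling relies on: generic camera-pose algebra (not the ViewSpatial-Bench implementation) that maps a point from the camera's egocentric frame into the world (allocentric) frame and then into another viewer's frame; the poses and the left/right decision rule are illustrative assumptions.

```python
# Generic geometric sketch of egocentric -> allocentric -> other-viewer mapping,
# given camera-to-world rotations/translations. Not any benchmark's pipeline.
import numpy as np

def to_world(p_cam: np.ndarray, R_wc: np.ndarray, t_wc: np.ndarray) -> np.ndarray:
    """Camera-frame point -> world frame, given camera-to-world rotation/translation."""
    return R_wc @ p_cam + t_wc

def to_viewer(p_world: np.ndarray, R_wv: np.ndarray, t_wv: np.ndarray) -> np.ndarray:
    """World-frame point -> another viewer's egocentric frame (inverse pose)."""
    return R_wv.T @ (p_world - t_wv)

if __name__ == "__main__":
    # Camera at the origin; a second viewer stands 2 m to the side (hypothetical poses).
    R_cam, t_cam = np.eye(3), np.zeros(3)
    R_viewer, t_viewer = np.eye(3), np.array([0.0, 2.0, 0.0])
    p_world = to_world(np.array([1.0, 0.5, 0.0]), R_cam, t_cam)
    p_other = to_viewer(p_world, R_viewer, t_viewer)
    # Sign of the lateral coordinate decides "left of" vs "right of" for that viewer.
    print(p_world, p_other, "left" if p_other[1] > 0 else "right")
```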
A plausible implication is that sustainable progress in model spatial intelligence demands explicit modularization of perceptual, relational, and symbolic reasoning; improved 3D/allocentric representation architectures; and experience-driven memory mechanisms.
6. Comparison to Prior and Contemporary Benchmarks
SpatialBench taxonomies and datasets subsume and advance upon earlier benchmarks:
- Scene and QA volume consistently exceed predecessors, e.g., SpaCE-10’s 811 scenes and ~6k QAs vs. previous <400 scenes and ≤10 atomic QA types.
- Modalities incorporate both 2D and 3D sensors, depth maps, RGB-D frames, and egocentric video sequences, supporting robustness across embodied and exocentric settings.
- Task diversity now spans from basic localization and containment to multi-hop symbolic logic, causal simulation, dynamic planning, geospatial regression, and pure spatial visualization.
- Metric rigor is enforced via strict accuracy, mIoU, macro-F1, capability-oriented standardized scores, and multiple-choice robustness checks.
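A small sketch of a multiple-choice robustness check of the kind referenced above: the same item is re-asked under permuted option orderings and a prediction only counts if it is consistently correct. `model_answer` is a hypothetical callable standing in for any MLLM wrapper; it is not part of the benchmarks' APIs.

```python
# Illustrative multiple-choice robustness check: require the same correct
# answer across several option orderings. `model_answer(question, options)`
# is a hypothetical stand-in for an MLLM inference wrapper.
from itertools import permutations

def is_robustly_correct(question: str, options: list[str], gold: str,
                        model_answer, max_orders: int = 4) -> bool:
    """Count the item only if the model picks the gold option under every tested ordering."""
    orders = list(permutations(options))[:max_orders]
    answers = {model_answer(question, list(order)) for order in orders}
    return answers == {gold}

if __name__ == "__main__":
    def dummy_model(question, options):  # toy stand-in that always picks "behind"
        return "behind" if "behind" in options else options[0]
    print(is_robustly_correct("Where is the chair relative to the desk?",
                              ["behind", "left of", "on top of"], "behind", dummy_model))
```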
These benchmarks collectively position SpatialBench as a leading reference suite for spatial reasoning assessment, enabling direct, like-for-like model comparison and systematic diagnosis of spatial cognition gaps.
7. Ongoing Challenges and Future Research Directions
SpatialBench research points to unresolved challenges and promising research vectors:
- Integrating neuro-symbolic reasoning modules for multi-hop spatial logic and causal inference.
- Designing physically grounded simulation models to address dynamic environment and planning tasks.
- Improving model robustness on multi-choice, object composition, and size/extents estimation tasks.
- Enhancing allocentric and multi-view spatial representations to overcome egocentric bias.
- Scaling continual learning and memory via subjective experience replay and cross-task “truth” induction.
Future benchmarks are likely to incorporate more granular spatial relation taxonomies ("between," "diagonal," "near"), richer multi-agent coordination, explicit temporal grounding, and broader domain coverage (e.g., environmental sensors, accessibility metrics).
References:
- "SpaCE-10: A Comprehensive Benchmark for Multimodal LLMs in Compositional Spatial Intelligence" (Gong et al., 9 Jun 2025)
- "SpatialBench: Benchmarking Multimodal LLMs for Spatial Cognition" (Xu et al., 26 Nov 2025)
- "ViewSpatial-Bench: Evaluating Multi-perspective Spatial Localization in Vision-LLMs" (Li et al., 27 May 2025)
- "Spatial Reasoning in Foundation Models: Benchmarking Object-Centric Spatial Understanding" (Mirjalili et al., 26 Sep 2025)
- "EvoEmpirBench: Dynamic Spatial Reasoning with Agent-ExpVer" (Zhao et al., 16 Sep 2025)
- "EmbSpatial-Bench: Benchmarking Spatial Understanding for Embodied Tasks with Large Vision-LLMs" (Du et al., 9 Jun 2024)
- "GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs" (Rajabi et al., 19 Jun 2024)
- "OBSR: Open Benchmark for Spatial Representations" (Moska et al., 7 Oct 2025)
- "SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs" (Wang et al., 10 Jul 2025)