Spatial Supersensing Benchmarks Overview
- Spatial supersensing benchmarks are systematic evaluations of models’ ability to perceive and reason about spatial structure, geometry, and semantics across diverse domains.
- They bridge atomic tasks like object localization with complex compositional challenges such as multi-view fusion and long-horizon memory, capabilities essential for robust real-world applications.
- Detailed measurement protocols and annotation pipelines enable precise error attribution, inspiring advances in multimodal fusion and efficient, real-time spatial processing.
Spatial supersensing benchmarks provide systematic, fine-grained, and multi-domain evaluations of models’ ability to perceive and reason about spatial structure, geometry, and spatial semantics across 2D, 3D, multimodal, and sequential domains. These benchmarks probe not only basic spatial primitives, such as object localization and pairwise spatial relations, but also the compositional, memory-intensive, and world-modeling capabilities required for robust performance in complex or long-horizon tasks. This article organizes and analyzes the leading spatial supersensing benchmarks, their task definitions, construction pipelines, evaluation protocols, and salient empirical findings, covering static (single or few-frame), dynamic (long-horizon video), and geospatial (remote sensing) settings.
1. Taxonomy of Spatial Supersensing Benchmarks
Spatial supersensing benchmarks fall along several axes:
- Atomic vs. Compositional Reasoning: Some benchmarks (e.g., SpaCE-10) measure “atomic” capabilities such as object recognition (C₁), localization (C₂), or spatial relationships (C₃), while explicitly evaluating the fusion of these primitives into higher-order, compositional tasks (e.g., functional reasoning).
- 2D/3D Grounding: Datasets test object-centric or scene-centric spatial understanding, with modalities spanning RGB, metric depth, point clouds, or DEM (digital elevation maps).
- Memory and Long-horizon Cognition: Streaming benchmarks (e.g., VSI-SUPER) test continual recall, counting, and memory organization over arbitrarily long videos under domain shift and context-limited regimes.
- Functional Application Domains: Task domains include indoor robotics and embodied AI, remote sensing and geospatial intelligence, cognitive psychology, and autonomous navigation.
A non-exhaustive mapping is provided below:
| Benchmark | Domain | Modalities | Core Tasks |
|---|---|---|---|
| SpaCE-10 | 3D indoor | RGB, point cloud | Atomic & compositional spatial QA |
| RS3DBench | Remote sensing | RGB, DEM, text | 3D depth estimation, multi-modal fusion |
| MM-Spatial | 3D indoor | RGB, depth, point cloud | Size, distance, relation, 3D grounding |
| E3D-Bench | 3D general | RGB, multi-view | Depth, 3D recon, pose, view synthesis |
| VSI-SUPER | Video | Long-form video | Streaming recall/count, predictive world |
| SPACE | Cognitive | Text, images | Mapping, layout, memory, planning |
| What’sUp | 2D images | Photos | Fine-grained spatial preposition detection |
2. Task Definitions and Structural Design
SpaCE-10 (Gong et al., 9 Jun 2025) formalizes 10 atomic spatial capabilities, including object recognition, spatial localization, qualitative and topological relationships, size comparison, counting, function knowledge, multi-view fusion, forward/reverse reasoning, and situated observation. These are combined into 8 compositional QA types (e.g., Entity Quantification, Functional Reasoning), with each type invoking several atomic capabilities.
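This hierarchical structure lends itself to simple programmatic error attribution. The sketch below assumes a hypothetical mapping from compositional QA types to the atomic capabilities they invoke (the mapping shown is illustrative; the actual SpaCE-10 assignments come from its annotation files) and aggregates per-capability error rates from per-question results.

```python
from collections import defaultdict

# Hypothetical mapping from compositional QA types to the atomic capabilities
# they invoke; the real SpaCE-10 assignments come from its annotation files.
COMPOSITION_TO_ATOMICS = {
    "Entity Quantification": {"C1", "C2", "C5"},       # recognition, localization, counting
    "Functional Reasoning": {"C1", "C3", "C6", "C7"},  # recognition, relations, function, multi-view
}

def attribute_errors(results):
    """Per-capability error rates from (composition_type, is_correct) pairs.

    Every atomic capability invoked by a wrongly answered question is charged
    one error, giving a coarse axis for error attribution.
    """
    errors, totals = defaultdict(int), defaultdict(int)
    for comp_type, is_correct in results:
        for cap in COMPOSITION_TO_ATOMICS.get(comp_type, ()):
            totals[cap] += 1
            if not is_correct:
                errors[cap] += 1
    return {cap: errors[cap] / totals[cap] for cap in totals}

# Example: two Functional Reasoning misses and one Entity Quantification hit.
print(attribute_errors([("Functional Reasoning", False),
                        ("Functional Reasoning", False),
                        ("Entity Quantification", True)]))
```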
CA-VQA from MM-Spatial (Daxberger et al., 17 Mar 2025) covers:
- Binary and multi-choice spatial relationship prediction (e.g., left-of, in-front-of).
- Metric regression for size and distance.
- 3D amodal grounding (predicting 8-corner bounding boxes).
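For the 3D grounding task, a common scoring ingredient is volumetric IoU between predicted and ground-truth boxes. The sketch below is a simplified example that reduces each 8-corner box to an axis-aligned bound; oriented amodal boxes, as used in practice, would require a polyhedral intersection rather than this min/max shortcut.

```python
import numpy as np

def corners_to_aabb_iou(corners_a, corners_b):
    """IoU between two boxes given as (8, 3) corner arrays.

    Simplifying assumption: each box is reduced to its axis-aligned bounding
    volume via per-axis min/max. Oriented amodal boxes, as benchmarks with
    8-corner targets typically use, require a polyhedral intersection instead.
    """
    a_min, a_max = corners_a.min(axis=0), corners_a.max(axis=0)
    b_min, b_max = corners_b.min(axis=0), corners_b.max(axis=0)
    inter = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None)
    inter_vol = inter.prod()
    vol_a = (a_max - a_min).prod()
    vol_b = (b_max - b_min).prod()
    return float(inter_vol / (vol_a + vol_b - inter_vol + 1e-9))
```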
E3D-Bench (Cong et al., 2 Jun 2025) generalizes spatial supersensing for 3D foundation models by evaluating:
- Sparse-view and video depth estimation,
- 3D reconstruction (point cloud, mesh),
- Multi-view pose estimation,
- Novel view synthesis.
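As an illustration of the task-appropriate metrics such a suite relies on, the sketch below computes the geodesic rotation error commonly underlying multi-view pose accuracy scores; the 15-degree threshold mentioned in the docstring is an assumption, and exact conventions vary per benchmark.

```python
import numpy as np

def rotation_angle_error_deg(R_pred, R_gt):
    """Geodesic angle (degrees) between predicted and ground-truth 3x3 rotations.

    Pose accuracy is then often reported as the fraction of pairs whose error
    falls below a threshold (e.g., 15 degrees); thresholds vary per benchmark.
    """
    R_rel = R_pred @ R_gt.T
    cos_angle = np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))
```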
RS3DBench (Wang et al., 23 Sep 2025) focuses on precisely aligned depth estimation from globally distributed RGB + DEM + text data, evaluated in both supervised and zero-shot/transfer regimes, and introduces multimodal fusion tasks.
VSI-SUPER (Yang et al., 6 Nov 2025) in Cambrian-S distinguishes:
- Long-horizon spatial recall (VSR)—identifying the temporal order of rare spatially-specific events.
- Visual spatial counting (VSC)—tracking the unique count of object categories across discontinuous, streaming video.
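A VSC-style evaluation can be summarized as a strictly sequential harness in which the model never sees future chunks and must answer running-count queries at checkpoints. The sketch below is illustrative only: the `model.update()` / `model.query_count()` interface and the relative-accuracy score are assumptions, not the benchmark's actual API or official metric.

```python
def evaluate_streaming_count(model, video_chunks, checkpoints, gt_counts, category):
    """Minimal VSC-style harness: the model ingests chunks strictly in order and
    reports a cumulative unique-object count at each checkpoint.

    The update()/query_count() interface and the relative-accuracy score
    max(0, 1 - |pred - gt| / gt) are illustrative assumptions, not an official API.
    """
    scores = []
    for i, chunk in enumerate(video_chunks):
        model.update(chunk)                    # no access to future chunks
        if i in checkpoints:
            pred = model.query_count(category)
            gt = gt_counts[i]
            scores.append(max(0.0, 1.0 - abs(pred - gt) / max(gt, 1)))
    return sum(scores) / max(len(scores), 1)
```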
SPACE (Ramakrishnan et al., 9 Oct 2024) and What’sUp (Kamath et al., 2023) instead probe cognitive capabilities using classic spatial cognition tasks (e.g., mental rotation, mapping, route retracing, relational prepositions) in controlled settings.
3. Data Generation, Annotation Pipelines, and Modalities
Construction pipelines are typically hierarchical and semi-automatic:
- SpaCE-10: A five-stage annotation pipeline (scene snapshots, caption generation and inspection, QA generation, verification, and cross-capability integration) yields 6,000+ QA pairs over 811 real 3D indoor scenes (ScanNet, ScanNet++, 3RScan, ARKitScenes); each QA sample invokes 3.2 atomic capabilities on average.
- CA-VQA (MM-Spatial): Assembled from ∼2,000 ARKitScenes videos, yielding 62K QA pairs across spatial relationship, size, distance, and 3D grounding, filtered to ensure vision dependence.
- RS3DBench: 54,951 RGB–DEM–text triplets, globally distributed and multi-resolution (0.5–30 m), with pixel-level alignment and text grounding by LLM-guided, expert-reviewed annotation.
- E3D-Bench: Aggregates standard 3D datasets across scales (indoor, outdoor, aerial, synthetic), supporting pipeline-standardized evaluation for over 16 state-of-the-art models.
- VSI-SUPER: Generates streaming QA for both recall and counting, leveraging curated real and simulated data (VSI-590K SFT) and adversarial insertion (objects, room transitions).
Modalities range from dense metric depth, point clouds, and 3D scene graphs to egocentric/allocentric imagery and tokenized textual grids.
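Despite these differences, most benchmarks ultimately serialize their annotations into per-question records tying a question to its answer, modality references, and capability tags. The dataclass below is a hypothetical schema for such a record; all field names are assumptions rather than any benchmark's release format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SpatialQASample:
    """Hypothetical schema for a spatial-supersensing QA record; field names are
    illustrative and not taken from any benchmark's actual release format."""
    question: str
    answer: str
    options: Optional[List[str]] = None                      # multiple-choice items
    image_paths: List[str] = field(default_factory=list)     # RGB frames or views
    depth_path: Optional[str] = None                          # metric depth or DEM
    pointcloud_path: Optional[str] = None
    capabilities: List[str] = field(default_factory=list)    # e.g. ["C2", "C7"]
    metric_target: Optional[float] = None                     # size/distance in meters
```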
4. Metrics, Protocols, and Model Comparisons
Benchmarks employ task-appropriate metrics:
- Classification/QA: Answer accuracy on multiple-choice or exact-match questions (used in SpaCE-10, CA-VQA, SPACE, What’sUp).
- Regression: Mean absolute error (MAE), absolute relative error (AbsRel), and accuracy-at-relative-threshold for distance/size estimates (CA-VQA, RS3DBench, E3D-Bench); a metric sketch follows this list.
- 3D Grounding: Intersection over Union (IoU) and average precision at IoU thresholds (AP@threshold).
- Streaming/Memory Tasks: Mean relative accuracy, temporal recall, success-weighted path length (SPL).
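The regression and streaming metrics above reduce to a few short formulas. The sketch below implements MAE, AbsRel, accuracy-at-relative-threshold (delta < 1.25), and one common formulation of mean relative accuracy for scalar answers; per-benchmark conventions (masking, thresholds) may differ.

```python
import numpy as np

def depth_metrics(pred, gt, rel_threshold=1.25):
    """MAE, AbsRel, and accuracy-at-relative-threshold over valid ground-truth pixels."""
    mask = gt > 0
    p, g = pred[mask], gt[mask]
    mae = np.mean(np.abs(p - g))
    absrel = np.mean(np.abs(p - g) / g)
    delta = np.maximum(p / g, g / p)
    return {"MAE": float(mae),
            "AbsRel": float(absrel),
            f"delta<{rel_threshold}": float(np.mean(delta < rel_threshold))}

def mean_relative_accuracy(pred, gt, thresholds=np.arange(0.5, 1.0, 0.05)):
    """One common MRA formulation for scalar answers: the fraction of confidence
    thresholds t at which the relative error |pred - gt| / |gt| stays below 1 - t."""
    rel_err = abs(pred - gt) / abs(gt)
    return float(np.mean([rel_err < (1.0 - t) for t in thresholds]))
```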
Protocols standardize prompt design, enforce vision-only reliance by filtering language-bias-prone QA, and use GPT-4o or other LLMs as external judges in some settings (e.g., SpaCE-10).
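Where LLM judges are used, grading typically reduces to a templated comparison of the model answer against a reference. The snippet below is a hypothetical prompt builder, not the rubric of any specific benchmark.

```python
def build_judge_prompt(question, reference_answer, model_answer):
    """Hypothetical LLM-as-judge prompt for open-ended spatial QA grading; the
    actual rubric and judge model differ across benchmarks."""
    return (
        "You are grading an answer to a spatial reasoning question.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference_answer}\n"
        f"Model answer: {model_answer}\n"
        "Reply with a single word: CORRECT or INCORRECT."
    )
```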
Representative model results on SpaCE-10 (Gong et al., 9 Jun 2025):
| Model | Overall Accuracy |
|---|---|
| Human | 72.0% |
| GPT-4o (closed) | 37.4% |
| LLaVA-OneVision-72B | 46.7% |
| InternVL2.5-78B (2D) | 44.1% |
| LEO-7B (specialized 3D) | 12.3% |
| GPT4Scene | 28.0% |
On CA-VQA (Daxberger et al., 17 Mar 2025), MM-Spatial-3B achieves 47–49% overall accuracy, with chain-of-thought reasoning, multi-view aggregation, and depth tool-use each yielding incremental boosts.
RS3DBench (Wang et al., 23 Sep 2025) empirical results indicate that text-conditioned stable-diffusion depth estimation outperforms pix2pix GANs (MAE 23.4 vs. 34.8), with additional cross-terrain ablations confirming the value of multimodal (text+image) fusion.
5. Limiting Factors and Failure Modes
Current spatial supersensing benchmarks expose critical model limitations:
- Compositionality: Integrating multi-view fusion, reverse reasoning, and situated observation causes sharp performance drops; for example, Functional Reasoning (FR) accuracy in SpaCE-10 falls from 83.0% to 28.2% when C₇ (multi-view fusion), C₉ (reverse reasoning), and C₁₀ (situated observation) are required.
- Counting: Accurate enumeration (C₅) remains the dominant bottleneck, with MLLMs achieving 20–30% accuracy versus 69% for humans on counting tasks (Gong et al., 9 Jun 2025).
- Long-horizon Memory: VSI-SUPER shows that brute-force token expansion is insufficient for unbounded streams; models with surprise-driven memory modules, rather than simple context accumulation, can maintain recall/count performance (Yang et al., 6 Nov 2025). A toy memory sketch follows this list.
- Shortcut Exploitation: Analysis in (Udandarao et al., 20 Nov 2025) demonstrates that current streaming benchmarks (e.g., VSR/VSC) can be solved by retrieval-style pipelines or segment-based heuristics without genuine world-modeling: the NoSense baseline solves VSR near-perfectly through semantic matching alone, ignoring temporal and spatial structure, and VSC performance collapses under segment repeats because models lack persistent object identity.
- Spatial Prepositions: Datasets such as What’sUp (Kamath et al., 2023) show that even large-scale VL models struggle with controlled spatial relation discrimination, often performing at or near chance, unaffected by diverse modeling interventions.
- Scaling Law Gaps: Augmenting data or parameters (e.g., S-7B, VSI-590K) yields improvement on short-form, pre-segmented spatial QA but does not close the compositionality or memory gap on streaming and long-horizon reasoning (Yang et al., 6 Nov 2025, Ramakrishnan et al., 9 Oct 2024).
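To make the surprise-driven memory idea concrete, the toy class below gates what enters a bounded memory by dissimilarity to recently stored features. It is a minimal illustration of the mechanism, not the Cambrian-S architecture; the threshold, capacity, and cosine-similarity gating are assumptions.

```python
import numpy as np

class SurpriseGatedMemory:
    """Toy surprise-gated memory: a frame feature enters the bounded memory only
    when it is sufficiently dissimilar from recently stored features. This is an
    illustration of the mechanism, not the Cambrian-S architecture."""

    def __init__(self, threshold=0.3, capacity=512):
        self.threshold = threshold
        self.capacity = capacity
        self.slots = []                      # stored 1-D feature vectors

    def surprise(self, feat):
        if not self.slots:
            return 1.0
        sims = [f @ feat / (np.linalg.norm(f) * np.linalg.norm(feat) + 1e-9)
                for f in self.slots[-32:]]   # compare against recent memory only
        return 1.0 - max(sims)

    def update(self, feat):
        if self.surprise(feat) > self.threshold:
            self.slots.append(feat)
            if len(self.slots) > self.capacity:
                self.slots.pop(0)            # evict the oldest entry when over budget
```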
6. Implications for Research and Application Domains
Spatial supersensing benchmarks drive progress in several directions:
- Diagnosis and Fine-grained Error Attribution: Hierarchical capability breakdowns (as in SpaCE-10) provide concrete axes along which to interpret failures and prioritize research (e.g., counting, multi-view fusion).
- Benchmark-driven Model Design: Tool-use strategies (depth estimation, CoT), multi-view aggregation, and surprise-driven memory/segmentation modules have all been directly motivated by benchmark findings (Daxberger et al., 17 Mar 2025, Yang et al., 6 Nov 2025).
- Robust Geospatial AI: RS3DBench enables the transfer of RGB-D research from close-range to multi-resolution remote sensing, supporting applications in environmental change detection, planning, and disaster response (Wang et al., 23 Sep 2025).
- Generalization and Transfer: E3D-Bench and CA-VQA include OOD splits (e.g., air-ground, synthetic/faked 3D, large-scale metric ranges) to test the reliability of supersensing under significant data or domain distribution shifts (Cong et al., 2 Jun 2025, Daxberger et al., 17 Mar 2025).
- Design Guidelines for Future Benchmarks: Adversarial meta-evaluation, structural invariance checks (repeat, shuffle, revisit), continuous video, and open-ended (non-multiple-choice) targets are recommended to close off shortcut solutions and genuinely test persistent world-modeling (Udandarao et al., 20 Nov 2025).
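One such structural invariance check can be expressed as a unit-test-style probe: repeating already-seen segments should leave a unique-object count unchanged for any model with persistent object identity. The sketch below assumes a hypothetical `count_fn` interface.

```python
def check_repeat_invariance(count_fn, segments, category, tol=0):
    """Meta-evaluation probe: repeating already-seen segments must not change the
    predicted unique-object count if object identity is maintained. `count_fn`
    is a hypothetical callable mapping (segments, category) to a count."""
    base = count_fn(segments, category)
    repeated = count_fn(segments + segments, category)   # every segment shown twice
    return abs(repeated - base) <= tol
```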
7. Open Problems and Future Directions
Key directions explicitly emerging from these benchmarks:
- True Spatial Memory: Developing models that can maintain object identity and world maps across occlusion, revisit, frame permutation, and long time horizons (Yang et al., 6 Nov 2025, Udandarao et al., 20 Nov 2025).
- Hybrid and Multimodal Fusion: Rigorously integrating depth, text, and novel sensing modalities (SAR, multispectral, tactile) is highlighted as a necessity for robust supersensing, especially under adverse or sparse data (Wang et al., 23 Sep 2025).
- Cognitive Alignment: Benchmarks inspired by animal and human spatial cognition (e.g., SPACE) demonstrate that large models remain well below embodied levels of spatial representation—fusing perception, mapping, and planning (Ramakrishnan et al., 9 Oct 2024).
- Efficient Real-time Inference: For robotics and autonomous systems, benchmarks call for investment in efficient, online, memory-limited models capable of sub-second supersensing under real-world compute constraints (Cong et al., 2 Jun 2025).
- Universal Spatial Foundation Models: Recommendations include expanding to spatiotemporal supersensing, standardizing open challenges for complex, dynamic environments, and integrating meta-evaluation protocols to avoid trivial or shortcut solutions (Wang et al., 23 Sep 2025, Udandarao et al., 20 Nov 2025).
Spatial supersensing benchmarks are thus essential diagnostic tools, revealing the persistent limitations and requisite advances for robust spatial cognition in AI systems across vision, language, 3D geospatial analysis, and embodied intelligence.