SpatialBench: 3D Spatial Reasoning Benchmark
- SpatialBench is a specialized benchmark for assessing 3D spatial reasoning by overcoming the limitations of traditional 2D-based vision-language evaluations.
- It incorporates six spatial task categories (metric depth estimation, proximity/position, existence, counting, contact/reaching, and size comparison) to ensure comprehensive coverage of spatial understanding.
- The benchmark offers a standardized, drop-in evaluation suite that facilitates systematic ablations and improvements in embodied AI and spatially-aware vision-language models.
SpatialBench is a fixed, test-only benchmark designed for the comprehensive evaluation of spatial understanding in vision–language models (VLMs), with a focus on embodied AI and fine-grained 3D spatial reasoning. It addresses the limitation of previous benchmarks, which concentrate on 2D recognition or logical attribute combinations and lack metric depth queries or high-level physical-interaction assessments. SpatialBench unifies metric depth estimation, proximity comparisons, existence checks, contact/reaching questions, counting, and size reasoning in a multi-category suite, establishing a new diagnostic standard for spatial and embodied reasoning capabilities in multimodal systems (Cai et al., 19 Jun 2024).
1. Benchmark Scope and Motivation
SpatialBench was constructed to fill the lack of diagnostic tools for assessing a VLM's ability to perform genuine 3D spatial inference. Existing VQA and scene-understanding benchmarks (e.g., VQA-v2, GQA, MMBench, SceneGraph-QA) remain limited to 2D annotations or lack depth-measurement ground truth, enabling models to exploit 2D spatial cues without engaging in 3D reasoning. SpatialBench is the first evaluation set to require and quantify spatial skills across a spectrum from raw metric depth up to high-level contact (“has-touched?”), size superlatives, and multi-object positional queries.
The primary goals are:
- To isolate the effect of raw depth maps by comparing RGB-D with RGB-only VLM capabilities;
- To enforce diverse spatial tasks preventing overfitting to narrow subtasks;
- To provide a standard drop-in evaluation set with fine-grained, human- and LLM-verified queries (Cai et al., 19 Jun 2024).
2. Dataset Construction and Task Coverage
SpatialBench comprises 120 distinct RGB-D images, derived partly from MME’s public test suite and enhanced with 80 hand-annotated scenes. There are no train or validation splits—SpatialBench is strictly for test-only evaluation to avoid data contamination and overfitting (Cai et al., 19 Jun 2024).
Each image is rescaled to 384×384 pixels. Six spatial task categories are represented, each with 20 images, ensuring coverage of low-, mid-, and high-level spatial reasoning:
| Task Category | Query Type | Examples / Ground Truth |
|---|---|---|
| Depth Estimation | Metric prediction at annotated object | 60 depth queries; ground truth in millimeters (uint24) |
| Proximity/Position | Pairwise “Which is closer?” | 60 pairwise queries |
| Existence | Binary presence | 20 Yes/No queries |
| Counting | Integer class count | 20 queries |
| Reaching (Contact) | Has A touched B? Yes/No | 20 paired (pos/neg) queries |
| Size Comparison | Which of [A,B,...] is biggest? | 20 queries |
Depth ground truth is measured directly by a depth sensor or estimated via ZoeDepth monocular depth estimation (MDE), and stored in precise metric form. Proximity and size queries use object-level bounding boxes and depth. All data are human- and GPT-4o-verified for correctness and ambiguity reduction (Cai et al., 19 Jun 2024).
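As a concrete illustration of the data layout described above, the following Python sketch loads one RGB-D sample and rescales it to 384×384. The assumption that the depth map packs a 24-bit millimeter value across the three channels of a PNG is ours for illustration, not a documented detail of the release.

```python
import numpy as np
from PIL import Image

def load_rgbd_sample(rgb_path: str, depth_path: str, size: int = 384):
    """Load one SpatialBench-style RGB-D sample, resized to size x size.

    Assumes the depth map packs a 24-bit millimeter value across the three
    channels of a PNG (most-significant byte first); the actual release may
    use a different encoding.
    """
    rgb = Image.open(rgb_path).convert("RGB").resize((size, size), Image.BILINEAR)
    depth_img = Image.open(depth_path).convert("RGB").resize((size, size), Image.NEAREST)

    d = np.asarray(depth_img, dtype=np.uint32)
    # Recombine the three bytes into a single uint24 value, interpreted as millimeters.
    depth_mm = (d[..., 0] << 16) | (d[..., 1] << 8) | d[..., 2]
    return np.asarray(rgb), depth_mm.astype(np.float32) / 1000.0  # depth in meters
```

Nearest-neighbor resampling is used for the depth channel so that resizing does not interpolate across depth discontinuities.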
3. Task Definitions and Protocols
SpatialBench formalizes six categories of spatial queries. Representative definitions and evaluation details are:
- Depth Estimation: Given an annotated point or object in the image, the model predicts a metric depth $\hat{d}$. A prediction is considered correct if $|\hat{d} - d_{\mathrm{gt}}| / d_{\mathrm{gt}} \le 0.1$, i.e., within 10% of the ground-truth depth.
- Proximity/Position: Given a pair of objects $(A, B)$, select which object has the smaller center/minimum depth, i.e., which is closer to the camera.
- Existence: Returns Yes/No on the presence of a named object.
- Counting: Predicts count of a designated category.
- Reaching (Contact): Yes/No label for whether A physically contacts B (must handle both positive and negative forms).
- Size Comparison: Multi-choice for largest (or smallest) among specified objects.
Paired questions (e.g., both “Has A touched B?” and “Has B touched A?”) must be answered correctly for credit (“bonus” scoring). All tasks are scored by exact match, except depth estimation (10% tolerance) (Cai et al., 19 Jun 2024).
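A minimal Python sketch of this paired (“bonus”) scoring follows; the sample fields `pair_id`, `prediction`, and `answer` are hypothetical placeholders for however the release actually groups paired questions.

```python
from collections import defaultdict

def paired_accuracy(samples):
    """Paired ("bonus") scoring for Yes/No queries such as reaching/contact:
    credit is granted only if BOTH members of a pair are answered correctly.

    Each element of `samples` is assumed to be a dict with hypothetical keys
    'pair_id', 'prediction', and 'answer'.
    """
    pairs = defaultdict(list)
    for s in samples:
        is_correct = s["prediction"].strip().lower() == s["answer"].strip().lower()
        pairs[s["pair_id"]].append(is_correct)

    correct_pairs = sum(all(flags) for flags in pairs.values())
    return correct_pairs / max(len(pairs), 1)
```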
4. Evaluation Metrics and Baseline Results
Evaluation proceeds via per-task simple accuracy,

$$\mathrm{Acc}_{\text{task}} = \frac{N_{\text{correct}}}{N_{\text{total}}},$$

and for depth, a prediction $\hat{d}$ counts as correct when

$$\frac{|\hat{d} - d_{\mathrm{gt}}|}{d_{\mathrm{gt}}} \le 0.1.$$

There is no use of F1, precision, or recall, as all problems reduce to exact-match or thresholded metric accuracy (Cai et al., 19 Jun 2024).
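The following Python sketch mirrors this scoring; it assumes predictions and ground truths have already been parsed into comparable strings or floats, which is an assumption about data handling rather than the official evaluation script.

```python
def depth_correct(pred_m: float, gt_m: float, tol: float = 0.10) -> bool:
    """A depth prediction is correct if it lies within a relative tolerance
    (10% of the ground-truth value) of the ground truth."""
    return abs(pred_m - gt_m) <= tol * gt_m

def task_accuracy(predictions, ground_truths, task: str) -> float:
    """Simple per-task accuracy: exact string match for categorical tasks,
    relative-tolerance match for depth. Field handling is illustrative."""
    correct = 0
    for pred, gt in zip(predictions, ground_truths):
        if task == "depth":
            correct += depth_correct(float(pred), float(gt))
        else:
            correct += str(pred).strip().lower() == str(gt).strip().lower()
    return correct / max(len(ground_truths), 1)
```

Because the depth criterion is relative, it is unit-agnostic as long as predictions and ground truths share the same unit.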
Key baseline metrics (percent accuracy) are summarized below (excerpted from Table 1):
| Model | Depth | Position | Exist | Count | Reaching | Size |
|---|---|---|---|---|---|---|
| GPT-4o (RGB) | – | 70.6 | 85.0 | 84.5 | 51.7 | 43.3 |
| GPT-4o (RGB-D) | – | 61.8 | 90.0 | 85.2 | 51.7 | 40.0 |
| Bunny-Phi2-3B (RGB) | 70.6 | 50.0 | 75.0 | 89.4 | 51.7 | 26.7 |
| SpatialBot-Phi2-3B (RGB) | 84.1 | 64.7 | 80.0 | 88.0 | 61.7 | 28.3 |
| Bunny-Phi2-3B (RGB-D) | 85.8 | 50.0 | 75.0 | 90.4 | 43.3 | 28.3 |
| SpatialBot-Phi2-3B (RGB-D) | >99 | 61.8 | 80.0 | 91.7 | 55.0 | 26.7 |
Salient findings:
- Depth (RGB-D) signals are essential for high-precision depth prediction; fine-tuning is required to exploit them fully (SpatialBot-Phi2-3B (RGB-D): >99%).
- Generic VLMs and popular architectures (e.g., GPT-4o) show limited transfer to spatial tasks, especially for Position and Size.
- Counting is less sensitive to depth or fine-tuning, staying in the 84–92% accuracy range across the reported models.
- Reaching/contact and Size remain hard even with RGB-D inputs—suggesting these rely on high-level geometric or physical reasoning not captured by current LLM or VLM encoders (Cai et al., 19 Jun 2024).
5. Distinctiveness Versus Prior Benchmarks
Unlike VQA-v2, GQA, MMBench, or SceneGraph-QA, which are 2D-centric, SpatialBench leverages raw metric depth, physical-contact, and superlative object comparisons. In existing benchmarks, object proximity or “left/right” can be deduced from projected bounding boxes without 3D calibration; SpatialBench’s formulation explicitly requires metric depth for correctness (Cai et al., 19 Jun 2024).
SpatialBench also differs by issuing multiple diverse task types per image and using paired accuracy constraints to prevent superficial multi-task overfitting.
6. Implications and Utility
SpatialBench enables systematic ablations, comparing any new VLM variant (RGB-only, RGB-D, fine-tuned, instruction-tuned) for 3D reasoning capabilities; a minimal evaluation-harness sketch follows the list below. The suite is directly applicable as a diagnostic for:
- Embodied-AI agents with manipulation, navigation, or placement policies
- Models employing point-cloud or depth-sensor extensions
- Future work on spatial instruction-following, physical simulation, or comprehensive 3D visual understanding.
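Below is a minimal Python sketch of such an RGB vs. RGB-D ablation loop. The `model.answer(rgb, depth, question)` call and the sample dictionary keys are hypothetical interfaces (not part of the SpatialBench release), and the loop reuses the `task_accuracy` helper sketched earlier.

```python
def evaluate_ablation(model, benchmark, use_depth: bool):
    """Run one ablation pass (RGB-only vs. RGB-D) over a SpatialBench-style
    iterable of samples and return per-task accuracy.

    `model.answer(rgb, depth, question)` and the sample dict keys are
    hypothetical; adapt them to the VLM wrapper and loader actually in use.
    """
    per_task = {}
    for sample in benchmark:
        depth = sample["depth"] if use_depth else None
        pred = model.answer(sample["rgb"], depth, sample["question"])
        per_task.setdefault(sample["task"], []).append((pred, sample["answer"]))

    return {
        task: task_accuracy([p for p, _ in pairs], [a for _, a in pairs], task=task)
        for task, pairs in per_task.items()
    }
```

Running this twice, once with `use_depth=False` and once with `use_depth=True`, reproduces the kind of RGB vs. RGB-D comparison reported in Section 4.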
Notably, SpatialBench has already been used to validate that depth-API augmentation and progressive tuning (SpatialBot) increase raw depth estimation from ~70% to >99% and yield marked increases on proximity/contact queries (Cai et al., 19 Jun 2024). This suggests the benchmark is sensitive enough to guide model improvement and the design of spatially-aware VLMs.
7. Future Directions
SpatialBench points toward several promising avenues:
- Developing VLMs that explicitly integrate scene geometry or 3D symbolic reasoning modules;
- Extending to larger benchmarks and domain-specific splits (e.g., robotics, outdoor navigation);
- Incorporating temporal sequences for tracking, sequential manipulation, or physically-plausible reasoning.
A plausible implication is that SpatialBench's design, with its multi-task span and metric depth requirements, forms a rigorous foundation for the next generation of embodied AI evaluations, complementing emerging 3D and spatially-focused datasets and protocols (Cai et al., 19 Jun 2024).