SpatialBench: 3D Spatial Reasoning Benchmark
- SpatialBench is a specialized benchmark for assessing 3D spatial reasoning by overcoming the limitations of traditional 2D-based vision-language evaluations.
- It incorporates six spatial task categories (metric depth estimation, proximity/position, existence, counting, contact/reaching, and size comparison) to ensure comprehensive coverage of spatial understanding.
- The benchmark offers a standardized, drop-in evaluation suite that facilitates systematic ablations and improvements in embodied AI and spatially-aware vision-language models.
SpatialBench is a fixed, test-only benchmark designed for the comprehensive evaluation of spatial understanding in vision–language models (VLMs), with a focus on embodied AI and fine-grained 3D spatial reasoning. It addresses the limitation of previous benchmarks, which concentrate on 2D recognition or logical attribute combinations and lack metric depth queries or high-level physical-interaction assessments. SpatialBench unifies metric depth estimation, proximity comparisons, existence checks, contact/reaching questions, counting, and size reasoning in a multi-category suite, establishing a new diagnostic standard for spatial and embodied reasoning capabilities in multimodal systems (Cai et al., 19 Jun 2024).
1. Benchmark Scope and Motivation
SpatialBench was constructed to fill the lack of diagnostic tools for assessing a VLM's ability to perform genuine 3D spatial inference. Existing VQA and scene-understanding benchmarks (e.g., VQA-v2, GQA, MMBench, SceneGraph-QA) remain limited to 2D annotations or lack depth-measurement ground truth, enabling models to exploit 2D spatial cues without engaging in 3D reasoning. SpatialBench is the first evaluation set to require and quantify spatial skills across a spectrum from raw metric depth up to high-level contact (“has-touched?”), size superlatives, and multi-object positional queries.
The primary goals are:
- To isolate the effect of raw depth maps by comparing RGB-D with RGB-only VLM capabilities;
- To enforce diverse spatial tasks preventing overfitting to narrow subtasks;
- To provide a standard drop-in evaluation set with fine-grained, human- and LLM-verified queries (Cai et al., 19 Jun 2024).
2. Dataset Construction and Task Coverage
SpatialBench comprises 120 distinct RGB-D images, derived partly from MME’s public test suite and enhanced with 80 hand-annotated scenes. There are no train or validation splits—SpatialBench is strictly for test-only evaluation to avoid data contamination and overfitting (Cai et al., 19 Jun 2024).
Each image is rescaled to 384×384 pixels. Six spatial task categories are represented, each with 20 images, ensuring coverage of low-, mid-, and high-level spatial reasoning:
| Task Category | Query Type | Examples / Ground Truth |
|---|---|---|
| Depth Estimation | Metric prediction at annotated object | 60 depth queries; ground truth in millimeters (uint24) |
| Proximity/Position | Pairwise “Which is closer?” | 60 pairwise queries |
| Existence | Binary presence | 20 Yes/No queries |
| Counting | Integer class count | 20 queries |
| Reaching (Contact) | Has A touched B? Yes/No | 20 paired (pos/neg) queries |
| Size Comparison | Which of [A,B,...] is biggest? | 20 queries |
Depth ground truth is measured directly by a depth sensor or estimated via ZoeDepth monocular depth estimation (MDE), and stored in precise metric form. Proximity and size queries use object-level bounding boxes and depth. All data are human- and GPT-4o-verified for correctness and ambiguity reduction (Cai et al., 19 Jun 2024).
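As a concrete illustration of the data layout described above, the following Python sketch loads one RGB-D sample and rescales it to 384×384. The assumption that the depth map packs a 24-bit millimeter value across the three channels of a PNG is ours for illustration, not a documented detail of the release.

```python
import numpy as np
from PIL import Image

def load_rgbd_sample(rgb_path: str, depth_path: str, size: int = 384):
    """Load one SpatialBench-style RGB-D sample, resized to size x size.

    Assumes the depth map packs a 24-bit millimeter value across the three
    channels of a PNG (most-significant byte first); the actual release may
    use a different encoding.
    """
    rgb = Image.open(rgb_path).convert("RGB").resize((size, size), Image.BILINEAR)
    depth_img = Image.open(depth_path).convert("RGB").resize((size, size), Image.NEAREST)

    d = np.asarray(depth_img, dtype=np.uint32)
    # Recombine the three bytes into a single uint24 value, interpreted as millimeters.
    depth_mm = (d[..., 0] << 16) | (d[..., 1] << 8) | d[..., 2]
    return np.asarray(rgb), depth_mm.astype(np.float32) / 1000.0  # depth in meters
```

Nearest-neighbor resampling is used for the depth channel so that resizing does not interpolate across depth discontinuities.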
3. Task Definitions and Protocols
SpatialBench formalizes six categories of spatial queries. Representative definitions and evaluation details are:
- Depth Estimation: Given an annotated point or object in the image, the model predicts a metric depth $\hat{d}$. A prediction is considered correct if $|\hat{d} - d_{\mathrm{gt}}| / d_{\mathrm{gt}} \le 0.1$, i.e., within 10% of the ground-truth depth.
- Proximity/Position: Given a pair of objects $(A, B)$, select which object has the smaller center/minimum depth, i.e., which is closer to the camera.
- Existence: Returns Yes/No on the presence of a named object.
- Counting: Predicts count of a designated category.
- Reaching (Contact): Yes/No label for whether A physically contacts B (must handle both positive and negative forms).
- Size Comparison: Multi-choice for largest (or smallest) among specified objects.
Paired questions (e.g., both “Has A touched B?” and “Has B touched A?”) must be answered correctly for credit (“bonus” scoring). All tasks are scored by exact match, except depth estimation (10% tolerance) (Cai et al., 19 Jun 2024).
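A minimal Python sketch of this paired (“bonus”) scoring follows; the sample fields `pair_id`, `prediction`, and `answer` are hypothetical placeholders for however the release actually groups paired questions.

```python
from collections import defaultdict

def paired_accuracy(samples):
    """Paired ("bonus") scoring for Yes/No queries such as reaching/contact:
    credit is granted only if BOTH members of a pair are answered correctly.

    Each element of `samples` is assumed to be a dict with hypothetical keys
    'pair_id', 'prediction', and 'answer'.
    """
    pairs = defaultdict(list)
    for s in samples:
        is_correct = s["prediction"].strip().lower() == s["answer"].strip().lower()
        pairs[s["pair_id"]].append(is_correct)

    correct_pairs = sum(all(flags) for flags in pairs.values())
    return correct_pairs / max(len(pairs), 1)
```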
4. Evaluation Metrics and Baseline Results
Evaluation proceeds via per-task simple accuracy,

$$\mathrm{Acc}_{\text{task}} = \frac{N_{\text{correct}}}{N_{\text{total}}},$$

and for depth, a prediction $\hat{d}$ counts as correct when

$$\frac{|\hat{d} - d_{\mathrm{gt}}|}{d_{\mathrm{gt}}} \le 0.1.$$

There is no use of F1, precision, or recall, as all problems reduce to exact-match or thresholded metric accuracy (Cai et al., 19 Jun 2024).
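The following Python sketch mirrors this scoring; it assumes predictions and ground truths have already been parsed into comparable strings or floats, which is an assumption about data handling rather than the official evaluation script.

```python
def depth_correct(pred_m: float, gt_m: float, tol: float = 0.10) -> bool:
    """A depth prediction is correct if it lies within a relative tolerance
    (10% of the ground-truth value) of the ground truth."""
    return abs(pred_m - gt_m) <= tol * gt_m

def task_accuracy(predictions, ground_truths, task: str) -> float:
    """Simple per-task accuracy: exact string match for categorical tasks,
    relative-tolerance match for depth. Field handling is illustrative."""
    correct = 0
    for pred, gt in zip(predictions, ground_truths):
        if task == "depth":
            correct += depth_correct(float(pred), float(gt))
        else:
            correct += str(pred).strip().lower() == str(gt).strip().lower()
    return correct / max(len(ground_truths), 1)
```

Because the depth criterion is relative, it is unit-agnostic as long as predictions and ground truths share the same unit.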
Key baseline metrics (percent accuracy) are summarized below (excerpted from Table 1):
| Model | Depth | Position | Exist | Count | Reaching | Size |
|---|---|---|---|---|---|---|
| GPT-4o (RGB) | – | 70.6 | 85.0 | 84.5 | 51.7 | 43.3 |
| GPT-4o (RGB-D) | – | 61.8 | 90.0 | 85.2 | 51.7 | 40.0 |
| Bunny-Phi2-3B (RGB) | 70.6 | 50.0 | 75.0 | 89.4 | 51.7 | 26.7 |
| SpatialBot-Phi2-3B (RGB) | 84.1 | 64.7 | 80.0 | 88.0 | 61.7 | 28.3 |
| Bunny-Phi2-3B (RGB-D) | 85.8 | 50.0 | 75.0 | 90.4 | 43.3 | 28.3 |
| SpatialBot-Phi2-3B (RGB-D) | >99 | 61.8 | 80.0 | 91.7 | 55.0 | 26.7 |
Salient findings:
- Depth (RGB-D) signals are essential for high-precision depth prediction; fine-tuning is required to exploit them fully (SpatialBot-Phi2-3B (RGB-D): >99%).
- Generic VLMs and popular architectures (e.g., GPT-4o) show limited transfer to spatial tasks, especially for Position and Size.
- Counting is less sensitive to depth or fine-tuning, staying in the 84–92% accuracy range across the reported models.
- Reaching/contact and Size remain hard even with RGB-D inputs—suggesting these rely on high-level geometric or physical reasoning not captured by current LLM or VLM encoders (Cai et al., 19 Jun 2024).
5. Distinctiveness Versus Prior Benchmarks
Unlike VQA-v2, GQA, MMBench, or SceneGraph-QA, which are 2D-centric, SpatialBench leverages raw metric depth, physical-contact, and superlative object comparisons. In existing benchmarks, object proximity or “left/right” can be deduced from projected bounding boxes without 3D calibration; SpatialBench’s formulation explicitly requires metric depth for correctness (Cai et al., 19 Jun 2024).
SpatialBench also differs by issuing multiple diverse task types per image and using paired accuracy constraints to prevent superficial multi-task overfitting.
6. Implications and Utility
SpatialBench enables systematic ablations, comparing any new VLM variant (RGB-only, RGB-D, fine-tuned, instruction-tuned) for 3D reasoning capabilities; a minimal evaluation-harness sketch follows the list below. The suite is directly applicable as a diagnostic for:
- Embodied-AI agents with manipulation, navigation, or placement policies
- Models employing point-cloud or depth-sensor extensions
- Future work on spatial instruction-following, physical simulation, or comprehensive 3D visual understanding.
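Below is a minimal Python sketch of such an RGB vs. RGB-D ablation loop. The `model.answer(rgb, depth, question)` call and the sample dictionary keys are hypothetical interfaces (not part of the SpatialBench release), and the loop reuses the `task_accuracy` helper sketched earlier.

```python
def evaluate_ablation(model, benchmark, use_depth: bool):
    """Run one ablation pass (RGB-only vs. RGB-D) over a SpatialBench-style
    iterable of samples and return per-task accuracy.

    `model.answer(rgb, depth, question)` and the sample dict keys are
    hypothetical; adapt them to the VLM wrapper and loader actually in use.
    """
    per_task = {}
    for sample in benchmark:
        depth = sample["depth"] if use_depth else None
        pred = model.answer(sample["rgb"], depth, sample["question"])
        per_task.setdefault(sample["task"], []).append((pred, sample["answer"]))

    return {
        task: task_accuracy([p for p, _ in pairs], [a for _, a in pairs], task=task)
        for task, pairs in per_task.items()
    }
```

Running this twice, once with `use_depth=False` and once with `use_depth=True`, reproduces the kind of RGB vs. RGB-D comparison reported in Section 4.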
Notably, SpatialBench has already been used to validate that depth-API augmentation and progressive tuning (SpatialBot) increase raw depth estimation from ~70% to >99% and yield marked increases on proximity/contact queries (Cai et al., 19 Jun 2024). This suggests the benchmark is sensitive enough to guide model improvement and the design of spatially-aware VLMs.
7. Future Directions
SpatialBench points toward several promising avenues:
- Developing VLMs that explicitly integrate scene geometry or 3D symbolic reasoning modules;
- Extending to larger benchmarks and domain-specific splits (e.g., robotics, outdoor navigation);
- Incorporating temporal sequences for tracking, sequential manipulation, or physically-plausible reasoning.
A plausible implication is that SpatialBench's design, with its multi-task span and metric depth requirements, forms a rigorous foundation for the next generation of embodied AI evaluations, complementing emerging 3D and spatially-focused datasets and protocols (Cai et al., 19 Jun 2024).