
SpatialRGPT-Bench: A 3D Spatial Evaluation Benchmark

Updated 26 November 2025
  • SpatialRGPT-Bench is a 3D-grounded benchmark that assesses vision-language models' spatial reasoning using unified camera-centric annotations across varied environments.
  • It operationalizes qualitative relation classification and quantitative metric estimation, scoring models by accuracy, success rate, and relative error on object-centric spatial measurements.
  • The benchmark supports multiple prompting strategies, including region-aware masking and language-only object references, providing actionable insights for advancing spatial cognition in VLMs.

SpatialRGPT-Bench is a 3D-grounded evaluation benchmark designed to rigorously assess the spatial reasoning capabilities of vision-language models (VLMs). Developed to probe both fine-grained quantitative spatial measurements and qualitative spatial relationships across a wide variety of indoor, outdoor, and simulated environments, SpatialRGPT-Bench uses ground-truth 3D annotations and region-level prompts to evaluate models' abilities to reason about object-centric spatial arrangements. It supports a spectrum of VLM architectures and prompting strategies, including region-aware feature masking and pure language-based object reference, and has become a pivotal benchmark for spatial cognition in advanced VLMs such as SpatialRGPT and SD-VLM (Cheng et al., 3 Jun 2024, Chen et al., 22 Sep 2025).

1. Dataset Composition and Ground-Truth Structure

SpatialRGPT-Bench aggregates scenes from five well-established datasets: SUNRGBD, ARKitScenes, nuScenes, KITTI, and Hypersim, spanning real and synthetic indoor/outdoor environments. In each scene, objects are annotated with pre-computed Omni3D cuboids in a unified camera-centric 3D space. For every object, the dataset records 3D centroids and axis-aligned bounding-box dimensions (width, height, depth), enabling metric extraction of spatial relationships and physical attributes. The object vocabulary comprises 88 classes shared across environments (e.g., car, chair, lamp).

Each QA instance is formulated with respect to two region proposals—either masks or bounding boxes—referring to unique object instances. This structure supports region-aware evaluation as well as language-only referral settings. In total, SpatialRGPT-Bench contains 657 qualitative and 749 quantitative VQA pairs targeting "unseen" scenes with no overlap with any training set (Cheng et al., 3 Jun 2024).
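
To make the annotation structure concrete, the sketch below shows what a single object annotation and QA instance might look like. The field names and types are illustrative assumptions for exposition, not the benchmark's released schema.

```python
from dataclasses import dataclass

# Illustrative field names only; the actual SpatialRGPT-Bench schema may differ.

@dataclass
class ObjectAnnotation:
    category: str                            # one of the 88 shared classes, e.g. "car"
    centroid: tuple[float, float, float]     # (x, y, z) in camera-centric metres
    dimensions: tuple[float, float, float]   # (width, height, depth) of the Omni3D cuboid
    region: dict                             # 2D mask or bounding box used as the region prompt

@dataclass
class BenchQA:
    image_path: str
    regions: tuple[ObjectAnnotation, ObjectAnnotation]  # the two referenced objects
    question: str                            # e.g. "How far apart are <A> and <B> in meters?"
    answer: str                              # "Yes"/"No" or a numeric value with units
    task_type: str                           # "qualitative" or "quantitative"
```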

2. Task Families and Question Formats

SpatialRGPT-Bench defines two core task domains, each operationalized as VQA exchanges relying on image and region or object references:

  • Qualitative Relation Classification: Tasks include binary classification of spatial relations such as Left/Right, Above/Below, Front/Behind, Big/Small (area-wise), Tall/Short (height-wise), and Wide/Thin (width-wise). Typical prompts ask "Is <object A> to the left of <object B>?" with expected answers of “Yes” or “No.”
  • Quantitative Metric Estimation: Tasks require precise estimation of object-centric spatial metrics:
    • Direct (Euclidean) distance between object centroids
    • Horizontal and vertical distances (x and z axes)
    • Object width and height (bounding box extents)
    • Direction angle between objects, reported in o'clock notation
    Questions are phrased as "How far apart are <object A> and <object B> in meters?" or "At what o'clock is <object B> from <object A>?", with answers given as numeric values in metric units (or as clock positions for direction).

This broad coverage enables evaluation of VLMs’ ability to discern both absolute spatial quantities and relative arrangements (Cheng et al., 3 Jun 2024, Chen et al., 22 Sep 2025).
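
Because the annotations are fully metric, both qualitative labels and quantitative answers can be derived directly from two camera-centric centroids. The sketch below assumes an axis convention (x to the right, y vertical, z depth) and a 12-o'clock-equals-straight-ahead convention; `spatial_answers` is a hypothetical helper for exposition, not part of the benchmark's released code, and the benchmark's exact conventions may differ.

```python
import math

def spatial_answers(c_a, c_b):
    """Derive illustrative ground-truth answers for two object centroids.

    c_a, c_b: (x, y, z) centroids in a camera-centric frame (metres).
    Assumes x points right, y is vertical, and z points away from the camera.
    """
    dx, dy, dz = (b - a for a, b in zip(c_a, c_b))

    euclidean = math.sqrt(dx**2 + dy**2 + dz**2)   # direct distance
    horizontal = abs(dx)                           # left-right offset
    vertical = abs(dy)                             # height offset

    # Qualitative relations follow directly from the signs of the offsets.
    left_right = "B is right of A" if dx > 0 else "B is left of A"
    front_behind = "B is behind A" if dz > 0 else "B is in front of A"

    # Direction in the horizontal (x-z) plane mapped to clock notation,
    # with 12 o'clock pointing straight ahead of object A.
    angle = math.degrees(math.atan2(dx, dz)) % 360
    oclock = round(angle / 30) % 12 or 12

    return {
        "euclidean_m": euclidean,
        "horizontal_m": horizontal,
        "vertical_m": vertical,
        "left_right": left_right,
        "front_behind": front_behind,
        "direction_oclock": oclock,
    }
```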

3. Evaluation Protocols and Metrics

SpatialRGPT-Bench adopts stringent, task-tailored metrics, leveraging ground-truth 3D data for scoring:

  • Qualitative Tasks: Accuracy is defined as the proportion of correct {“Yes”, “No”} predictions:

\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\bigl(\hat{y}_i = y_i\bigr)

  • Quantitative Tasks: Multiple metrics apply:

    • Success Rate (SR): Correct if the estimate lies within 25% of ground truth:

    \mathrm{SR} = \frac{1}{N}\sum_{i=1}^{N} \mathbf{1}\bigl(|\hat{x}_i - x_i| \le 0.25\,x_i\bigr)

    • Mean Absolute Relative Error (MARE), for paired distances/sizes:

    \mathrm{MARE} = \mathbb{E}\bigl[\,|\hat{d} - d^*| / d^*\,\bigr]

    • Mean Absolute Distance Error (MDisE), for centroid distances:

    \mathrm{MDisE} = \frac{1}{N}\sum_{i=1}^{N} \bigl|\,\|\hat{p}_i\| - \|p_i\|\,\bigr|

    • Directional Error (MDE): Mean angular difference between predicted and true horizontal direction vectors.

Answer parsing is performed with GPT-4-Turbo during recent SD-VLM evaluations, ensuring robust numeric extraction and consistency in score computation (Chen et al., 22 Sep 2025).
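
The metric definitions above translate directly into code. The following sketch assumes the numeric answers have already been extracted (e.g., by the GPT-4-Turbo parsing step) and simply mirrors the formulas; it is not the benchmark's official scoring script.

```python
import numpy as np

def accuracy(pred, gold):
    """Qualitative accuracy over "Yes"/"No" predictions."""
    pred, gold = np.asarray(pred), np.asarray(gold)
    return float((pred == gold).mean())

def success_rate(pred, gold, tol=0.25):
    """Fraction of estimates within `tol` (25%) of the ground-truth value."""
    pred, gold = np.asarray(pred, float), np.asarray(gold, float)
    return float((np.abs(pred - gold) <= tol * gold).mean())

def mare(pred, gold):
    """Mean absolute relative error for paired distances/sizes."""
    pred, gold = np.asarray(pred, float), np.asarray(gold, float)
    return float(np.mean(np.abs(pred - gold) / gold))

def mean_distance_error(pred, gold):
    """Mean absolute error on centroid distances (metres)."""
    pred, gold = np.asarray(pred, float), np.asarray(gold, float)
    return float(np.mean(np.abs(pred - gold)))

def mean_directional_error(pred_deg, gold_deg):
    """Mean angular difference (degrees), wrapped to the [0, 180] range."""
    diff = np.abs(np.asarray(pred_deg, float) - np.asarray(gold_deg, float)) % 360
    return float(np.mean(np.minimum(diff, 360 - diff)))
```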

4. Prompting Strategies and Baseline Categories

SpatialRGPT-Bench supports diverse prompting strategies relevant to both region-aware and pure language-based VLMs:

  • Region-aware prompts: Input is image plus two pixel-masked regions, facilitating fine-grained localization and feature fusion.
  • Language-referral prompts: Object references are encoded solely in text, removing reliance on pixel-level localization; recent SD-VLM results use this protocol after re-annotating object references with Qwen2.5-VL in pure language form (Chen et al., 22 Sep 2025).
  • Blind LLMs: Baselines such as GPT-4 (text only) and GPT-4V (image + text) operate without explicit region masking.

Ablation protocols test the effect of input modality (mask vs. bounding box), depth encoding (RGB vs. RGBD), and region-awareness, producing comparative insights into architectural efficacy (Cheng et al., 3 Jun 2024, Chen et al., 22 Sep 2025).
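
To illustrate how the two protocols differ in what the model actually receives, the sketch below constructs the same benchmark question under a region-aware and a language-referral setting. The placeholder tokens, file names, and templates are illustrative assumptions, not the benchmark's exact prompt format.

```python
def region_aware_prompt(question: str) -> dict:
    """Region-aware setting: the image plus two pixel masks are supplied,
    and the question refers to the regions symbolically."""
    return {
        "image": "scene.jpg",                          # hypothetical file names
        "region_masks": ["mask_a.png", "mask_b.png"],
        "text": question.replace("<object A>", "<region 1>")
                        .replace("<object B>", "<region 2>"),
    }

def language_referral_prompt(question: str, name_a: str, name_b: str) -> dict:
    """Language-referral setting: objects are identified purely in text,
    with no pixel-level localization provided to the model."""
    return {
        "image": "scene.jpg",
        "text": question.replace("<object A>", name_a).replace("<object B>", name_b),
    }

# Example: the same benchmark question rendered under both protocols.
q = "How far apart are <object A> and <object B> in meters?"
print(region_aware_prompt(q)["text"])
print(language_referral_prompt(q, "the red car", "the traffic light")["text"])
```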

5. Comparative Performance and Analysis

SpatialRGPT-Bench has facilitated wide benchmarking of VLMs. Summary results (quantitative SR and qualitative accuracy) include:

Model            | Quant. SR (%) / Mean Rel. Err | Qual. Acc. (%)
SD-VLM           | 33.3 / 0.51                   | 65.5
SpatialRGPT      | 28.7 / 0.58                   | 57.8
Intern-VL3-78B   | 23.5 / 1.32                   | 62.2
Qwen2.5-VL-72B   | 16.3 / 1.05                   | 61.4
GPT-4o           | 13.0 / 0.69                   | 60.5
Gemini-2         | 23.0 / 2.14                   | 57.6
LLaVA-1.5-7B     | 16.2 / 0.83                   | 26.3
SpatialBot (Cai) | 13.2 / 1.69                   | 55.9

SD-VLM demonstrates state-of-the-art performance, exceeding the previous best, SpatialRGPT, by over 4.5 points in quantitative success rate and reducing mean relative error from 0.58 to 0.51. Qualitative relational accuracy likewise increases to 65.5% (a +7.7 point gain over SpatialRGPT). Even GPT-4o and Gemini-2 lag substantially on precise 3D measurement tasks (Chen et al., 22 Sep 2025).

Ablation studies in SD-VLM indicate that depth positional encoding is critical: replacing DPE with alternate depth representations (“depth as image” or “depth as token”) diminishes quantitative SR by 5–10 points (Chen et al., 22 Sep 2025). Strengths are observed in height and vertical-distance estimation (SR ≈ 42%) and relative comparisons (left/right, front/behind ≥ 65% accuracy).

6. Limitations and Open Challenges

Several persistent challenges manifest in SpatialRGPT-Bench evaluations:

  • Width and Euclidean Distance Estimation: Width SR remains at 26%, and direct distance at 25%, suggesting difficulty when computing non-axis-aligned or diagonal metrics.
  • Error Robustness: Mean relative error for direct distances holds at approximately 0.5; half-meter estimation errors for one-meter queries are common (Chen et al., 22 Sep 2025).
  • Off-Axis Geometry and Occlusion: Spatial generalization beyond monotonic depth gradients remains non-trivial; occlusion and off-axis geometry are persistent bottlenecks.

Cross-domain generalization is nonetheless evident: models trained solely on indoor MSMU data perform competitively on outdoor scenes, though fine-grained spatial perception continues to tax existing architectures.

7. Significance and Future Directions

SpatialRGPT-Bench delivers a rigorous, 3D-anchored framework for VLM spatial cognition assessment, making it a principal benchmark for spatially-augmented architectures. Its region-aware and object-centric design, along with extensive physical annotation, creates a robust challenge for quantitative and relational reasoning. The integration of depth encoding and region masking has proven essential for high-fidelity spatial perception.

A plausible implication is that continued improvements in spatial architecture (e.g., universal depth encoding, scalable region grounding) are necessary for advancing robotic manipulation, embodied AI, and real-world physical reasoning (Cheng et al., 3 Jun 2024, Chen et al., 22 Sep 2025). The benchmark exposes key failure modes and facilitates controlled comparative studies, playing a central role in directing next-generation VLM research toward robust 3D spatial generalization.
