TraceSpatial-Bench: 3D Spatial Trace Benchmark

Updated 4 July 2026

TraceSpatial-Bench is a dedicated benchmark that evaluates multi-step 3D spatial tracing in robotics, testing whether predicted traces are both semantically correct and collision-aware.
It employs 100 annotated real-world scenes with tasks like pick-and-place and push-and-pull to rigorously test 3D spatial reasoning.
The benchmark integrates both 2D and 3D evaluation criteria, assessing object localization, metric accuracy, and collision-free motion.

TraceSpatial-Bench is the dedicated benchmark introduced in "RoboTracer: Mastering Spatial Trace with Reasoning in Vision-LLMs for Robotics" for evaluating multi-step, metric-grounded spatial tracing in robotics-oriented vision-LLMs. It is designed to test the harder problem of generating a 3D spatial trace that is both semantically correct and collision-aware in cluttered real-world scenes. Existing benchmarks mostly do not cover this setting, because they focus either on spatial understanding/referring or on 2D visual trace prediction rather than on complete executable spatial plans in 3D (Zhou et al., 15 Dec 2025).

1. Conceptual scope and motivation

TraceSpatial-Bench is motivated by the claim that spatial tracing sits at the intersection of two difficult abilities: 3D spatial referring and 3D spatial measuring. In the benchmark’s framing, 3D spatial referring means identifying the right objects or regions in a scene from language, such as “the rightmost hamburger” or “the mug left of the laptop,” while 3D spatial measuring means understanding absolute scale and metric quantities such as depth, distance, width, height, and object-to-object spacing (Zhou et al., 15 Dec 2025).

The benchmark is introduced because earlier benchmarks do not fully test this combined capability. Some focus on 2D referring; others focus on depth or measurement; and 2D trace benchmarks do not capture whether a predicted path is physically valid in 3D, for example whether it would collide with obstacles or “float” above surfaces. TraceSpatial-Bench therefore evaluates whether a model can produce a complete spatial plan: start from the correct object, end at the correct destination region, and travel through a feasible intermediate path (Zhou et al., 15 Dec 2025).

The benchmark is explicitly aimed at multi-step reasoning, metric-grounded reasoning, object-centric 3D trace generation, and real-world geometric feasibility. A central implication of this design is that success requires more than endpoint localization. The benchmark tests whether a model can reason over geometry while preserving semantic alignment to the language instruction.

2. Dataset composition and annotation structure

TraceSpatial-Bench contains 100 manually annotated real-world scenes. The scene sources are split between 51 scenes from CA-1M and 49 scenes from ScanNet. Its task categories comprise 82 Pick-and-Place samples and 18 Push-and-Pull samples (Zhou et al., 15 Dec 2025).

Each benchmark sample includes the source RGB image, the corresponding absolute depth map, full camera parameters, a 2D mask for the object to be moved, a 3D bounding box for the target destination, and a reference 3D trajectory. The trajectory is described as a reference, not as the only valid answer. Because multiple collision-free motions may exist, evaluation is based on whether the predicted trace is feasible and correct under the benchmark criteria rather than on exact token-for-token path matching (Zhou et al., 15 Dec 2025).
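The per-sample contents listed above can be pictured as a single record. The following is a minimal sketch only: the class name, field names, file-path convention, and the axis-aligned min/max encoding of the destination box are all assumptions made for illustration, not the benchmark's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class TraceSample:
    """Hypothetical container for one benchmark sample (field names are illustrative)."""
    instruction: str                    # natural-language task prompt
    task_type: str                      # "pick_and_place" or "push_and_pull"
    rgb_path: str                       # source RGB image
    depth_path: str                     # absolute (metric) depth map
    intrinsics: Tuple[float, float, float, float]  # fx, fy, cx, cy (assumed pinhole model)
    extrinsics: List[List[float]]       # 4x4 camera pose matrix (assumed layout)
    source_mask_path: str               # 2D mask of the object to be moved
    dest_box_min: Vec3                  # destination 3D box, assumed axis-aligned
    dest_box_max: Vec3
    reference_trace: List[Vec3]         # reference (u, v, d) waypoints, not the only valid answer

    def __post_init__(self) -> None:
        # Basic sanity checks on the illustrative schema.
        if self.task_type not in ("pick_and_place", "push_and_pull"):
            raise ValueError(f"unknown task type: {self.task_type}")
        if not self.reference_trace:
            raise ValueError("reference trace must contain at least one waypoint")
```

Keeping the reference trajectory as one field among several mirrors the point made above: it is supporting annotation, while the mask and destination box are what the success criteria are checked against.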

The benchmark samples are described as requiring between 3 and 8 reasoning steps, although the paper's annotation analysis reports step counts from 2 to 8, and the statistics below include seven 2-step prompts. The prompt statistics reported by step count are as follows.

Step count	Prompts	Avg. words
2	7	10.14
3	16	14.81
4	16	17.19
5	28	22.29
6	21	27.48
7	7	30.86
8	5	34.60

These statistics show that as the reasoning chain gets longer, the language becomes more elaborate and the trace planning problem becomes more compositional. A common misunderstanding is to treat the benchmark as a simple destination-prediction dataset. Its annotation structure indicates otherwise: it is organized around start-object grounding, target-region grounding, and feasible intermediate motion in 3D.
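As a quick consistency check on the table above, the sketch below recomputes the total number of prompts and the prompt-weighted averages. The rows are copied from the table; the weighted means are derived figures for illustration, not values reported in the paper, and the variable names are arbitrary.

```python
# (step_count, num_prompts, avg_words) rows copied from the table above.
STEP_STATS = [
    (2, 7, 10.14),
    (3, 16, 14.81),
    (4, 16, 17.19),
    (5, 28, 22.29),
    (6, 21, 27.48),
    (7, 7, 30.86),
    (8, 5, 34.60),
]

# Total prompts across all step counts.
total_prompts = sum(n for _, n, _ in STEP_STATS)

# Prompt-weighted mean step count and mean prompt length.
mean_steps = sum(s * n for s, n, _ in STEP_STATS) / total_prompts
mean_words = sum(w * n for _, n, w in STEP_STATS) / total_prompts

print(total_prompts)         # 100
print(round(mean_steps, 2))  # 4.81
print(round(mean_words, 2))  # 21.73
```

The rows sum to 100 prompts, matching the 100 annotated scenes, and more than half of the prompts require five or more steps.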

3. Task formulation and trace representation

TraceSpatial-Bench focuses on object-centric spatial tracing. Its two core benchmark task types are Pick-and-place, in which the model must identify a source object and a destination region and then predict a feasible spatial path, and Push-and-pull, which has the same general structure but uses pushing or pulling rather than grasp-and-carry (Zhou et al., 15 Dec 2025).

The benchmark situates these tasks within a broader set of spatial reasoning subskills: 3D spatial referring, 3D spatial measuring, multi-step trace generation, and collision-free motion reasoning. For this reason, the task is characterized not merely as localization but as a reasoning-over-geometry task (Zhou et al., 15 Dec 2025).

A sample is organized around a scene image, a depth map, camera intrinsics/extrinsics, a source object mask, a destination 3D bounding box, and a reference trace. The input query asks the model to produce a spatial trace that moves the source object to the destination under the stated constraints. The expected output format is a list of $(u,v,d)$ waypoints with normalized image coordinates and metric depth (Zhou et al., 15 Dec 2025).

The paper uses a unified trace representation: $\tau = \{p_t\}_{t=1}^{T}, \quad p_t = (u_t,v_t,d_t)$ where $u_t, v_t$ are image-plane coordinates and $d_t$ is absolute depth. This representation can be projected to 2D or lifted to 3D, which makes the benchmark compatible with both image-plane and metric-space evaluation (Zhou et al., 15 Dec 2025).
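To make the "project to 2D or lift to 3D" property concrete, here is a minimal sketch under stated assumptions: $u, v$ are normalized to $[0, 1]$ by image width and height, $d$ is depth along the camera's optical axis in meters, and the camera follows a pinhole model with intrinsics `fx, fy, cx, cy` in pixels. The function names are hypothetical, and the paper's exact conventions may differ.

```python
from typing import List, Tuple

Waypoint = Tuple[float, float, float]  # (u, v, d): normalized image coords + metric depth
Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]

def project_to_pixels(trace: List[Waypoint], width: int, height: int) -> List[Point2D]:
    """Drop depth and scale normalized (u, v) to pixel coordinates."""
    return [(u * width, v * height) for u, v, _ in trace]

def lift_to_camera(trace: List[Waypoint], width: int, height: int,
                   fx: float, fy: float, cx: float, cy: float) -> List[Point3D]:
    """Back-project each waypoint into the camera frame (assumed pinhole model)."""
    points = []
    for u, v, d in trace:
        px, py = u * width, v * height
        x = (px - cx) * d / fx   # horizontal offset scales with depth
        y = (py - cy) * d / fy   # vertical offset scales with depth
        points.append((x, y, d))
    return points

# Example with made-up camera parameters.
trace = [(0.5, 0.5, 1.0), (0.75, 0.5, 2.0)]
pixels = project_to_pixels(trace, 640, 480)
points = lift_to_camera(trace, 640, 480, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Under these assumptions, the same list of waypoints yields both the image-plane path used for the 2D criteria and the metric path used for the 3D criteria.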

Representative examples illustrate the intended behavior. One benchmark-style example is: “Pick the rightmost hamburger and place it on the keyboard in front of the laptop without collisions.” In the paper’s description, the source is the rightmost hamburger, the destination is the keyboard region in front of the laptop, and the expected trace starts on the hamburger mask, moves through free space, and ends in the keyboard’s 3D destination box while avoiding obstacles such as the doll and screen. Another motivating example is “Water flowers from right to left with watering can hovering 1–5 cm above each flower.” This example emphasizes that spatial tracing may require both object ordering and metric height control (Zhou et al., 15 Dec 2025).

4. Evaluation protocol and metrics

TraceSpatial-Bench evaluates performance at both the 2D projection level and the 3D geometry level. It also defines “overall success” by combining localization correctness with collision-free feasibility (Zhou et al., 15 Dec 2025).

The benchmark’s success criteria are:

2D Start Success: a prediction succeeds if the predicted start point lies inside the ground-truth 2D mask of the source object.
2D End Success: a prediction succeeds if at least one of the final three predicted points lies inside the projected 2D destination bounding box.
3D Start Success: a prediction succeeds if the predicted 3D start point is within 20 cm of the point cloud of the source object (the object to be moved).
3D End Success: a prediction succeeds if at least one of the final three predicted 3D points is within 20 cm of the 3D destination bounding box (Zhou et al., 15 Dec 2025).
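The four criteria above can be sketched as simple geometric checks. This is an illustrative reading, not the benchmark's evaluation code: the mask is assumed to be a row-major grid of booleans, the 2D and 3D destination boxes are assumed to be axis-aligned, distances are in meters, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
THRESH_M = 0.20  # the 20 cm tolerance from the criteria above

def start_2d_success(start_px: Tuple[float, float], mask: Sequence[Sequence[bool]]) -> bool:
    """Start point (x, y in pixels) must fall on the source object's 2D mask."""
    x, y = int(start_px[0]), int(start_px[1])
    in_bounds = 0 <= y < len(mask) and 0 <= x < len(mask[0])
    return in_bounds and bool(mask[y][x])

def end_2d_success(trace_px: List[Tuple[float, float]],
                   box2d: Tuple[float, float, float, float]) -> bool:
    """At least one of the final three points lies in the projected box (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box2d
    return any(xmin <= x <= xmax and ymin <= y <= ymax for x, y in trace_px[-3:])

def start_3d_success(start_xyz: Vec3, cloud: List[Vec3], thresh: float = THRESH_M) -> bool:
    """Start point must be within `thresh` of the nearest source-object point."""
    return min(math.dist(start_xyz, p) for p in cloud) <= thresh

def dist_to_box(p: Vec3, box_min: Vec3, box_max: Vec3) -> float:
    """Euclidean distance from a point to an axis-aligned box (0 if inside)."""
    gaps = [max(lo - c, 0.0, c - hi) for c, lo, hi in zip(p, box_min, box_max)]
    return math.sqrt(sum(g * g for g in gaps))

def end_3d_success(trace_xyz: List[Vec3], box_min: Vec3, box_max: Vec3,
                   thresh: float = THRESH_M) -> bool:
    """At least one of the final three 3D points is within `thresh` of the destination box."""
    return any(dist_to_box(p, box_min, box_max) <= thresh for p in trace_xyz[-3:])
```

Checking the last three points rather than only the final one gives a trace some slack at the destination, for example when it slightly overshoots the destination region before its final waypoint.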

A task is considered successful in Overall 3D Success only if three conditions hold: the start is correct in 3D, the end is correct in 3D, and the simulated movement along the predicted trace is collision-free. The collision condition is defined so that, during simulated movement, no more than 20% of the object’s points may intersect the environment’s 3D occupancy map (Zhou et al., 15 Dec 2025).
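One way to read the collision condition is sketched below. Several simplifications are assumptions of this example rather than details from the paper: the occupancy map is a set of occupied voxel indices that excludes the moved object itself, the object is translated rigidly (no rotation) by the displacement from the first waypoint, the 20% limit is applied at each waypoint separately, and segments between waypoints are not interpolated. The voxel size and all names are hypothetical.

```python
import math
from typing import List, Set, Tuple

Vec3 = Tuple[float, float, float]
Voxel = Tuple[int, int, int]

def voxel_of(p: Vec3, voxel_size: float) -> Voxel:
    """Index of the voxel containing point p."""
    return (math.floor(p[0] / voxel_size),
            math.floor(p[1] / voxel_size),
            math.floor(p[2] / voxel_size))

def collision_fraction(object_pts: List[Vec3], offset: Vec3,
                       occupied: Set[Voxel], voxel_size: float) -> float:
    """Fraction of translated object points that land in occupied voxels."""
    hits = sum(
        voxel_of((x + offset[0], y + offset[1], z + offset[2]), voxel_size) in occupied
        for x, y, z in object_pts
    )
    return hits / len(object_pts)

def is_collision_free(object_pts: List[Vec3], trace_xyz: List[Vec3], occupied: Set[Voxel],
                      voxel_size: float = 0.05, max_fraction: float = 0.20) -> bool:
    """True if no waypoint puts more than `max_fraction` of the object inside occupied space."""
    start = trace_xyz[0]
    for wp in trace_xyz:
        offset = (wp[0] - start[0], wp[1] - start[1], wp[2] - start[2])
        if collision_fraction(object_pts, offset, occupied, voxel_size) > max_fraction:
            return False
    return True

def overall_3d_success(start_ok: bool, end_ok: bool, collision_free: bool) -> bool:
    """Overall 3D Success requires all three conditions to hold together."""
    return start_ok and end_ok and collision_free
```

Because the three conditions are combined with a logical AND, the overall rate can never exceed the lowest of the 3D start and 3D end rates, and it is usually lower.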

For benchmark reporting, results are summarized as success rates over all $N$ samples:

$\text{ASR} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}[\text{trace}_i \text{ is correct}]$

In the benchmark tables, “Overall” corresponds to this type of aggregate success metric after applying the 2D and 3D start-end criteria and collision-free validation (Zhou et al., 15 Dec 2025).
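The aggregate is a plain mean of per-sample indicator values. A minimal sketch, with a hypothetical function name and made-up per-sample outcomes:

```python
from typing import Iterable

def success_rate(flags: Iterable[bool]) -> float:
    """Mean of per-sample success indicators, i.e. the fraction of correct traces."""
    flags = list(flags)
    if not flags:
        raise ValueError("need at least one sample")
    return sum(bool(f) for f in flags) / len(flags)

# Illustrative only: 45 successes out of 100 samples.
outcomes = [True] * 45 + [False] * 55
rate = success_rate(outcomes)
print(f"{rate:.0%}")  # 45%
```

With $N = 100$ scenes, each sample contributes exactly one percentage point, so a reported score of 45 corresponds to 45 successful scenes.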

The use of distance thresholds rather than exact coordinate equality is a deliberate feature of the benchmark. The 3D start criterion is within 20 cm of the point cloud of the object to be moved, and the 3D end criterion is within 20 cm of the destination box. This reflects the fact that a feasible trace in real 3D space does not require exact pointwise equality; it requires geometric validity (Zhou et al., 15 Dec 2025).

The paper also defines several reward-training metrics that clarify the benchmark’s evaluation philosophy. The Point reward is

$R_P = \tfrac{1}{2}\bigl[f(p_1, \hat{p}_1) + f(p_T, \hat{p}_T)\bigr]$

with

$f(p,p') = \max\bigl(0,\;1-\|p-p'\|_2^2\bigr).$

The Trace reward is

$R_T = \max\bigl(0,\;1-d(\tau,\hat{\tau})\bigr),$

where $d(\tau,\hat{\tau})$ is a trajectory distance metric. For referring, 2D coordinate error is measured with an L1-style criterion and a threshold based on image size, and depth error is judged within a relative tolerance. For measuring and scale, a prediction is counted as correct if it lies within a fixed tolerance of the ground truth. These are not the benchmark’s final score, but they show how the underlying system defines metric sensitivity (Zhou et al., 15 Dec 2025).
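The Point and Trace rewards above can be sketched directly from their formulas. The trajectory distance $d$ is not specified in this text, so the example substitutes a mean pointwise Euclidean distance over equal-length traces; that choice, the coordinate space, and all function names are assumptions for illustration.

```python
import math
from typing import Callable, List, Tuple

Vec3 = Tuple[float, float, float]
Trace = List[Vec3]

def point_score(p: Vec3, q: Vec3) -> float:
    """f(p, p') = max(0, 1 - ||p - p'||_2^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(p, q))
    return max(0.0, 1.0 - sq_dist)

def point_reward(trace_a: Trace, trace_b: Trace) -> float:
    """R_P: average of the start-point and end-point scores."""
    return 0.5 * (point_score(trace_a[0], trace_b[0]) + point_score(trace_a[-1], trace_b[-1]))

def mean_pointwise_distance(trace_a: Trace, trace_b: Trace) -> float:
    """Stand-in for d(tau, tau_hat); NOT necessarily the paper's trajectory metric."""
    if len(trace_a) != len(trace_b):
        raise ValueError("this stand-in distance needs equal-length traces")
    return sum(math.dist(a, b) for a, b in zip(trace_a, trace_b)) / len(trace_a)

def trace_reward(trace_a: Trace, trace_b: Trace,
                 distance: Callable[[Trace, Trace], float] = mean_pointwise_distance) -> float:
    """R_T = max(0, 1 - d(tau, tau_hat))."""
    return max(0.0, 1.0 - distance(trace_a, trace_b))

# Example: the traces agree except for a 0.5 depth error at the final waypoint.
reference = [(0.0, 0.0, 1.0), (0.5, 0.5, 1.0), (1.0, 1.0, 1.0)]
predicted = [(0.0, 0.0, 1.0), (0.5, 0.5, 1.0), (1.0, 1.0, 1.5)]
r_p = point_reward(predicted, reference)  # 0.5 * (1.0 + 0.75) = 0.875
r_t = trace_reward(predicted, reference)  # 1 - 0.5 / 3
```

Both rewards are clipped at zero, so a prediction that is far off scores 0 rather than going negative. Passing the distance function as a parameter keeps the sketch usable with whatever trajectory metric is actually intended.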

5. Reported performance and empirical difficulty

The paper reports that TraceSpatial-Bench is very difficult for existing models. On Overall 3D Success, Gemini-2.5-Pro is reported at 3%, Qwen3-VL-8B at 8%, and the strongest reported configuration, RoboTracer-2B-RFT with R.I.D. + checkmark, at 45% (Zhou et al., 15 Dec 2025).

Model	Input	Overall (%)
Gemini-2.5-Pro	RGB	3
Qwen3-VL-4B	RGB	6
Qwen3-VL-8B	RGB	8
RoboTracer-2B-SFT	RGB	31
RoboTracer-2B-SFT	R.I.D.	39
RoboTracer-2B-RFT	RGB + checkmark	39
RoboTracer-2B-RFT	R.I.	40
RoboTracer-2B-RFT	R.I.D. + checkmark	45

The full table further reports the 2D Start, 2D End, 3D Start, and 3D End values. For example, Gemini-2.5-Pro records 31 on 2D Start, 33 on 2D End, 9 on 3D Start, 16 on 3D End, and 3 on Overall, while RoboTracer-2B-RFT with R.I.D. + checkmark records 69, 53, 78, 61, and 45, respectively (Zhou et al., 15 Dec 2025).

The paper states that RoboTracer achieves state-of-the-art performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% accuracy. It also states that general VLM baselines exhibit extremely low overall performance, which is consistent with the benchmark’s requirement that a prediction be simultaneously semantically correct, metrically grounded, and collision-free (Zhou et al., 15 Dec 2025).

The benchmark’s own usage guidance provides an interpretation framework for these scores. High 2D but low 3D performance usually means that a model can localize objects in the image but lacks true metric depth understanding. High start/end but low overall success usually means that the model can find endpoints but fails to generate a physically feasible path. Improvements with richer geometry inputs indicate that explicit depth and camera geometry help metric reasoning (Zhou et al., 15 Dec 2025). This suggests that the benchmark is especially diagnostic for disentangling image-space grounding from executable 3D planning.
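The first two interpretation rules above can be turned into a rough diagnostic. The sketch below is a heuristic reading only: the 15-point gap threshold, the dictionary keys, and the function name are hypothetical choices, and the two example rows reuse the per-metric scores quoted earlier in this section.

```python
from typing import Dict, List

def diagnose(scores: Dict[str, float], gap: float = 15.0) -> List[str]:
    """Flag score patterns described in the usage guidance (threshold `gap` is an assumption)."""
    notes = []
    mean_2d = (scores["2d_start"] + scores["2d_end"]) / 2
    mean_3d = (scores["3d_start"] + scores["3d_end"]) / 2
    # High 2D but low 3D: image-space localization without metric depth understanding.
    if mean_2d - mean_3d >= gap:
        notes.append("2D grounding outpaces 3D metric understanding")
    # High start/end but low overall: endpoints are found, full traces often fail.
    if min(scores["3d_start"], scores["3d_end"]) - scores["overall"] >= gap:
        notes.append("3D endpoints are found but overall success lags")
    return notes

gemini = {"2d_start": 31, "2d_end": 33, "3d_start": 9, "3d_end": 16, "overall": 3}
robotracer = {"2d_start": 69, "2d_end": 53, "3d_start": 78, "3d_end": 61, "overall": 45}

print(diagnose(gemini))      # ['2D grounding outpaces 3D metric understanding']
print(diagnose(robotracer))  # ['3D endpoints are found but overall success lags']
```

A gap between endpoint rates and the overall rate can reflect collisions along the path, but also samples where only one of the two endpoints is correct, so the second flag is a prompt for closer inspection rather than a conclusion.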

6. Position within spatial evaluation research

TraceSpatial-Bench occupies a distinct position within recent spatial benchmarking. The paper introduces it specifically because existing benchmarks mostly cover either spatial understanding/referring or 2D visual trace prediction, not the generation of a 3D spatial trace that is both semantically correct and collision-aware in cluttered real-world scenes (Zhou et al., 15 Dec 2025).

This differentiates it from benchmarks such as SpatialBench, which evaluates multimodal LLMs through a five-level hierarchy of spatial cognition and 15 video-based spatial tasks, ranging from basic observation to high-level planning (Xu et al., 26 Nov 2025). It also differs from SpatialBench-UC, which is a small, reproducible benchmark for pairwise spatial relations in text-to-image generation, with a checker that may output PASS, FAIL, or UNDECIDABLE and emphasizes risk–coverage tradeoffs under uncertainty (Rostane, 19 Jan 2026).

TraceSpatial-Bench is narrower in modality and broader in executional constraint. It is narrower because it is centered on object-centric trace generation rather than on general multimodal spatial cognition or text-to-image prompt following. It is broader in physical commitment because correctness requires feasible intermediate motion and collision validation in 3D. A plausible implication is that it serves as a bridge between perception-oriented spatial evaluation and embodied task execution.

A recurrent misconception in spatial evaluation is that correct endpoint prediction is sufficient evidence of spatial competence. TraceSpatial-Bench explicitly rejects that assumption. Its structure requires the model to start from the correct object, reach the correct destination region, and traverse a feasible path under real-world geometric constraints. For that reason, it is best understood as a benchmark of executable spatial planning rather than of isolated spatial description.