NuScenes-SpatialQA: Spatial Reasoning in Driving
- NuScenes-SpatialQA is a benchmark that provides ground-truth-based QA for assessing the spatial reasoning of vision-language models in autonomous driving scenarios, using LiDAR-based 3D annotations as ground truth.
- It employs an automated 3D scene graph pipeline with task-driven QA templates to evaluate both qualitative and quantitative spatial attributes.
- The benchmark highlights challenges in numeric spatial inference and direct geometric reasoning, guiding improvements in vision-language models.
NuScenes-SpatialQA is a large-scale, ground-truth-based question answering (QA) benchmark designed specifically to evaluate the spatial understanding and reasoning capabilities of vision-language models (VLMs) in real-world autonomous driving scenarios. Built upon the nuScenes dataset, NuScenes-SpatialQA leverages a fully automated 3D scene graph pipeline and an extensive set of task-driven QA templates to systematically probe both qualitative and quantitative facets of spatial understanding, as well as the multi-hop and situational reasoning critical to autonomous driving decision-making. Unlike prior benchmarks that focus on toy or synthetic indoor scenes, or that rely on monocular depth estimation, NuScenes-SpatialQA exploits precise LiDAR-based ground truth to provide unambiguous, metric-accurate supervision for spatial relations, distances, and object sizes, filling a long-standing gap in the evaluation of spatially-aware VLMs (Tian et al., 4 Apr 2025).
1. Objectives and Scope
NuScenes-SpatialQA targets two core objectives: (1) to provide the first large-scale, ground-truth-driven QA benchmark for systematically evaluating VLMs on spatial understanding and reasoning in urban driving environments, and (2) to focus on scene-centric spatial tasks directly relevant to critical driving decisions, such as navigation, obstacle negotiation, and occlusion handling.
The benchmark addresses deficiencies in conventional VQA datasets, which either lack driving-specific spatial question types, rely on noisy or depth-biased estimates, or are limited to non-driving domains. By leveraging the nuScenes dataset's LiDAR annotations, NuScenes-SpatialQA eliminates depth-estimation bias and supports both qualitative (categorical/relational) and quantitative (metric, numeric) QA tasks, as well as multi-hop reasoning.
2. Dataset Construction and Scene Graph Pipeline
NuScenes-SpatialQA is built on the validation split of nuScenes trainval-v1.0, yielding 150 scenes × 40 keyframes (sampled at 2 Hz), with each keyframe captured by six cameras and annotated with precise 3D bounding boxes.
The automated 3D scene graph construction pipeline proceeds as follows:
- Each ground-truth 3D object is projected onto all available camera views (using nuscenes-devkit). Objects are filtered by minimum visibility and 2D bounding box size.
- All appearances of a distinct instance across keyframes are grouped, and the best single crop is selected via a metric combining 2D bounding box area and brightness.
- Concise attribute-centric captions (e.g., "red sedan with black roof") are synthesized using LLaMA-3.2 with structured prompting.
- Scene graph nodes are identified by unique instance_token, and include attributes for category_name, translation (in ego-vehicle coordinates), dimensions, and caption.
- Scene graph edges capture spatial relations: Euclidean distance, longitudinal and lateral offsets relative to the ego vehicle's heading, and the relative bearing angle θ = atan2(lateral, longitudinal), expressed in degrees (see the sketch below).
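A minimal sketch of how these edge attributes could be computed, assuming the object and ego positions (and the ego yaw) have already been read from the nuScenes annotations; the function name and frame conventions are illustrative, not the benchmark's actual code:

```python
import numpy as np

def ego_frame_relations(obj_xy, ego_xy, ego_yaw):
    """Illustrative computation of the edge attributes described above.

    obj_xy, ego_xy : (x, y) positions in the global nuScenes frame.
    ego_yaw        : ego heading in radians (e.g., from the ego pose quaternion).
    Returns Euclidean distance, longitudinal/lateral offsets, and bearing angle.
    """
    delta = np.asarray(obj_xy, dtype=float) - np.asarray(ego_xy, dtype=float)

    # Rotate the offset into the ego frame: longitudinal points along the
    # heading, lateral points to the ego vehicle's left (assumed convention).
    cos_y, sin_y = np.cos(ego_yaw), np.sin(ego_yaw)
    longitudinal = cos_y * delta[0] + sin_y * delta[1]
    lateral = -sin_y * delta[0] + cos_y * delta[1]

    distance = float(np.hypot(longitudinal, lateral))
    # Relative bearing: theta = atan2(lateral, longitudinal), in degrees.
    bearing_deg = float(np.degrees(np.arctan2(lateral, longitudinal)))
    return {"distance": distance,
            "longitudinal": float(longitudinal),
            "lateral": float(lateral),
            "bearing_deg": bearing_deg}
```

Under this assumed convention, a positive lateral offset places the object to the ego vehicle's left, and the bearing angle is zero for an object directly ahead.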
3. QA Generation, Coverage, and Categories
NuScenes-SpatialQA features an automated QA generation pipeline, utilizing more than 35 reusable templates spanning spatial understanding and reasoning. The dataset contains approximately 3.5 million QA pairs:
- Qualitative spatial understanding (≈2.5M): Binary yes/no comparisons of position (above/below, left/right, front/behind) and size (larger/smaller, taller/shorter, wider/thinner, longer/shorter).
- Quantitative spatial understanding (≈0.6M): Numeric requests for absolute distances, offsets, object dimensions, and relative angles.
- Direct and situational spatial reasoning (≈0.2M): Multi-choice or binary reasoning about proximity, presence in defined regions, and scenario-based inference (e.g., collision prediction, occlusion, or parking fit).
QA generation operates over roughly 6,000 keyframes (150 scenes × 40 keyframes), each with six camera views, ensuring broad coverage and redundancy for both frequent and rare object configurations.
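The templates themselves are not reproduced here, so the following is a hypothetical illustration of template-based QA generation over scene graph nodes and edges, reusing the ego-frame relations from the previous sketch; the template wording, field names, and left/right convention are assumptions rather than the authors' exact implementation:

```python
import numpy as np

# Hypothetical templates; the benchmark's >35 actual templates may differ.
QUALITATIVE_TEMPLATE = "Is the {a} to the left of the {b}?"
QUANTITATIVE_TEMPLATE = "What is the distance in meters between the {a} and the {b}?"

def make_qa_pairs(node_a, node_b, rel_a, rel_b):
    """Generate one qualitative and one quantitative QA pair for two objects.

    node_a, node_b : scene graph nodes with a "caption" attribute.
    rel_a, rel_b   : ego-frame relations of each node (see previous sketch).
    """
    # Left/right decided by comparing lateral offsets in the ego frame.
    qual_answer = "yes" if rel_a["lateral"] > rel_b["lateral"] else "no"
    # Metric answer from ground-truth positions; no depth estimation involved.
    dist_ab = float(np.hypot(rel_a["longitudinal"] - rel_b["longitudinal"],
                             rel_a["lateral"] - rel_b["lateral"]))
    return [
        {"type": "qualitative",
         "question": QUALITATIVE_TEMPLATE.format(a=node_a["caption"], b=node_b["caption"]),
         "answer": qual_answer},
        {"type": "quantitative",
         "question": QUANTITATIVE_TEMPLATE.format(a=node_a["caption"], b=node_b["caption"]),
         "answer": round(dist_ab, 1)},
    ]
```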
4. Task Structure, Question Taxonomy, and Evaluation
NuScenes-SpatialQA evaluates VLMs along multiple axes:
- Relative positioning: Queries about categorical spatial relations, e.g., "Is the pedestrian to the left of the red car?"
- Size comparison: Pairwise/comparative questions about volume or specific dimensions.
- Distance and offset estimation: Numeric questions requiring precise metric inference, such as determining the Euclidean distance or relative offset in the ego-vehicle frame.
- Dimension and angle measurement: Extraction of object size or orientation, e.g., "What is the bearing angle of the truck relative to the car?"
- Counting and selection: Identification of nearest, farthest, or specific objects from a candidate set.
- Scenario-based inference: Relational or temporal/logical queries, such as "Will these objects collide in five seconds given specific velocities?" (a sketch of such a ground-truth check follows this list).
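For scenario-based questions, the ground-truth answer can be derived geometrically from annotated positions and velocities. The following is an illustrative constant-velocity collision check, not the authors' exact formulation; the time step, horizon, and collision radius are assumed values:

```python
import numpy as np

def will_collide(pos_a, vel_a, pos_b, vel_b, horizon_s=5.0, radius_m=2.0, dt=0.1):
    """Answer a question like 'Will these objects collide in five seconds?'
    by rolling both objects forward under a constant-velocity assumption.
    radius_m is a crude stand-in for the combined object extents."""
    pos_a, vel_a = np.asarray(pos_a, float), np.asarray(vel_a, float)
    pos_b, vel_b = np.asarray(pos_b, float), np.asarray(vel_b, float)
    for t in np.arange(0.0, horizon_s + dt, dt):
        if np.linalg.norm((pos_a + vel_a * t) - (pos_b + vel_b * t)) < radius_m:
            return True
    return False

# Example: two vehicles on a converging course (positions in m, velocities in m/s).
print(will_collide(pos_a=(0.0, 0.0), vel_a=(5.0, 0.0),
                   pos_b=(20.0, 3.0), vel_b=(0.0, -1.0)))  # True
```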
Evaluation is performed using:
- Closed-ended accuracy for binary and categorical queries.
- Tolerance-based accuracy (TAcc): Proportion of numeric predictions within [75%, 125%] of ground truth.
- Mean absolute error (MAE): Average absolute error for each numeric prediction dimension (a sketch of both numeric metrics follows this list).
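A minimal sketch of the two numeric metrics, assuming predictions have already been parsed from model outputs; the function names are illustrative, and the benchmark may aggregate these per question category:

```python
import numpy as np

def tolerance_accuracy(preds, gts, lower=0.75, upper=1.25):
    """Fraction of numeric predictions within [75%, 125%] of the ground truth."""
    preds, gts = np.asarray(preds, float), np.asarray(gts, float)
    within = (preds >= lower * gts) & (preds <= upper * gts)
    return float(within.mean())

def mean_absolute_error(preds, gts):
    """Average absolute error of numeric predictions."""
    preds, gts = np.asarray(preds, float), np.asarray(gts, float)
    return float(np.abs(preds - gts).mean())

# Example: distance predictions (m) against LiDAR-derived ground truth.
print(tolerance_accuracy([9.0, 30.0, 4.8], [10.0, 20.0, 5.0]))   # ~0.667
print(mean_absolute_error([9.0, 30.0, 4.8], [10.0, 20.0, 5.0]))  # ~3.73
```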
5. Vision-Language Model Baselines and Analysis
The benchmark includes extensive evaluation of both general-purpose and spatially-enhanced VLMs:
- LLaVA-v1.6 (Mistral-7B, Vicuna-7B/13B/34B), Llama-3.2-11B-Vision-Instruct, BLIP-2-Flan-T5-XL, Qwen2.5-VL-7B-Instruct, DeepSeek-VL2-Tiny, and SpatialRGPT (spatially fine-tuned with synthetic spatial data + CLIP-L encoder).
Notable empirical findings:
- Qualitative spatial understanding (accuracy ≈50–60%) remains modest across all baselines; spatially-enhanced VLMs, particularly SpatialRGPT, excel in size-based binary questions.
- Quantitative tasks are significantly more difficult (TAcc <20% for most models on distance/offset, with higher scores in dimension estimation for certain models).
- Spatial enhancement confers improvements for qualitative tasks (especially size/shape), but not for numeric estimation.
- Direct geometric reasoning is consistently more challenging than situational/semantic reasoning: most models achieve over 73% accuracy on situational reasoning but only roughly 41–58% on direct spatial queries.
An ablation study shows that backbone choice (Mistral-7B outperforms Vicuna-7B), parameter count, and prompting strategy all influence performance; chain-of-thought (CoT) prompting generally degrades spatial-reasoning accuracy by 6–13 percentage points.
| Model | Qualitative Accuracy (%) | Quantitative TAcc (%) | Direct Reasoning Accuracy (%) | Situational Reasoning Accuracy (%) |
|---|---|---|---|---|
| LLaVA-v1.6 (Mistral-7B) | 53.3 | 35.5 | 48.5 | 73.5 |
| Llama-3.2-11B | 54.3 | 16.1 | 41.8 | 37.3 |
| Qwen2.5-VL-7B | 58.0 | 19.4 | 58.2 | 84.1 |
| SpatialRGPT | 59.8 | 14.6 | 45.5 | 80.8 |
6. Error Patterns, Challenges, and Underlying Model Limitations
Analysis of error modes reveals:
- Confusion in distinguishing axes (front/behind versus above/below) in cluttered, occluded scenes.
- High MAE and instability in numeric outputs, particularly for dimensions with high variance (width, height).
- Substantial challenges with long-range metric inference (e.g., distances >20 m), multi-object comparatives, and angle estimations (errors exceeding 100° MAE in some instances).
Underlying causes include the absence of dense, metric spatial supervision during pretraining, inductive bias from prompts or LLM reasoning, and insufficient extrapolation capability given existing spatially-enhanced fine-tuning regimes.
7. Implications, Comparative Context, and Future Directions
NuScenes-SpatialQA provides, for the first time, a comprehensive ground-truth, LiDAR-based QA testbed for spatially-oriented VLMs in urban driving. General VLMs suffer pronounced deficiencies in quantitative spatial inference, while current spatially-enhanced models only marginally improve qualitative or categorical tasks. The relative tractability of situational reasoning likely reflects reliance on pretrained commonsense or world-knowledge, rather than direct metric geometry.
Future work suggested by these findings includes:
- Incorporation of metric spatial objectives (e.g., 3D contrastive learning) into pretraining,
- Development of spatially-guided or geometric step-wise prompting,
- Extension of evaluation to broader traffic domains (highway, rural, adverse weather),
- Integration of explicit 3D scene representation modules (e.g., BEV encoders) within multi-modal foundation models.
NuScenes-SpatialQA establishes a new paradigm for rigorous, fine-grained assessment of spatial reasoning in vision-language systems, and directly addresses limitations highlighted in related graph-structured spatial representations (e.g., nuScenes Knowledge Graph (Mlodzian et al., 2023)), laying a robust groundwork for the next generation of spatially competent, physically grounded VLMs in autonomous systems.