NuScenes-SpatialQA: Spatial Reasoning in Driving
- NuScenes-SpatialQA is a benchmark that provides ground-truth-based QA for assessing the spatial reasoning of vision-language models in autonomous driving scenarios, using LiDAR-based 3D annotations as ground truth.
- It employs an automated 3D scene graph pipeline with task-driven QA templates to evaluate both qualitative and quantitative spatial attributes.
- The benchmark highlights challenges in numeric spatial inference and direct geometric reasoning, guiding improvements in vision-language models.
NuScenes-SpatialQA is a large-scale, ground-truth-based question answering (QA) benchmark designed specifically to evaluate the spatial understanding and reasoning capabilities of vision-language models (VLMs) in real-world autonomous driving scenarios. Built upon the nuScenes dataset, NuScenes-SpatialQA leverages a fully automated 3D scene graph pipeline and an extensive set of task-driven QA templates to systematically probe both qualitative and quantitative facets of spatial understanding, as well as the multi-hop and situational reasoning critical to autonomous driving decision-making. Unlike prior benchmarks that focus on toy or synthetic indoor scenes, or that rely on monocular depth estimation, NuScenes-SpatialQA exploits precise LiDAR-based ground truth to provide unambiguous, metric-accurate supervision for spatial relations, distances, and object sizes, filling a long-standing gap in the evaluation of spatially-aware VLMs (Tian et al., 4 Apr 2025).
1. Objectives and Scope
NuScenes-SpatialQA targets two core objectives: (1) to provide the first large-scale, ground-truth-driven QA benchmark for systematically evaluating VLMs on spatial understanding and reasoning in urban driving environments, and (2) to focus on scene-centric spatial tasks directly relevant to critical driving decisions, such as navigation, obstacle negotiation, and occlusion handling.
The benchmark addresses deficiencies in conventional VQA datasets, which either lack driving-specific spatial question types, rely on noisy or depth-biased estimates, or are limited to non-driving domains. By leveraging the nuScenes dataset's LiDAR annotations, NuScenes-SpatialQA eliminates depth-estimation bias and supports both qualitative (categorical/relational) and quantitative (metric, numeric) QA tasks, as well as multi-hop reasoning.
2. Dataset Construction and Scene Graph Pipeline
NuScenes-SpatialQA is built on the validation split of nuScenes trainval-v1.0, yielding 150 scenes × 40 keyframes (sampled at 2 Hz), with each keyframe captured by six cameras and annotated with precise 3D bounding boxes.
The automated 3D scene graph construction pipeline proceeds as follows:
- Each ground-truth 3D object is projected onto all available camera views (using nuscenes-devkit). Objects are filtered by minimum visibility and 2D bounding box size.
- All appearances of a distinct instance across keyframes are grouped, and the best single crop is selected via a metric combining 2D bounding box area and brightness.
- Concise attribute-centric captions (e.g., "red sedan with black roof") are synthesized using LLaMA-3.2 with structured prompting.
- Scene graph nodes are identified by unique instance_token, and include attributes for category_name, translation (in ego-vehicle coordinates), dimensions, and caption.
- Scene graph edges capture spatial relations: Euclidean distance, longitudinal and lateral offsets relative to the ego vehicle's heading, and the relative bearing angle θ = atan2(lateral, longitudinal), expressed in degrees (see the sketch below).
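A minimal sketch of how these edge attributes could be computed, assuming the object and ego positions (and the ego yaw) have already been read from the nuScenes annotations; the function name and frame conventions are illustrative, not the benchmark's actual code:

```python
import numpy as np

def ego_frame_relations(obj_xy, ego_xy, ego_yaw):
    """Illustrative computation of the edge attributes described above.

    obj_xy, ego_xy : (x, y) positions in the global nuScenes frame.
    ego_yaw        : ego heading in radians (e.g., from the ego pose quaternion).
    Returns Euclidean distance, longitudinal/lateral offsets, and bearing angle.
    """
    delta = np.asarray(obj_xy, dtype=float) - np.asarray(ego_xy, dtype=float)

    # Rotate the offset into the ego frame: longitudinal points along the
    # heading, lateral points to the ego vehicle's left (assumed convention).
    cos_y, sin_y = np.cos(ego_yaw), np.sin(ego_yaw)
    longitudinal = cos_y * delta[0] + sin_y * delta[1]
    lateral = -sin_y * delta[0] + cos_y * delta[1]

    distance = float(np.hypot(longitudinal, lateral))
    # Relative bearing: theta = atan2(lateral, longitudinal), in degrees.
    bearing_deg = float(np.degrees(np.arctan2(lateral, longitudinal)))
    return {"distance": distance,
            "longitudinal": float(longitudinal),
            "lateral": float(lateral),
            "bearing_deg": bearing_deg}
```

Under this assumed convention, a positive lateral offset places the object to the ego vehicle's left, and the bearing angle is zero for an object directly ahead.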
3. QA Generation, Coverage, and Categories
NuScenes-SpatialQA features an automated QA generation pipeline, utilizing more than 35 reusable templates spanning spatial understanding and reasoning. The dataset contains approximately 3.5 million QA pairs:
- Qualitative spatial understanding (≈2.5M): Binary yes/no comparisons of position (above/below, left/right, front/behind) and size (larger/smaller, taller/shorter, wider/thinner, longer/shorter).
- Quantitative spatial understanding (≈0.6M): Numeric requests for absolute distances, offsets, object dimensions, and relative angles.
- Direct and situational spatial reasoning (≈0.2M): Multi-choice or binary reasoning about proximity, presence in defined regions, and scenario-based inference (e.g., collision prediction, occlusion, or parking fit).
QA generation operates over roughly 6,000 keyframes (150 scenes × 40 keyframes), each with six camera views, ensuring broad coverage and redundancy for both frequent and rare object configurations.
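The templates themselves are not reproduced here, so the following is a hypothetical illustration of template-based QA generation over scene graph nodes and edges, reusing the ego-frame relations from the previous sketch; the template wording, field names, and left/right convention are assumptions rather than the authors' exact implementation:

```python
import numpy as np

# Hypothetical templates; the benchmark's >35 actual templates may differ.
QUALITATIVE_TEMPLATE = "Is the {a} to the left of the {b}?"
QUANTITATIVE_TEMPLATE = "What is the distance in meters between the {a} and the {b}?"

def make_qa_pairs(node_a, node_b, rel_a, rel_b):
    """Generate one qualitative and one quantitative QA pair for two objects.

    node_a, node_b : scene graph nodes with a "caption" attribute.
    rel_a, rel_b   : ego-frame relations of each node (see previous sketch).
    """
    # Left/right decided by comparing lateral offsets in the ego frame.
    qual_answer = "yes" if rel_a["lateral"] > rel_b["lateral"] else "no"
    # Metric answer from ground-truth positions; no depth estimation involved.
    dist_ab = float(np.hypot(rel_a["longitudinal"] - rel_b["longitudinal"],
                             rel_a["lateral"] - rel_b["lateral"]))
    return [
        {"type": "qualitative",
         "question": QUALITATIVE_TEMPLATE.format(a=node_a["caption"], b=node_b["caption"]),
         "answer": qual_answer},
        {"type": "quantitative",
         "question": QUANTITATIVE_TEMPLATE.format(a=node_a["caption"], b=node_b["caption"]),
         "answer": round(dist_ab, 1)},
    ]
```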
4. Task Structure, Question Taxonomy, and Evaluation
NuScenes-SpatialQA evaluates VLMs along multiple axes:
- Relative positioning: Queries about categorical spatial relations, e.g., "Is the pedestrian to the left of the red car?"
- Size comparison: Pairwise/comparative questions about volume or specific dimensions.
- Distance and offset estimation: Numeric questions requiring precise metric inference, such as determining the Euclidean distance or relative offset in the ego-vehicle frame.
- Dimension and angle measurement: Extraction of object size or orientation, e.g., "What is the bearing angle of the truck relative to the car?"
- Counting and selection: Identification of nearest, farthest, or specific objects from a candidate set.
- Scenario-based inference: Relational or temporal/logical queries, such as "Will these objects collide in five seconds given specific velocities?" (a sketch of such a ground-truth check follows this list).
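For scenario-based questions, the ground-truth answer can be derived geometrically from annotated positions and velocities. The following is an illustrative constant-velocity collision check, not the authors' exact formulation; the time step, horizon, and collision radius are assumed values:

```python
import numpy as np

def will_collide(pos_a, vel_a, pos_b, vel_b, horizon_s=5.0, radius_m=2.0, dt=0.1):
    """Answer a question like 'Will these objects collide in five seconds?'
    by rolling both objects forward under a constant-velocity assumption.
    radius_m is a crude stand-in for the combined object extents."""
    pos_a, vel_a = np.asarray(pos_a, float), np.asarray(vel_a, float)
    pos_b, vel_b = np.asarray(pos_b, float), np.asarray(vel_b, float)
    for t in np.arange(0.0, horizon_s + dt, dt):
        if np.linalg.norm((pos_a + vel_a * t) - (pos_b + vel_b * t)) < radius_m:
            return True
    return False

# Example: two vehicles on a converging course (positions in m, velocities in m/s).
print(will_collide(pos_a=(0.0, 0.0), vel_a=(5.0, 0.0),
                   pos_b=(20.0, 3.0), vel_b=(0.0, -1.0)))  # True
```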
Evaluation is performed using:
- Closed-ended accuracy for binary and categorical queries.
- Tolerance-based accuracy (TAcc): Proportion of numeric predictions within [75%, 125%] of ground truth.
- Mean absolute error (MAE): Average absolute error for each numeric prediction dimension (a sketch of both numeric metrics follows this list).
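A minimal sketch of the two numeric metrics, assuming predictions have already been parsed from model outputs; the function names are illustrative, and the benchmark may aggregate these per question category:

```python
import numpy as np

def tolerance_accuracy(preds, gts, lower=0.75, upper=1.25):
    """Fraction of numeric predictions within [75%, 125%] of the ground truth."""
    preds, gts = np.asarray(preds, float), np.asarray(gts, float)
    within = (preds >= lower * gts) & (preds <= upper * gts)
    return float(within.mean())

def mean_absolute_error(preds, gts):
    """Average absolute error of numeric predictions."""
    preds, gts = np.asarray(preds, float), np.asarray(gts, float)
    return float(np.abs(preds - gts).mean())

# Example: distance predictions (m) against LiDAR-derived ground truth.
print(tolerance_accuracy([9.0, 30.0, 4.8], [10.0, 20.0, 5.0]))   # ~0.667
print(mean_absolute_error([9.0, 30.0, 4.8], [10.0, 20.0, 5.0]))  # ~3.73
```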
5. Vision-Language Model Baselines and Analysis
The benchmark includes extensive evaluation of both general-purpose and spatially-enhanced VLMs:
- LLaVA-v1.6 (Mistral-7B, Vicuna-7B/13B/34B), Llama-3.2-11B-Vision-Instruct, BLIP-2-Flan-T5-XL, Qwen2.5-VL-7B-Instruct, DeepSeek-VL2-Tiny, and SpatialRGPT (spatially fine-tuned with synthetic spatial data + CLIP-L encoder).
Notable empirical findings:
- Qualitative spatial understanding (accuracy ≈50–60%) remains modest across all baselines; spatially-enhanced VLMs, particularly SpatialRGPT, excel in size-based binary questions.
- Quantitative tasks are significantly more difficult (TAcc <20% for most models on distance/offset, with higher scores in dimension estimation for certain models).
- Spatial enhancement confers improvements for qualitative tasks (especially size/shape), but not for numeric estimation.
- Direct geometric reasoning is consistently more challenging than situational/semantic reasoning: most models achieve over 73% accuracy on situational reasoning but only roughly 41–58% on direct spatial queries.
An ablation study shows that backbone choice (Mistral-7B outperforms Vicuna-7B), parameter count, and prompting strategy all influence performance; chain-of-thought (CoT) prompting generally degrades spatial-reasoning accuracy by 6–13 percentage points.
| Model | Qualitative Accuracy (%) | Quantitative TAcc (%) | Direct Reasoning Accuracy (%) | Situational Reasoning Accuracy (%) |
|---|---|---|---|---|
| LLaVA-v1.6 (Mistral-7B) | 53.3 | 35.5 | 48.5 | 73.5 |
| Llama-3.2-11B | 54.3 | 16.1 | 41.8 | 37.3 |
| Qwen2.5-VL-7B | 58.0 | 19.4 | 58.2 | 84.1 |
| SpatialRGPT | 59.8 | 14.6 | 45.5 | 80.8 |
6. Error Patterns, Challenges, and Underlying Model Limitations
Analysis of error modes reveals:
- Confusion in distinguishing axes (front/behind versus above/below) in cluttered, occluded scenes.
- High MAE and instability in numeric outputs, particularly for dimensions with high variance (width, height).
- Substantial challenges with long-range metric inference (e.g., distances >20 m), multi-object comparatives, and angle estimations (errors exceeding 100° MAE in some instances).
Underlying causes include the absence of dense, metric spatial supervision during pretraining, inductive bias from prompts or LLM reasoning, and insufficient extrapolation capability given existing spatially-enhanced fine-tuning regimes.
7. Implications, Comparative Context, and Future Directions
NuScenes-SpatialQA provides, for the first time, a comprehensive ground-truth, LiDAR-based QA testbed for spatially-oriented VLMs in urban driving. General VLMs suffer pronounced deficiencies in quantitative spatial inference, while current spatially-enhanced models only marginally improve qualitative or categorical tasks. The relative tractability of situational reasoning likely reflects reliance on pretrained commonsense or world-knowledge, rather than direct metric geometry.
Future work suggested by these findings includes:
- Incorporation of metric spatial objectives (e.g., 3D contrastive learning) into pretraining,
- Development of spatially-guided or geometric step-wise prompting,
- Extension of evaluation to broader traffic domains (highway, rural, adverse weather),
- Integration of explicit 3D scene representation modules (e.g., BEV encoders) within multi-modal foundation models.
NuScenes-SpatialQA establishes a new paradigm for rigorous, fine-grained assessment of spatial reasoning in vision-language systems, and directly addresses limitations highlighted in related graph-structured spatial representations (e.g., nuScenes Knowledge Graph (Mlodzian et al., 2023)), laying a robust groundwork for the next generation of spatially competent, physically grounded VLMs in autonomous systems.