
NuScenes-SpatialQA: Spatial Reasoning in Driving

  • NuScenes-SpatialQA is a benchmark that provides QA pairs grounded in LiDAR-based 3D annotations for assessing spatial reasoning in autonomous driving scenarios.
  • It employs an automated 3D scene graph pipeline with task-driven QA templates to evaluate both qualitative and quantitative spatial attributes.
  • The benchmark highlights challenges in numeric spatial inference and direct geometric reasoning, guiding improvements in vision-language models.

NuScenes-SpatialQA (SpatialQA-E) is a large-scale, ground-truth-based question answering (QA) benchmark designed to evaluate the spatial understanding and reasoning capabilities of Vision-Language Models (VLMs) in real-world autonomous driving scenarios. Built upon the nuScenes dataset, NuScenes-SpatialQA leverages a fully automated 3D scene graph pipeline and an extensive set of task-driven QA templates to systematically probe both qualitative and quantitative facets of spatial understanding, as well as multi-hop and situational reasoning critical to autonomous driving decision-making. Unlike prior benchmarks that focus on toy or synthetic indoor scenes, or that rely on monocular depth estimation, NuScenes-SpatialQA exploits precise LiDAR-based ground truth to provide unambiguous, metric-accurate supervision for spatial relations, distances, and object sizes, filling a long-standing gap in the evaluation of spatially-aware VLMs (Tian et al., 4 Apr 2025).

1. Objectives and Scope

NuScenes-SpatialQA targets two core objectives: (1) to provide the first large-scale, ground-truth-driven QA benchmark for systematically evaluating VLMs on spatial understanding and reasoning in urban driving environments, and (2) to focus on scene-centric spatial tasks directly relevant to critical driving decisions, such as navigation, obstacle negotiation, and occlusion handling.

The benchmark addresses deficiencies in conventional VQA datasets, which either lack driving-specific spatial question types, utilize noisy or depth-biased estimates, or are limited to non-driving domains. By leveraging the nuScenes dataset's LiDAR annotations, SpatialQA eliminates depth estimation biases and supports both qualitative (categorical/relational) and quantitative (metric, numeric) QA tasks spanning multi-hop reasoning.

2. Dataset Construction and Scene Graph Pipeline

SpatialQA-E is built on the validation split of nuScenes v1.0-trainval, yielding 150 scenes × 40 keyframes (annotated at 2 Hz), each captured by six cameras with precise 3D bounding boxes.
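A minimal loading sketch, assuming the standard nuscenes-devkit API and a placeholder dataroot, of how ground-truth boxes can be retrieved per camera view for each keyframe (the filtering and captioning steps of the pipeline below are not shown):

```python
from nuscenes.nuscenes import NuScenes
from nuscenes.utils.geometry_utils import BoxVisibility

CAMERAS = ['CAM_FRONT', 'CAM_FRONT_LEFT', 'CAM_FRONT_RIGHT',
           'CAM_BACK', 'CAM_BACK_LEFT', 'CAM_BACK_RIGHT']

# The dataroot below is a placeholder; point it at a local nuScenes v1.0-trainval copy.
nusc = NuScenes(version='v1.0-trainval', dataroot='/data/nuscenes', verbose=False)

sample = nusc.sample[0]  # one annotated keyframe (keyframes are sampled at 2 Hz)
for cam in CAMERAS:
    sd_token = sample['data'][cam]
    # Image path, ground-truth 3D boxes visible from this camera (expressed in the
    # camera frame), and the camera intrinsics.
    img_path, boxes, intrinsics = nusc.get_sample_data(
        sd_token, box_vis_level=BoxVisibility.ANY)
    for box in boxes:
        # box.name, box.center, and box.wlh correspond to the category, translation,
        # and dimensions attributes attached to scene graph nodes in the pipeline below.
        pass
```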

The automated 3D scene graph construction pipeline proceeds as follows:

  • Each ground-truth 3D object is projected onto all available camera views (using nuscenes-devkit). Objects are filtered by minimum visibility and 2D bounding box size.
  • All appearances of a distinct instance across keyframes are grouped, and the best single crop is selected via a metric combining 2D bounding box area and brightness.
  • Concise attribute-centric captions (e.g., "red sedan with black roof") are synthesized using LLaMA-3.2 with structured prompting.
  • Scene graph nodes are identified by unique instance_token, and include attributes for category_name, translation (in ego-vehicle coordinates), dimensions, and caption.
  • Scene graph edges capture spatial relations: Euclidean distance, longitudinal and lateral offsets relative to the ego vehicle's heading, and the relative bearing angle θ = atan2(y₂ − y₁, x₂ − x₁) in degrees (see the sketch after this list).
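A minimal sketch of the edge attributes listed above, assuming positions are given as (x, y) pairs in ego-vehicle coordinates; the function name and sign conventions are illustrative, not the benchmark's exact implementation:

```python
import numpy as np

def edge_attributes(obj_xy, ref_xy, ego_heading_rad=0.0):
    """Compute the spatial-relation attributes stored on a scene graph edge.

    obj_xy, ref_xy  : (x, y) object positions in ego-vehicle coordinates (metres).
    ego_heading_rad : ego heading used to split the offset into longitudinal and
                      lateral components (0 if the ego frame is heading-aligned).
    """
    dx, dy = obj_xy[0] - ref_xy[0], obj_xy[1] - ref_xy[1]

    # Euclidean distance between the two objects.
    distance = float(np.hypot(dx, dy))

    # Offsets relative to the ego vehicle's heading: project the displacement
    # onto the forward (longitudinal) and left (lateral) axes.
    cos_h, sin_h = np.cos(ego_heading_rad), np.sin(ego_heading_rad)
    longitudinal = dx * cos_h + dy * sin_h
    lateral = -dx * sin_h + dy * cos_h

    # Relative bearing angle theta = atan2(y2 - y1, x2 - x1), reported in degrees.
    bearing_deg = float(np.degrees(np.arctan2(dy, dx)))

    return {"distance": distance, "longitudinal": longitudinal,
            "lateral": lateral, "bearing_deg": bearing_deg}
```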

3. QA Generation, Coverage, and Categories

NuScenes-SpatialQA features an automated QA generation pipeline, utilizing more than 35 reusable templates spanning spatial understanding and reasoning. The dataset contains approximately 3.5 million QA pairs:

  • Qualitative spatial understanding (≈2.5M): Binary yes/no comparisons of position (above/below, left/right, front/behind) and size (larger/smaller, taller/shorter, wider/thinner, longer/shorter).
  • Quantitative spatial understanding (≈0.6M): Numeric requests for absolute distances, offsets, object dimensions, and relative angles.
  • Direct and situational spatial reasoning (≈0.2M): Multi-choice or binary reasoning about proximity, presence in defined regions, and scenario-based inference (e.g., collision prediction, occlusion, or parking fit).

QA generation operates over all 6,000 keyframes (150 scenes × 40 keyframes), each with six camera views, ensuring broad coverage and redundancy for both frequent and rare object configurations.
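The following sketch illustrates how such templates can be instantiated from scene graph edges; the template wording, field names, and answer conventions are hypothetical stand-ins, not the benchmark's actual templates:

```python
QUALITATIVE_TEMPLATE = "Is the {obj_a} to the left of the {obj_b}?"
QUANTITATIVE_TEMPLATE = "What is the distance in metres between the {obj_a} and the {obj_b}?"

def generate_qa(edge, tolerance=0.25):
    """Instantiate one qualitative and one quantitative QA pair from a scene graph edge.

    `edge` is assumed to hold the captions of both objects plus the lateral offset
    and Euclidean distance computed by the pipeline above.
    """
    a, b = edge["caption_a"], edge["caption_b"]

    qualitative = {
        "question": QUALITATIVE_TEMPLATE.format(obj_a=a, obj_b=b),
        # Sign convention (positive lateral offset = "left") is an assumption.
        "answer": "yes" if edge["lateral"] > 0 else "no",
    }
    quantitative = {
        "question": QUANTITATIVE_TEMPLATE.format(obj_a=a, obj_b=b),
        "answer": round(edge["distance"], 1),
        # Band later used for tolerance-based scoring ([75%, 125%] of ground truth).
        "tolerance_band": ((1 - tolerance) * edge["distance"],
                           (1 + tolerance) * edge["distance"]),
    }
    return qualitative, quantitative

example_edge = {"caption_a": "pedestrian", "caption_b": "red sedan",
                "lateral": 1.8, "distance": 12.4}
print(generate_qa(example_edge)[0]["question"])
# -> "Is the pedestrian to the left of the red sedan?"
```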

4. Task Structure, Question Taxonomy, and Evaluation

SpatialQA-E evaluates VLMs along multiple axes:

  • Relative positioning: Queries about categorical spatial relations, e.g., "Is the pedestrian to the left of the red car?"
  • Size comparison: Pairwise/comparative questions about volume or specific dimensions.
  • Distance and offset estimation: Numeric questions requiring precise metric inference, such as determining the Euclidean distance or relative offset in the ego-vehicle frame.
  • Dimension and angle measurement: Extraction of object size or orientation, e.g., “What is the bearing angle of the truck relative to the car?”
  • Counting and selection: Identification of nearest, farthest, or specific objects from a candidate set.
  • Scenario-based inference: Relational or temporal/logical queries, such as "Will these objects collide in five seconds given specific velocities?" (a constant-velocity sketch follows this list).
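A constant-velocity sketch of the collision-style scenario question; the 5-second horizon, 2 m collision radius, and time step are illustrative assumptions rather than the benchmark's actual thresholds:

```python
import numpy as np

def will_collide(pos_a, vel_a, pos_b, vel_b, horizon_s=5.0, radius_m=2.0, dt=0.1):
    """Extrapolate both objects at constant velocity and test for a near approach."""
    pos_a, vel_a = np.asarray(pos_a, float), np.asarray(vel_a, float)
    pos_b, vel_b = np.asarray(pos_b, float), np.asarray(vel_b, float)
    for t in np.arange(0.0, horizon_s + dt, dt):
        gap = np.linalg.norm((pos_a + t * vel_a) - (pos_b + t * vel_b))
        if gap < radius_m:
            return True   # rendered as the "yes" answer by a scenario template
    return False

# Two objects 20 m apart closing at 5 m/s meet well within the 5 s horizon.
print(will_collide([0, 0], [5, 0], [20, 0], [0, 0]))  # -> True
```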

Evaluation is performed using the following metrics (a scoring sketch follows the list):

  • Closed-ended accuracy for binary and categorical queries.
  • Tolerance-based accuracy (TAcc): Proportion of numeric predictions within [75%, 125%] of ground truth.
  • Mean absolute error (MAE): Average absolute error for each numeric prediction dimension.
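A minimal sketch of the two numeric metrics, assuming positive-valued ground truth (distances, dimensions); the function names are illustrative:

```python
import numpy as np

def tacc(predictions, ground_truth, low=0.75, high=1.25):
    """Tolerance-based accuracy: fraction of predictions within [75%, 125%] of ground truth."""
    predictions = np.asarray(predictions, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    within = (predictions >= low * ground_truth) & (predictions <= high * ground_truth)
    return float(within.mean())

def mae(predictions, ground_truth):
    """Mean absolute error over numeric predictions."""
    predictions = np.asarray(predictions, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    return float(np.abs(predictions - ground_truth).mean())

# Example: distance predictions (metres) against LiDAR-derived ground truth.
print(tacc([10.2, 31.0, 4.8], [9.5, 22.0, 5.0]))  # -> 0.67 (2 of 3 within tolerance)
print(mae([10.2, 31.0, 4.8], [9.5, 22.0, 5.0]))   # -> 3.3
```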

5. Vision-Language Model Baselines and Analysis

The benchmark includes extensive evaluation of both general-purpose and spatially-enhanced VLMs:

  • LLaVA-v1.6 (Mistral-7B, Vicuna-7B/13B/34B), Llama-3.2-11B-Vision-Instruct, BLIP-2-Flan-T5-XL, Qwen2.5-VL-7B-Instruct, Deepseek-vl2-tiny, and SpatialRGPT (spatially fine-tuned with synthetic spatial data + CLIP-L encoder).

Notable empirical findings:

  • Qualitative spatial understanding (accuracy ≈50–60%) remains modest across all baselines; spatially-enhanced VLMs, particularly SpatialRGPT, excel in size-based binary questions.
  • Quantitative tasks are significantly more difficult (TAcc <20% for most models on distance/offset, with higher scores in dimension estimation for certain models).
  • Spatial enhancement confers improvements for qualitative tasks (especially size/shape), but not for numeric estimation.
  • Direct geometric reasoning is consistently more challenging than situational/semantic reasoning—models achieve >73% on situational reasoning but only ~41–58% on direct spatial queries.

An ablation study shows that backbone choice (Mistral-7B outperforms Vicuna-7B), parameter count, and prompting strategy all influence performance; chain-of-thought (CoT) prompting generally degrades spatial-reasoning performance by 6–13 percentage points.

| Model | Qualitative Accuracy (%) | Quantitative TAcc (%) | Direct Reasoning TAcc (%) | Situational Reasoning TAcc (%) |
|---|---|---|---|---|
| LLaVA-v1.6 (Mistral-7B) | 53.3 | 35.5 | 48.5 | 73.5 |
| Llama-3.2-11B | 54.3 | 16.1 | 41.8 | 37.3 |
| Qwen2.5-VL-7B | 58.0 | 19.4 | 58.2 | 84.1 |
| SpatialRGPT | 59.8 | 14.6 | 45.5 | 80.8 |

6. Error Patterns, Challenges, and Underlying Model Limitations

Analysis of error modes reveals:

  • Confusion in distinguishing axes (front/behind versus above/below) in cluttered, occluded scenes.
  • High MAE and instability in numeric outputs, particularly for dimensions with high variance (width, height).
  • Substantial challenges with long-range metric inference (e.g., distances >20 m), multi-object comparatives, and angle estimations (errors exceeding 100° MAE in some instances).

Underlying causes include the absence of dense, metric spatial supervision during pretraining, inductive bias from prompts or LLM reasoning, and insufficient extrapolation capability given existing spatially-enhanced fine-tuning regimes.

7. Implications, Comparative Context, and Future Directions

NuScenes-SpatialQA provides, for the first time, a comprehensive ground-truth, LiDAR-based QA testbed for spatially-oriented VLMs in urban driving. General VLMs show pronounced deficiencies in quantitative spatial inference, while current spatially-enhanced models offer only marginal gains, confined to qualitative or categorical tasks. The relative tractability of situational reasoning likely reflects reliance on pretrained commonsense or world knowledge rather than direct metric geometry.

Future work suggested by these findings includes:

  • Incorporation of metric spatial objectives (e.g., 3D contrastive learning) into pretraining,
  • Development of spatially-guided or geometric step-wise prompting,
  • Extension of evaluation to broader traffic domains (highway, rural, adverse weather),
  • Integration of explicit 3D scene representation modules (e.g., BEV encoders) within multi-modal foundation models.

NuScenes-SpatialQA establishes a new paradigm for rigorous, fine-grained assessment of spatial reasoning in vision-language systems and directly addresses limitations highlighted by related graph-structured spatial representations (e.g., the nuScenes Knowledge Graph (Mlodzian et al., 2023)). It thereby lays robust groundwork for the next generation of spatially competent, physically grounded VLMs in autonomous systems.
