SpatialQA-E: Autonomous Driving Spatial QA
- SpatialQA-E is a ground-truth QA benchmark that systematically evaluates spatial reasoning in urban autonomous driving scenarios using an automated 3D scene-graph pipeline.
- It leverages the NuScenes dataset to generate approximately 3.3M qualitative, quantitative, and reasoning queries with deterministic evaluation metrics.
- The dataset provides actionable insights into VLM spatial understanding challenges, promoting advancements in fine-grained, object-level scene analysis.
NuScenes-SpatialQA (SpatialQA-E) is a large-scale, ground-truth–based question-answering benchmark designed to systematically evaluate the spatial understanding and reasoning capabilities of Vision-Language Models (VLMs) in autonomous driving contexts. Constructed atop the NuScenes dataset and relying on an automated 3D scene-graph and QA-template pipeline, SpatialQA-E focuses exclusively on evaluation, providing a single, exhaustively instantiated set of qualitative, quantitative, and multi-hop spatial reasoning questions grounded in urban driving scenarios (Tian et al., 4 Apr 2025).
1. Dataset Construction and Structure
SpatialQA-E is built using the validation split of the NuScenes trainval-v1.0 data. This encompasses 150 urban driving scenes, each containing 40 key-frames, resulting in 6,000 key-frames. For every frame, six calibrated camera views (front, rear, and sides) are included, yielding a total of 36,000 paired images.
From this set, approximately 3.3 million question-answer (QA) pairs are generated, comprising:
- ≈ 2.5 million qualitative spatial understanding QA
- ≈ 0.6 million quantitative spatial understanding QA
- ≈ 0.2 million spatial reasoning QA (direct and situational)
No additional train/val/test partition is provided; SpatialQA-E serves as a single evaluation benchmark.
3D Scene-Graph Generation Pipeline:
- Input: LiDAR-based 3D object annotations and calibrated camera parameters from NuScenes.
- Processing:
- 3D bounding boxes projected onto the 2D image plane using the official nuscenes-devkit script.
- Objects with 2D boxes < 40 px in width/height or non-maximal visibility are filtered out.
- All appearances of the same instance within a scene are grouped.
- For each group, the best crop is selected via area and brightness scoring, with 100 px padding applied.
- A concise noun-phrase caption for each crop is generated using LLaMA-3.2-Instruct.
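The filtering, grouping, and crop-selection steps above can be sketched as follows. This is a minimal illustration with hypothetical helper names; the scoring function combining area and brightness is an assumption, not the paper's exact formula.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    instance_token: str
    box_2d: tuple          # (x1, y1, x2, y2) in pixels, after 3D-to-2D projection
    mean_brightness: float # mean pixel intensity of the crop, 0-255

MIN_SIDE_PX = 40  # boxes narrower or shorter than this are filtered out

def is_valid(det: Detection) -> bool:
    x1, y1, x2, y2 = det.box_2d
    return (x2 - x1) >= MIN_SIDE_PX and (y2 - y1) >= MIN_SIDE_PX

def crop_score(det: Detection) -> float:
    """Illustrative area-and-brightness score; larger, brighter crops win."""
    x1, y1, x2, y2 = det.box_2d
    area = (x2 - x1) * (y2 - y1)
    return area * (det.mean_brightness / 255.0)

def best_crop_per_instance(detections):
    """Group all appearances of each instance, then keep the best-scoring crop."""
    groups = {}
    for det in filter(is_valid, detections):
        groups.setdefault(det.instance_token, []).append(det)
    return {tok: max(dets, key=crop_score) for tok, dets in groups.items()}
```

The selected crop would then receive the 100 px padding and be passed to the captioning model.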
Scene Graph Schema:
Nodes: Annotated with instance_token, category_name, 3D translation, size (length, width, height), and a generated caption.
Edges: For every ordered object pair, spatial_distance, longitudinal_offset, lateral_offset, and relative_bearing_angle are computed using standard metrics.
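A minimal sketch of the per-edge computation, assuming a bird's-eye ego frame in which x points forward and y points left (axis conventions are an assumption here, not stated in the schema):

```python
import math

def edge_metrics(p_a, p_b):
    """Pairwise spatial relations between two object centers (x, y).

    Returns the four edge attributes of the scene-graph schema, computed
    from object translations; a sketch under the stated axis assumptions.
    """
    dx, dy = p_b[0] - p_a[0], p_b[1] - p_a[1]
    return {
        "spatial_distance": math.hypot(dx, dy),          # Euclidean distance
        "longitudinal_offset": dx,                       # along the forward axis
        "lateral_offset": dy,                            # along the left-right axis
        "relative_bearing_angle": math.degrees(math.atan2(dy, dx)),
    }
```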
QA Generation Pipeline:
Operates per view, instantiating exhaustive question templates at two levels:
- Spatial Understanding (Qualitative/Quantitative)
- Spatial Reasoning (Direct/Situational)
- Expands over all valid object pairs and attributes, ensuring template and category balance.
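The exhaustive expansion over ordered object pairs can be sketched as below. The template strings and relation keys are illustrative placeholders, not the benchmark's actual 14 qualitative templates.

```python
from itertools import permutations

# Hypothetical qualitative templates keyed by the edge attribute they test.
TEMPLATES = {
    "left_of": "Is the {a} to the left of the {b}?",
    "front_of": "Is the {a} in front of the {b}?",
}

def instantiate_qa(nodes, edges):
    """Exhaustively instantiate templates over all ordered object pairs.

    `nodes` maps node id -> caption; `edges` maps (id_a, id_b) -> dict of
    boolean ground-truth flags. Names are illustrative assumptions.
    """
    qa_pairs = []
    for a, b in permutations(nodes, 2):
        rel = edges.get((a, b), {})
        for key, template in TEMPLATES.items():
            if key in rel:
                qa_pairs.append({
                    "question_text": template.format(a=nodes[a], b=nodes[b]),
                    "answer": "yes" if rel[key] else "no",
                    "involved_node_ids": [a, b],
                    "question_type": f"qualitative/{key}",
                })
    return qa_pairs
```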
2. Question and Answer Taxonomy
SpatialQA-E features a hierarchically organized taxonomy:
Level 1: Spatial Understanding
- Qualitative (≈ 2.5M QA): Binary or multiple-choice on relations such as above/below, left/right, front/behind, and comparative size (larger/smaller; longer/shorter; taller/shorter; wider/thinner).
- Example:
- Q: "Is the man in tan t-shirt and jeans to the left of the black sedan with red logo?"
- A: "yes"
- Quantitative (≈ 0.6M QA): Numeric open-ended on absolute or relative position, distance, offsets, dimensions, and angles.
- Example:
- Q: "What is the distance between the pedestrian and the white van?"
- A: "1.25" (the Euclidean distance between the two object centers, in meters, derived from the LiDAR-based 3D translations)
Level 2: Spatial Reasoning
- Direct Reasoning (≈ 0.1M QA): Multi-hop or multiple-choice questions, such as selecting the closest/largest object or querying existence under a constraint.
- Example:
- Q: "Which object is closest to the ego vehicle? (a) man in tan t-shirt; (b) cyclist on right; (c) black sedan; (d) white truck"
- A: "d"
- Situational Reasoning (≈ 0.1M QA): Contextual, yes/no queries under additional dynamic or environmental assumptions (e.g. collision prediction).
- Example:
- Q: "Assume the distance between cyclist and pedestrian is decreasing at 2 m/s. Will they collide within 5 seconds?"
- A: "yes"
Templates are applied exhaustively to all suitable object pairs, resulting in balanced coverage of all qualitative (14 templates) and quantitative (10 templates) categories.
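The situational example above has a deterministic ground truth: under a constant closing speed, the answer follows from whether the gap reaches zero within the horizon. A minimal sketch of that answer rule (function name assumed for illustration):

```python
def will_collide(distance_m: float, closing_speed_mps: float, horizon_s: float) -> str:
    """Ground-truth answer for the situational collision template: if the gap
    is closing at a constant rate, does it reach zero within the horizon?"""
    if closing_speed_mps <= 0:
        return "no"  # gap is constant or growing
    return "yes" if distance_m / closing_speed_mps <= horizon_s else "no"
```

For instance, a 2 m/s closing speed over a 5 s horizon yields "yes" only when the initial gap is at most 10 m.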
3. Annotation Schema and Data Format
SpatialQA-E is distributed as per-key-frame-view JSON files, structured as follows:
- scene_id: NuScenes internal scene token
- frame_index and camera_name: Precise frame and view identity
- nodes: Objects with attributes (category_name, translation, size, caption)
- edges: Pairwise spatial relations (spatial_distance, offsets, bearing)
- qa_pairs:
- question_id
- question_text
- answer (string or numeric)
- involved_node_ids
- question_type
- ground_truth_value (for quantitative)
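A record following this schema can be loaded with a few lines of standard JSON handling. The field names below match the schema; the values are made-up illustrations, not real dataset entries.

```python
import json

# Illustrative per-key-frame-view record; values are hypothetical.
record_json = """
{
  "scene_id": "scene-0001",
  "frame_index": 12,
  "camera_name": "CAM_FRONT",
  "nodes": [{"instance_token": "n1",
             "category_name": "human.pedestrian.adult",
             "translation": [1.0, 2.0, 0.0],
             "size": [0.6, 0.5, 1.8],
             "caption": "man in tan t-shirt"}],
  "edges": [],
  "qa_pairs": [{"question_id": "q1",
                "question_text": "What is the distance between ...?",
                "answer": "1.25",
                "involved_node_ids": ["n1"],
                "question_type": "quantitative/distance",
                "ground_truth_value": 1.25}]
}
"""

record = json.loads(record_json)
# Quantitative questions carry a numeric ground_truth_value for tolerance checks.
numeric_qa = [q for q in record["qa_pairs"]
              if q["question_type"].startswith("quantitative")]
```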
Annotation adheres to rigorous guidelines:
- Only fully visible objects (visibility = 4) and bounding boxes ≥ 40×40 px
- Concise noun phrases for captions
- All spatial values derived directly from LiDAR data
- No human re-labeling or LLM-based re-scoring is performed; evaluation is purely deterministic
4. Dataset Statistics and Analytical Features
SpatialQA-E is characterized by exhaustive, template-driven coverage:
- Scale: 150 scenes, 6,000 frames, 36,000 images, ≈ 3.3 million QA pairs
- Distribution:
- Qualitative (76%)
- Quantitative (18%)
- Spatial Reasoning (6%)
- Answer Types:
- Binary yes/no: ≈ 60%
- Multiple-choice: ≈ 20%
- Numeric: ≈ 20%
- Balancing: Template expansion and filtering enforce uniform coverage per category; average question length is 8–12 tokens
- Filtering removes occluded or small objects, improving QA signal quality
A plausible implication is that this balancing strategy ensures diagnostic comprehensiveness and prevents under-representation of rare spatial relations.
5. Evaluation Protocol
Evaluation is strictly deterministic, relying on direct comparison of strings (for categorical QA) and tolerance-based checks (for numeric QA):
- Closed-ended (binary/multiple-choice):
- Exact-match Accuracy: Proportion of predictions identical to the ground-truth string
- Quantitative (open-ended numeric):
- Tolerance-based Accuracy: Proportion of answers within a fixed tolerance of the ground truth
- Mean Absolute Error (MAE): Mean of |prediction − ground truth| over all numeric QA pairs
- No LLM-based or subjective scoring is incorporated, ensuring result reproducibility and eliminating ambiguity.
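The full protocol fits in a few lines. This is a minimal sketch; the 10% relative tolerance used below is an assumed value, since the benchmark's exact tolerance is not specified here.

```python
def exact_match_accuracy(preds, golds):
    """Closed-ended QA: case-insensitive exact string match."""
    hits = sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(preds, golds))
    return hits / len(golds)

def tolerance_accuracy(preds, golds, rel_tol=0.1):
    """Numeric QA: fraction of predictions within rel_tol (assumed 10%
    relative tolerance) of the ground-truth value."""
    hits = sum(abs(p - g) <= rel_tol * abs(g) for p, g in zip(preds, golds))
    return hits / len(golds)

def mean_absolute_error(preds, golds):
    """Numeric QA: mean of |prediction - ground truth|."""
    return sum(abs(p - g) for p, g in zip(preds, golds)) / len(golds)
```

Because every metric is a pure function of predictions and ground truth, repeated evaluations of the same outputs always produce identical scores.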
6. Applications, Use Cases, and Limitations
SpatialQA-E is intended for benchmarking and diagnosis of spatial reasoning in VLMs relevant to autonomous driving. Key use cases include:
- Systematic evaluation of spatial understanding modules in VLMs
- Fine-tuning and developing specialized spatial mechanisms
- Studying inter-camera consistency of spatial reasoning
- Analyzing safe-driving performance in proximity and hazard scenarios
Limitations:
- Scene diversity restricted to urban driving in Boston/Singapore with fair weather
- Excludes occluded or small objects (< 40 px), limiting long-range perception assessment
- Language output is template-constrained (no paraphrase or free-text answers)
- Answers are limited to yes/no, multi-choice, or single numeric value
- Serves only as a held-out evaluation set: no train/val/test split is provided, so it is not suitable for supervised pretraining
A plausible implication is that, while robust for evaluation, SpatialQA-E may be less effective for tasks demanding broader environmental variability or linguistic diversity.
7. Context and Significance within Spatial Reasoning Evaluation
SpatialQA-E provides the first large-scale, ground-truth–based QA benchmark with systematic 3D spatial reasoning coverage for real-world driving scenarios, uniquely enabled by multi-view, object-level scene graphs and rigorous, LiDAR-grounded question/answer generation. This approach eliminates reliance on subjective or heuristic annotations and is designed to drive progress in spatial capabilities of VLMs for safety-critical domains (Tian et al., 4 Apr 2025). Early experiments reveal that even spatial-enhanced VLMs exhibit pronounced quantitative reasoning limitations relative to qualitative performance, highlighting open research challenges in fine-grained spatial scene understanding.