SpatialQA-E: Autonomous Driving Spatial QA
- SpatialQA-E is a ground-truth QA benchmark that systematically evaluates spatial reasoning in urban autonomous driving scenarios using an automated 3D scene-graph pipeline.
- It leverages the NuScenes dataset to generate approximately 3.3M qualitative, quantitative, and reasoning queries with deterministic evaluation metrics.
- The dataset provides actionable insights into VLM spatial understanding challenges, promoting advancements in fine-grained, object-level scene analysis.
NuScenes-SpatialQA (SpatialQA-E) is a large-scale, ground-truth–based question-answering benchmark designed to systematically evaluate the spatial understanding and reasoning capabilities of Vision-Language Models (VLMs) in autonomous driving contexts. Constructed atop the NuScenes dataset and relying on an automated 3D scene-graph and QA-template pipeline, SpatialQA-E focuses exclusively on evaluation, providing a single, exhaustively instantiated set of qualitative, quantitative, and multi-hop spatial reasoning questions grounded in urban driving scenarios (Tian et al., 4 Apr 2025).
1. Dataset Construction and Structure
SpatialQA-E is built using the validation split of the NuScenes trainval-v1.0 data. This encompasses 150 urban driving scenes, each containing 40 key-frames, resulting in 6,000 key-frames. For every frame, six calibrated camera views (front, rear, and sides) are included, yielding a total of 36,000 paired images.
From this set, approximately 3.3 million question-answer (QA) pairs are generated, comprising:
- ≈ 2.5 million qualitative spatial understanding QA
- ≈ 0.6 million quantitative spatial understanding QA
- ≈ 0.2 million spatial reasoning QA (direct and situational)
No additional train/val/test partition is provided; SpatialQA-E serves as a single evaluation benchmark.
3D Scene-Graph Generation Pipeline:
- Input: LiDAR-based 3D object annotations and calibrated camera parameters from NuScenes.
- Processing:
- 3D bounding boxes projected onto the 2D image plane using the official nuscenes-devkit script.
- Objects with 2D boxes < 40 px in width/height or non-maximal visibility are filtered out.
- All appearances of the same instance within a scene are grouped.
- For each group, the best crop is selected via area and brightness scoring, with 100 px padding applied.
- A concise noun-phrase caption for each crop is generated using LLaMA-3.2-Instruct.
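The filtering, grouping, and crop-selection steps above can be sketched as follows. This is a minimal illustration with hypothetical helper names; the scoring function combining area and brightness is an assumption, not the paper's exact formula.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    instance_token: str
    box_2d: tuple          # (x1, y1, x2, y2) in pixels, after 3D-to-2D projection
    mean_brightness: float # mean pixel intensity of the crop, 0-255

MIN_SIDE_PX = 40  # boxes narrower or shorter than this are filtered out

def is_valid(det: Detection) -> bool:
    x1, y1, x2, y2 = det.box_2d
    return (x2 - x1) >= MIN_SIDE_PX and (y2 - y1) >= MIN_SIDE_PX

def crop_score(det: Detection) -> float:
    """Illustrative area-and-brightness score; larger, brighter crops win."""
    x1, y1, x2, y2 = det.box_2d
    area = (x2 - x1) * (y2 - y1)
    return area * (det.mean_brightness / 255.0)

def best_crop_per_instance(detections):
    """Group all appearances of each instance, then keep the best-scoring crop."""
    groups = {}
    for det in filter(is_valid, detections):
        groups.setdefault(det.instance_token, []).append(det)
    return {tok: max(dets, key=crop_score) for tok, dets in groups.items()}
```

The selected crop would then receive the 100 px padding and be passed to the captioning model.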
Scene Graph Schema:
Nodes: Annotated with instance_token, category_name, 3D translation, size (length, width, height), and a generated caption.
Edges: For every ordered object pair, spatial_distance, longitudinal_offset, lateral_offset, and relative_bearing_angle are computed using standard metrics.
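A minimal sketch of the per-edge computation, assuming a bird's-eye ego frame in which x points forward and y points left (axis conventions are an assumption here, not stated in the schema):

```python
import math

def edge_metrics(p_a, p_b):
    """Pairwise spatial relations between two object centers (x, y).

    Returns the four edge attributes of the scene-graph schema, computed
    from object translations; a sketch under the stated axis assumptions.
    """
    dx, dy = p_b[0] - p_a[0], p_b[1] - p_a[1]
    return {
        "spatial_distance": math.hypot(dx, dy),          # Euclidean distance
        "longitudinal_offset": dx,                       # along the forward axis
        "lateral_offset": dy,                            # along the left-right axis
        "relative_bearing_angle": math.degrees(math.atan2(dy, dx)),
    }
```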
QA Generation Pipeline:
Operates per view, instantiating exhaustive question templates at two levels:
- Spatial Understanding (Qualitative/Quantitative)
- Spatial Reasoning (Direct/Situational)
- Expands over all valid object pairs and attributes, ensuring template and category balance.
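The exhaustive expansion over ordered object pairs can be sketched as below. The template strings and relation keys are illustrative placeholders, not the benchmark's actual 14 qualitative templates.

```python
from itertools import permutations

# Hypothetical qualitative templates keyed by the edge attribute they test.
TEMPLATES = {
    "left_of": "Is the {a} to the left of the {b}?",
    "front_of": "Is the {a} in front of the {b}?",
}

def instantiate_qa(nodes, edges):
    """Exhaustively instantiate templates over all ordered object pairs.

    `nodes` maps node id -> caption; `edges` maps (id_a, id_b) -> dict of
    boolean ground-truth flags. Names are illustrative assumptions.
    """
    qa_pairs = []
    for a, b in permutations(nodes, 2):
        rel = edges.get((a, b), {})
        for key, template in TEMPLATES.items():
            if key in rel:
                qa_pairs.append({
                    "question_text": template.format(a=nodes[a], b=nodes[b]),
                    "answer": "yes" if rel[key] else "no",
                    "involved_node_ids": [a, b],
                    "question_type": f"qualitative/{key}",
                })
    return qa_pairs
```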
2. Question and Answer Taxonomy
SpatialQA-E features a hierarchically organized taxonomy:
Level 1: Spatial Understanding
- Qualitative (≈ 2.5M QA): Binary or multiple-choice on relations such as above/below, left/right, front/behind, and comparative size (larger/smaller; longer/shorter; taller/shorter; wider/thinner).
- Example:
- Q: "Is the man in tan t-shirt and jeans to the left of the black sedan with red logo?"
- A: "yes"
- Quantitative (≈ 0.6M QA): Numeric open-ended on absolute or relative position, distance, offsets, dimensions, and angles.
- Example:
- Q: "What is the distance between the pedestrian and the white van?"
- A: "1.25" (the Euclidean distance between the two object centers, in meters, derived from the LiDAR-based 3D translations)
Level 2: Spatial Reasoning
- Direct Reasoning (≈ 0.1M QA): Multi-hop or multiple-choice questions, such as selecting the closest/largest object or querying existence under a constraint.
- Example:
- Q: "Which object is closest to the ego vehicle? (a) man in tan t-shirt; (b) cyclist on right; (c) black sedan; (d) white truck"
- A: "d"
- Situational Reasoning (≈ 0.1M QA): Contextual, yes/no queries under additional dynamic or environmental assumptions (e.g. collision prediction).
- Example:
- Q: "Assume the distance between cyclist and pedestrian is decreasing at 2 m/s. Will they collide within 5 seconds?"
- A: "yes"
Templates are applied exhaustively to all suitable object pairs, resulting in balanced coverage of all qualitative (14 templates) and quantitative (10 templates) categories.
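The situational example above has a deterministic ground truth: under a constant closing speed, the answer follows from whether the gap reaches zero within the horizon. A minimal sketch of that answer rule (function name assumed for illustration):

```python
def will_collide(distance_m: float, closing_speed_mps: float, horizon_s: float) -> str:
    """Ground-truth answer for the situational collision template: if the gap
    is closing at a constant rate, does it reach zero within the horizon?"""
    if closing_speed_mps <= 0:
        return "no"  # gap is constant or growing
    return "yes" if distance_m / closing_speed_mps <= horizon_s else "no"
```

For instance, a 2 m/s closing speed over a 5 s horizon yields "yes" only when the initial gap is at most 10 m.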
3. Annotation Schema and Data Format
SpatialQA-E is distributed as per-key-frame-view JSON files, structured as follows:
- scene_id: NuScenes internal scene token
- frame_index and camera_name: Precise frame and view identity
- nodes: Objects with attributes (category_name, translation, size, caption)
- edges: Pairwise spatial relations (spatial_distance, offsets, bearing)
- qa_pairs:
- question_id
- question_text
- answer (string or numeric)
- involved_node_ids
- question_type
- ground_truth_value (for quantitative)
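A record following this schema can be loaded with a few lines of standard JSON handling. The field names below match the schema; the values are made-up illustrations, not real dataset entries.

```python
import json

# Illustrative per-key-frame-view record; values are hypothetical.
record_json = """
{
  "scene_id": "scene-0001",
  "frame_index": 12,
  "camera_name": "CAM_FRONT",
  "nodes": [{"instance_token": "n1",
             "category_name": "human.pedestrian.adult",
             "translation": [1.0, 2.0, 0.0],
             "size": [0.6, 0.5, 1.8],
             "caption": "man in tan t-shirt"}],
  "edges": [],
  "qa_pairs": [{"question_id": "q1",
                "question_text": "What is the distance between ...?",
                "answer": "1.25",
                "involved_node_ids": ["n1"],
                "question_type": "quantitative/distance",
                "ground_truth_value": 1.25}]
}
"""

record = json.loads(record_json)
# Quantitative questions carry a numeric ground_truth_value for tolerance checks.
numeric_qa = [q for q in record["qa_pairs"]
              if q["question_type"].startswith("quantitative")]
```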
Annotation adheres to rigorous guidelines:
- Only fully visible objects (visibility = 4) and bounding boxes ≥ 40×40 px
- Concise noun phrases for captions
- All spatial values derived directly from LiDAR data
- No human re-labeling or LLM-based re-scoring is performed; evaluation is purely deterministic
4. Dataset Statistics and Analytical Features
SpatialQA-E is characterized by exhaustive, template-driven coverage:
- Scale: 150 scenes, 6,000 frames, 36,000 images, ≈ 3.3 million QA pairs
- Distribution:
- Qualitative (76%)
- Quantitative (18%)
- Spatial Reasoning (6%)
- Answer Types:
- Binary yes/no: ≈ 60%
- Multiple-choice: ≈ 20%
- Numeric: ≈ 20%
- Balancing: Template expansion and filtering enforce uniform coverage per category; average question length is 8–12 tokens
- Filtering removes occluded or small objects, improving QA signal quality
A plausible implication is that this balancing strategy ensures diagnostic comprehensiveness and prevents under-representation of rare spatial relations.
5. Evaluation Protocol
Evaluation is strictly deterministic, relying on direct comparison of strings (for categorical QA) and tolerance-based checks (for numeric QA):
- Closed-ended (binary/multiple-choice):
- Exact-match Accuracy: Proportion of predictions identical to the ground-truth string
- Quantitative (open-ended numeric):
- Tolerance-based Accuracy: Proportion of answers within a fixed tolerance of the ground truth
- Mean Absolute Error (MAE): Mean of |prediction − ground truth| over all numeric QA pairs
- No LLM-based or subjective scoring is incorporated, ensuring result reproducibility and eliminating ambiguity.
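The full protocol fits in a few lines. This is a minimal sketch; the 10% relative tolerance used below is an assumed value, since the benchmark's exact tolerance is not specified here.

```python
def exact_match_accuracy(preds, golds):
    """Closed-ended QA: case-insensitive exact string match."""
    hits = sum(p.strip().lower() == g.strip().lower()
               for p, g in zip(preds, golds))
    return hits / len(golds)

def tolerance_accuracy(preds, golds, rel_tol=0.1):
    """Numeric QA: fraction of predictions within rel_tol (assumed 10%
    relative tolerance) of the ground-truth value."""
    hits = sum(abs(p - g) <= rel_tol * abs(g) for p, g in zip(preds, golds))
    return hits / len(golds)

def mean_absolute_error(preds, golds):
    """Numeric QA: mean of |prediction - ground truth|."""
    return sum(abs(p - g) for p, g in zip(preds, golds)) / len(golds)
```

Because every metric is a pure function of predictions and ground truth, repeated evaluations of the same outputs always produce identical scores.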
6. Applications, Use Cases, and Limitations
SpatialQA-E is intended for benchmarking and diagnosis of spatial reasoning in VLMs relevant to autonomous driving. Key use cases include:
- Systematic evaluation of spatial understanding modules in VLMs
- Fine-tuning and developing specialized spatial mechanisms
- Studying inter-camera consistency of spatial reasoning
- Analyzing safe-driving performance in proximity and hazard scenarios
Limitations:
- Scene diversity restricted to urban driving in Boston/Singapore with fair weather
- Excludes occluded or small objects (< 40 px), limiting long-range perception assessment
- Language output is template-constrained (no paraphrase or free-text answers)
- Answers are limited to yes/no, multi-choice, or single numeric value
- Serves only as a held-out evaluation set: no train/val/test split is provided, so it is not suitable for supervised pretraining
A plausible implication is that, while robust for evaluation, SpatialQA-E may be less effective for tasks demanding broader environmental variability or linguistic diversity.
7. Context and Significance within Spatial Reasoning Evaluation
SpatialQA-E provides the first large-scale, ground-truth–based QA benchmark with systematic 3D spatial reasoning coverage for real-world driving scenarios, uniquely enabled by multi-view, object-level scene graphs and rigorous, LiDAR-grounded question/answer generation. This approach eliminates reliance on subjective or heuristic annotations and is designed to drive progress in spatial capabilities of VLMs for safety-critical domains (Tian et al., 4 Apr 2025). Early experiments reveal that even spatial-enhanced VLMs exhibit pronounced quantitative reasoning limitations relative to qualitative performance, highlighting open research challenges in fine-grained spatial scene understanding.