SpatialQA-E: Autonomous Driving Spatial QA

Updated 15 February 2026
  • SpatialQA-E is a ground-truth QA benchmark that systematically evaluates spatial reasoning in urban autonomous driving scenarios using an automated 3D scene-graph pipeline.
  • It leverages the NuScenes dataset to generate approximately 3.3M qualitative, quantitative, and reasoning queries with deterministic evaluation metrics.
  • The dataset provides actionable insights into VLM spatial understanding challenges, promoting advancements in fine-grained, object-level scene analysis.

NuScenes-SpatialQA (SpatialQA-E) is a large-scale, ground-truth–based question-answering benchmark designed to systematically evaluate the spatial understanding and reasoning capabilities of vision-language models (VLMs) in autonomous driving contexts. Constructed atop the NuScenes dataset using an automated 3D scene-graph and QA-template pipeline, SpatialQA-E focuses exclusively on evaluation, providing a single, exhaustively instantiated set of qualitative, quantitative, and multi-hop spatial reasoning questions grounded in urban driving scenarios (Tian et al., 4 Apr 2025).

1. Dataset Construction and Structure

SpatialQA-E is built using the validation split of the NuScenes trainval-v1.0 data. This encompasses 150 urban driving scenes, each containing 40 key-frames, resulting in 6,000 key-frames. For every frame, six calibrated camera views (front, rear, and sides) are included, yielding a total of 36,000 paired images.

From this set, approximately 3.3 million question-answer (QA) pairs are generated, comprising:

  • ≈ 2.5 million qualitative spatial understanding QA
  • ≈ 0.6 million quantitative spatial understanding QA
  • ≈ 0.2 million spatial reasoning QA (direct and situational)

No additional train/val/test partition is provided; SpatialQA-E serves as a single evaluation benchmark.

3D Scene-Graph Generation Pipeline:

  • Input: LiDAR-based 3D object annotations and calibrated camera parameters from NuScenes.
  • Processing:

    1. 3D bounding boxes projected onto the 2D image plane using the official nuscenes-devkit script.
    2. Objects whose 2D boxes are smaller than 40 px in width or height, or whose visibility level is below the maximum, are filtered out.
    3. All appearances of the same instance within a scene are grouped.
    4. For each group, the best crop is selected via area and brightness scoring, with 100 px padding applied.
    5. A concise noun-phrase caption for each crop is generated using LLaMA-3.2-Instruct.

Scene Graph Schema:

  • Nodes: Annotated with instance_token, category_name, 3D translation, size (length, width, height), and a generated caption.

  • Edges: For every ordered object pair, spatial_distance, longitudinal_offset, lateral_offset, and relative_bearing_angle are computed using standard metrics.
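A minimal sketch of how these edge attributes might be derived from two object translations. The axis convention (x = longitudinal/forward, y = lateral) is an assumption; the distance follows the 3D Euclidean formula used for the quantitative QA:

```python
import math

def edge_metrics(a, b):
    """Edge attributes for an ordered pair of object centers a -> b,
    each given as a 3D (x, y, z) translation.
    Assumes x = longitudinal (forward) and y = lateral."""
    dx, dy, dz = (b[i] - a[i] for i in range(3))
    return {
        "spatial_distance": math.sqrt(dx * dx + dy * dy + dz * dz),
        "longitudinal_offset": dx,
        "lateral_offset": dy,
        "relative_bearing_angle": math.degrees(math.atan2(dy, dx)),
    }
```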

QA Generation Pipeline:

  • Operates per view, instantiating exhaustive question templates at two levels:

    • Spatial Understanding (Qualitative/Quantitative)
    • Spatial Reasoning (Direct/Situational)
  • Expands over all valid object pairs and attributes, ensuring template and category balance.
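The exhaustive template expansion over ordered object pairs can be sketched as follows; the template string, the left-of convention (positive lateral offset), and the data layout are illustrative assumptions:

```python
from itertools import permutations

# One qualitative template paired with its deterministic answer rule;
# the real pipeline uses 14 qualitative and 10 quantitative templates.
TEMPLATES = [
    ("Is the {a} to the left of the {b}?",
     lambda edge: "yes" if edge["lateral_offset"] > 0 else "no"),
]

def generate_qa(nodes, edges):
    """Instantiate every template over every ordered object pair in a view."""
    qa_pairs = []
    for i, j in permutations(range(len(nodes)), 2):
        for text, answer_fn in TEMPLATES:
            qa_pairs.append({
                "question_text": text.format(a=nodes[i]["caption"],
                                             b=nodes[j]["caption"]),
                "answer": answer_fn(edges[(i, j)]),
            })
    return qa_pairs
```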

2. Question and Answer Taxonomy

SpatialQA-E features a hierarchically organized taxonomy:

Level 1: Spatial Understanding

  • Qualitative (≈ 2.5M QA): Binary or multiple-choice on relations such as above/below, left/right, front/behind, and comparative size (larger/smaller; longer/shorter; taller/shorter; wider/thinner).
    • Example:
    • Q: "Is the man in tan t-shirt and jeans to the left of the black sedan with red logo?"
    • A: "yes"
  • Quantitative (≈ 0.6M QA): Numeric open-ended on absolute or relative position, distance, offsets, dimensions, and angles.
    • Example:
    • Q: "What is the distance between the pedestrian and the white van?"
    • A: "1.25" (computed as d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2 + (z_1 - z_2)^2})

Level 2: Spatial Reasoning

  • Direct Reasoning (≈ 0.1M QA): Multi-hop or multiple-choice questions, such as selecting the closest/largest object or querying existence under a constraint.
    • Example:
    • Q: "Which object is closest to the ego vehicle? (a) man in tan t-shirt; (b) cyclist on right; (c) black sedan; (d) white truck"
    • A: "d"
  • Situational Reasoning (≈ 0.1M QA): Contextual yes/no queries under additional dynamic or environmental assumptions (e.g., collision prediction).
    • Example:
    • Q: "Assume the distance between cyclist and pedestrian is decreasing at 2 m/s. Will they collide within 5 seconds?"
    • A: "yes"
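Under the stated constant-closing-speed assumption, the ground-truth answer for this template reduces to a time-to-collision check (a sketch; the function name is illustrative):

```python
def will_collide(distance_m: float, closing_speed_mps: float,
                 horizon_s: float = 5.0) -> str:
    """'yes' if the gap closes within the horizon at the stated
    constant closing speed, else 'no'."""
    if closing_speed_mps <= 0:
        return "no"  # gap is constant or growing
    return "yes" if distance_m / closing_speed_mps <= horizon_s else "no"
```

For the example above, an 8 m gap closing at 2 m/s closes in 4 s, within the 5 s horizon.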

Templates are applied exhaustively to all suitable object pairs, resulting in balanced coverage of all qualitative (14 templates) and quantitative (10 templates) categories.

3. Annotation Schema and Data Format

SpatialQA-E is distributed as per-key-frame-view JSON files, structured as follows:

  • scene_id: NuScenes internal scene token
  • frame_index and camera_name: Precise frame and view identity
  • nodes: Objects with attributes (category_name, translation, size, caption)
  • edges: Pairwise spatial relations (spatial_distance, offsets, bearing)
  • qa_pairs:
    • question_id
    • question_text
    • answer (string or numeric)
    • involved_node_ids
    • question_type
    • ground_truth_value (for quantitative)
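A hypothetical record illustrating this per-key-frame-view layout; all tokens and values below are invented for illustration, not taken from the released files:

```python
import json

record = {
    "scene_id": "scene-0103-token",      # invented token
    "frame_index": 12,
    "camera_name": "CAM_FRONT",
    "nodes": [
        {"instance_token": "abc123", "category_name": "vehicle.car",
         "translation": [10.2, -3.1, 0.9], "size": [4.5, 1.9, 1.6],
         "caption": "black sedan with red logo"},
    ],
    "edges": [
        {"pair": ["abc123", "def456"], "spatial_distance": 7.8,
         "longitudinal_offset": 6.1, "lateral_offset": 4.9,
         "relative_bearing_angle": 38.8},
    ],
    "qa_pairs": [
        {"question_id": "q-000001",
         "question_text": "Is the black sedan with red logo in front of the ego vehicle?",
         "answer": "yes",
         "involved_node_ids": ["abc123"],
         "question_type": "qualitative"},
    ],
}

# The files are distributed as JSON, so the record must round-trip cleanly.
assert json.loads(json.dumps(record)) == record
```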

Annotation adheres to rigorous guidelines:

  • Only fully visible objects (visibility = 4) and bounding boxes ≥ 40×40 px
  • Concise noun phrases for captions
  • All spatial values derived directly from LiDAR data
  • No human re-labeling or LLM-based re-scoring is performed; evaluation is purely deterministic

4. Dataset Statistics and Analytical Features

SpatialQA-E is characterized by exhaustive, template-driven coverage:

  • Scale: 150 scenes, 6,000 frames, 36,000 images, ≈ 3.3 million QA pairs
  • Distribution:
    • Qualitative (76%)
    • Quantitative (18%)
    • Spatial Reasoning (6%)
  • Answer Types:
    • Binary yes/no: ≈ 60%
    • Multiple-choice: ≈ 20%
    • Numeric: ≈ 20%
  • Balancing: Template expansion and filtering enforce uniform coverage per category, with average question length 8–12 tokens
  • Filtering removes occluded or small objects, improving QA signal quality

A plausible implication is that this balancing strategy ensures diagnostic comprehensiveness and prevents under-representation of rare spatial relations.

5. Evaluation Protocol

Evaluation is strictly deterministic, relying on direct comparison of strings (for categorical QA) and tolerance-based checks (for numeric QA):

  • Closed-ended (binary/multiple-choice):

\mathrm{Accuracy} = \frac{\#\,\text{Correct}}{\#\,\text{Total}}

  • Quantitative (open-ended numeric):

    • Tolerance-based Accuracy: proportion of answers within [75%, 125%] of the ground truth:

      A_\mathrm{tol} = \frac{\#\{\,|\mathrm{pred}_i - \mathrm{GT}_i| \le 0.25\,\mathrm{GT}_i\,\}}{N}

    • Mean Absolute Error (MAE):

      \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \bigl|\mathrm{pred}_i - \mathrm{GT}_i\bigr|

  • No LLM-based or subjective scoring is incorporated, ensuring result reproducibility and eliminating ambiguity.
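The metrics above are simple enough to compute deterministically in a few lines (a sketch; function names are illustrative):

```python
def accuracy(preds, gts):
    """Exact-match accuracy for closed-ended (binary / multiple-choice) QA."""
    return sum(p == g for p, g in zip(preds, gts)) / len(gts)

def evaluate_numeric(preds, gts, tol=0.25):
    """Tolerance accuracy (within +/- 25% of ground truth) and MAE
    for open-ended numeric QA."""
    correct = sum(abs(p - g) <= tol * abs(g) for p, g in zip(preds, gts))
    mae = sum(abs(p - g) for p, g in zip(preds, gts)) / len(gts)
    return correct / len(gts), mae
```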

6. Applications, Use Cases, and Limitations

SpatialQA-E is intended for benchmarking and diagnosis of spatial reasoning in VLMs relevant to autonomous driving. Key use cases include:

  • Systematic evaluation of spatial understanding modules in VLMs
  • Fine-tuning and developing specialized spatial mechanisms
  • Studying inter-camera consistency of spatial reasoning
  • Analyzing safe-driving performance in proximity and hazard scenarios

Limitations:

  • Scene diversity restricted to urban driving in Boston/Singapore with fair weather
  • Excludes occluded or small objects (< 40 px), limiting long-range perception assessment
  • Language output is template-constrained (no paraphrase or free-text answers)
  • Answers are limited to yes/no, multi-choice, or single numeric value
  • Only serves as a held-out test set: no train/test split, not suitable for supervised pretraining

A plausible implication is that, while robust for evaluation, SpatialQA-E may be less effective for tasks demanding broader environmental variability or linguistic diversity.

7. Context and Significance within Spatial Reasoning Evaluation

SpatialQA-E provides the first large-scale, ground-truth–based QA benchmark with systematic 3D spatial reasoning coverage for real-world driving scenarios, uniquely enabled by multi-view, object-level scene graphs and rigorous, LiDAR-grounded question/answer generation. This approach eliminates reliance on subjective or heuristic annotations and is designed to drive progress in spatial capabilities of VLMs for safety-critical domains (Tian et al., 4 Apr 2025). Early experiments reveal that even spatial-enhanced VLMs exhibit pronounced quantitative reasoning limitations relative to qualitative performance, highlighting open research challenges in fine-grained spatial scene understanding.
