DriveLM-nuScenes: Graph-VQA Benchmark
- DriveLM-nuScenes is a comprehensive benchmark featuring graph-structured QA and multi-modal sensor data for end-to-end driving reasoning.
- It pairs six RGB cameras, lidar, radar, and GPS/IMU data with an average of 91 QA pairs per frame, enabling fine-grained perception, prediction, and planning evaluation.
- It drives advanced vision-language research by integrating context-aware graph prompting, multi-round reasoning, and rigorous evaluation protocols for autonomous decision making.
DriveLM-nuScenes is a large-scale, graph-structured, multi-modal benchmark for end-to-end autonomous driving reasoning, perception, and planning, built directly on the nuScenes dataset. It supports advanced vision-language model (VLM) research for driving by offering a comprehensive evaluation framework, diverse and fine-grained QA graph annotation, and rigorous open-loop and closed-loop task protocols. DriveLM-nuScenes has been widely adopted in competitive driving benchmarks due to its complex scenario coverage, graph-based question–answer structure, and relevance to next-generation end-to-end learning methods.
1. Dataset Structure and Annotation Protocol
DriveLM-nuScenes is instantiated directly from the nuScenes dataset by leveraging both raw sensor data (six RGB cameras covering 360° views, lidar point clouds, radar, GPS/IMU) and ground-truth 3D annotations. Its unique annotation process produces for each selected frame an extensive set of natural language question–answer (QA) pairs, which are systematically organized into directed graphs. QAs span the following hierarchical domains:
- Perception: Identification and detailed localization of key objects (e.g., vehicles, cyclists, traffic lights) and complex scene descriptions.
- Prediction: Queries about the future motion of agents (e.g., “Is the pedestrian about to cross?”, “Which vehicle will yield?”), temporal relationships, and occlusion reasoning.
- Planning: Inference of appropriate actions and high-level maneuvers (e.g., “Should the ego vehicle brake?”, “What is the safest next move in the scenario?”), conditional on perception–prediction context.
Each frame in DriveLM-nuScenes is annotated with, on average, 91 QA pairs, and total coverage extends to thousands of annotated keyframes and hundreds of thousands of QA pairs (Sima et al., 2023). The annotation pipeline is semi-rule-based, using automated extraction from nuScenes and OpenLane-V2 for initial QA generation, followed by labor-intensive human annotation and multi-stage verification for logical coherence.
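For concreteness, the Python sketch below models one annotated keyframe as a set of QA nodes grouped by task, with parent links encoding graph edges. The field names, tokens, and example content are illustrative and do not reproduce the exact released DriveLM-nuScenes JSON schema.

```python
# Illustrative structure only -- not the exact DriveLM-nuScenes annotation schema.
from dataclasses import dataclass, field


@dataclass
class QANode:
    qa_id: str                                    # unique node id within the frame
    task: str                                     # "perception" | "prediction" | "planning"
    question: str
    answer: str
    parents: list = field(default_factory=list)   # qa_ids whose answers feed this node


@dataclass
class Keyframe:
    scene_token: str                              # nuScenes scene identifier
    sample_token: str                             # nuScenes keyframe identifier
    key_objects: dict = field(default_factory=dict)
    qa_nodes: list = field(default_factory=list)


frame = Keyframe(
    scene_token="scene-0061",
    sample_token="sample-placeholder",            # placeholder token
    key_objects={"<c1,CAM_FRONT,800.0,500.0>": "pedestrian near crosswalk"},
    qa_nodes=[
        QANode("q0", "perception", "What objects are in front of the ego vehicle?",
               "A pedestrian <c1,CAM_FRONT,800.0,500.0> near the crosswalk."),
        QANode("q1", "prediction", "Is the pedestrian about to cross?",
               "Yes, the pedestrian is likely to cross.", parents=["q0"]),
        QANode("q2", "planning", "Should the ego vehicle brake?",
               "Yes, the ego vehicle should slow down and yield.", parents=["q1"]),
    ],
)
```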
2. Graph-Structured Visual Question Answering (Graph VQA)
The Graph VQA ("GVQA") paradigm is central to DriveLM-nuScenes, distinguishing it from conventional VQA benchmarks (Qian et al., 2023, Sima et al., 2023):
- Graph Construction: Each frame’s QAs are represented as nodes. Directed edges encode both object-level logical dependencies (e.g., an object's state affects another’s prediction) and task-level sequencing (e.g., perception → prediction → planning).
- Multi-Round Reasoning: Unlike flat VQA, the model must process a reasoning sequence in which answers from upstream nodes (earlier questions) are explicitly provided as context to downstream nodes.
- Task-Transition Hierarchy: Separate graph edges capture transitions between high-level tasks—perception, prediction, planning, behavior aggregation, and finally motion output.
This graph-based structure enforces multi-step, context-aware, and causally coherent evaluation, closely mimicking real-world driving cognition processes.
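A minimal sketch of how this multi-round traversal can be driven, reusing the hypothetical QANode/Keyframe structures from the earlier sketch and a generic answer_fn standing in for the VLM: nodes are visited in topological order, and parent answers are prepended as context to each downstream question.

```python
from collections import deque


def topological_order(nodes):
    """Order QA nodes so that every parent is answered before its children."""
    by_id = {n.qa_id: n for n in nodes}
    indegree = {n.qa_id: len(n.parents) for n in nodes}
    children = {n.qa_id: [] for n in nodes}
    for n in nodes:
        for p in n.parents:
            children[p].append(n.qa_id)
    queue = deque(i for i, deg in indegree.items() if deg == 0)
    order = []
    while queue:
        i = queue.popleft()
        order.append(by_id[i])
        for c in children[i]:
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)
    return order


def run_graph_vqa(frame, answer_fn):
    """answer_fn(prompt: str) -> str stands in for a VLM call; answers to
    parent nodes are prepended as context for each downstream question."""
    by_id = {n.qa_id: n for n in frame.qa_nodes}
    answers = {}
    for node in topological_order(frame.qa_nodes):
        context = "".join(
            f"Context Q: {by_id[p].question}\nContext A: {answers[p]}\n"
            for p in node.parents
        )
        answers[node.qa_id] = answer_fn(f"{context}Question: {node.question}")
    return answers


# Example with a trivial stand-in for the model:
# predicted = run_graph_vqa(frame, answer_fn=lambda prompt: "placeholder answer")
```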
3. Model Baselines, Training Strategies, and Task Formulation
The DriveLM Agent baseline (Sima et al., 2023) is a VLM (notably BLIP-2 and, in later work, InternVL or LLaVA derivatives) trained to perform graph VQA and end-to-end driving. Key architectural and methodological features include:
- Input Pipelines: Six multi-view camera images are processed as a unified high-resolution composite (e.g., arranged in a 2×3 grid at 2688×896, i.e., 896×448 per view) (Li et al., 10 Dec 2024, Huang et al., 5 Nov 2024); a tiling sketch follows this list. Object localization is augmented with normalized bounding boxes or center points, with center points often expanded to full boxes via segmentation models (e.g., Segment Anything) for stronger spatial grounding.
- Prompt Engineering: Explicit graph context (parent QA and answer) is concatenated as a contextual prefix for each question node. Full scene prompts include descriptions, object lists, depth estimates, and chain-of-thought (CoT) reasoning traces (Qian et al., 2023, Peng et al., 14 Sep 2025).
- Output Representations: For planning, models predict behavior tokens (e.g., "Accelerate", "Brake"), full natural language rationales, and discretized trajectories (as waypoints or as quantized speed and heading commands).
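As a concrete illustration of the multi-view input handling in the first bullet above, the sketch below tiles six camera views into a 2×3 composite with PIL. The per-view resolution follows the figures quoted above; the camera ordering is an assumption.

```python
from PIL import Image

# Assumed per-view size and camera ordering (ordering is illustrative, not prescribed).
VIEW_W, VIEW_H = 896, 448
CAMERA_ORDER = [
    "CAM_FRONT_LEFT", "CAM_FRONT", "CAM_FRONT_RIGHT",   # top row
    "CAM_BACK_LEFT", "CAM_BACK", "CAM_BACK_RIGHT",      # bottom row
]


def tile_views(images: dict) -> Image.Image:
    """Arrange six camera images (camera name -> PIL.Image) into a 2x3
    composite of 2688x896 pixels, matching the resolution cited above."""
    canvas = Image.new("RGB", (3 * VIEW_W, 2 * VIEW_H))
    for idx, cam in enumerate(CAMERA_ORDER):
        view = images[cam].resize((VIEW_W, VIEW_H))
        row, col = divmod(idx, 3)
        canvas.paste(view, (col * VIEW_W, row * VIEW_H))
    return canvas


# Usage (paths are placeholders):
# views = {cam: Image.open(f"{cam}.jpg") for cam in CAMERA_ORDER}
# composite = tile_views(views)
```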
Training leverages both supervised fine-tuning (SFT) with QA graphs and densely annotated scenes, and—for top results—complex hybrid losses including cross-entropy for language, coordinate/token loss for localization, and sometimes explicit collision or drivability penalization (Huang et al., 5 Nov 2024, Li et al., 10 Dec 2024). Leading models also utilize LoRA or DoRA for efficient adaptation (Peng et al., 14 Sep 2025).
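A minimal LoRA setup sketch using the Hugging Face peft library; the base model below is a small text-only stand-in rather than an actual VLM backbone, and the hyperparameters and target modules are illustrative (BLIP-2, InternVL, or LLaVA derivatives expose different module names).

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Text-only stand-in; in practice this would be the language backbone of a VLM.
model = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    r=16,                       # low-rank dimension (illustrative hyperparameter)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["c_attn"],  # gpt2 attention projection; VLMs use e.g. q_proj/v_proj
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters remain trainable
```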
4. Evaluation Protocols, Metrics, and Competition Results
DriveLM-nuScenes is the official dataset for major autonomous driving language–vision competitions (e.g., PRCV 2024, CVPR 2024 Autonomous Grand Challenge). Evaluation spans both open-ended and structured QA, as well as trajectory/motion prediction.
Key metrics include:
- Final Score: Aggregate of accuracy, BLEU-1/2/3/4, ROUGE-L, CIDEr, and ChatGPT-based evaluation (Li et al., 10 Dec 2024, Huang et al., 5 Nov 2024).
- Match/Localization: Alignment between model output bounding boxes/centers and ground-truth object annotations.
- Planning Quality: L2 distance between predicted and reference trajectories, and collision rates (percentage of predictions resulting in simulated collisions within the future horizon) (Huang et al., 25 Aug 2024, Li et al., 23 Jun 2025); see the metric sketch after this list.
- VQA-Score: For models answering multi-step QAs, the rate of correct answers across graph nodes, on both perception (object-level) and planning (decision-level) nodes.
- Generalization: Zero-shot and cross-domain evaluation, e.g., testing DriveLM-nuScenes-trained models directly on Waymo or CARLA data (Sima et al., 2023, Huang et al., 25 Aug 2024).
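To make the planning metrics concrete, the sketch below computes mean L2 error at fixed horizons and a simple distance-threshold collision proxy. The official evaluations check box overlap or occupancy, so the proxy, sampling rate, and thresholds here are assumptions.

```python
import numpy as np


def l2_at_horizons(pred, gt, hz=2, horizons_s=(1, 2, 3)):
    """pred, gt: (T, 2) arrays of BEV waypoints sampled at `hz` Hz.
    Returns mean L2 error (metres) accumulated up to each horizon."""
    errors = np.linalg.norm(pred - gt, axis=-1)
    return {f"L2@{h}s": float(errors[: h * hz].mean()) for h in horizons_s}


def collision_proxy(pred, agents, radius=1.5):
    """Fraction of predicted waypoints closer than `radius` metres to any other
    agent centre (a crude stand-in for box-overlap collision checks)."""
    dists = np.linalg.norm(agents - pred[None, :, :], axis=-1)  # agents: (N, T, 2) -> (N, T)
    return float((dists.min(axis=0) < radius).mean())


# Example: 3 s of waypoints at 2 Hz (6 points), one other agent.
pred = np.cumsum(np.tile([2.0, 0.0], (6, 1)), axis=0)
gt = np.cumsum(np.tile([1.9, 0.1], (6, 1)), axis=0)
agents = np.tile([5.0, 3.5], (1, 6, 1))
print(l2_at_horizons(pred, gt), collision_proxy(pred, agents))
```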
Top-performing systems—such as those based on InternVL-2.0 and fine-tuned LLaVA variants—integrate multi-view spatial handling, depth estimation (Depth Anything), graph context prompting, and chain-of-thought reasoning to achieve leaderboard scores between 0.60 and 0.78 (Huang et al., 5 Nov 2024, Peng et al., 14 Sep 2025).
5. Methodological Innovations and Ablation Analyses
DriveLM-nuScenes has catalyzed several advances in multimodal driving research:
- Reasoning-Decision Alignment: Models like RDA-Driver maximize alignment between CoT explanations and trajectory predictions by applying token-average scoring and contrastive ranking losses, improving both interpretability and planning accuracy (L2 error as low as 0.82m, collision rates ≈ 0.38%) (Huang et al., 25 Aug 2024).
- Spatiotemporal Embedding: Incorporation of explicit 3D tracking representations enables improved perception and trajectory forecasting, yielding ~9.5% accuracy gain and substantial boosts in semantic alignment metrics (Ishaq et al., 18 Mar 2025).
- Reinforcement Learning Optimization: Drive-R1 leverages reinforcement learning (GRPO) to refine reasoning paths, maximizing a composite reward balanced across trajectory accuracy and policy diversity. This achieves L2 trajectory errors below 0.6m at 3s and minimizes collision rates (~0.10) (Li et al., 23 Jun 2025).
- Depth and Spatial Context Integration: Augmenting textual prompts with object-level depth (the 75th percentile of depth values within each bounding box) and coarse spatial descriptors ("close"/"far") improves reasoning on spatially ambiguous QAs (Peng et al., 14 Sep 2025); a depth-extraction sketch follows this list.
- Annotation Preprocessing: Systematic transformation of center-point annotations to full bounding boxes using segmentation models improves visual grounding quality for multimodal LLMs (Li et al., 10 Dec 2024).
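A sketch of the depth-prompting idea referenced in the bullet above: given a per-pixel metric depth map (e.g., from a monocular depth estimator) and a 2D box, take the 75th percentile of depth inside the box and map it to a coarse descriptor. The near/far threshold is an assumed value, not taken from the cited work.

```python
import numpy as np


def object_depth_descriptor(depth_map, box, near_threshold_m=10.0):
    """depth_map: (H, W) metric depth in metres; box: (x1, y1, x2, y2) in pixels.
    Returns the 75th-percentile depth inside the box and a coarse label.
    The 10 m near/far threshold is illustrative, not from the paper."""
    x1, y1, x2, y2 = (int(v) for v in box)
    patch = depth_map[y1:y2, x1:x2]
    d75 = float(np.percentile(patch, 75))
    label = "close" if d75 < near_threshold_m else "far"
    return d75, label


# Example with synthetic depth for a 900x1600 camera image:
depth = np.random.uniform(2.0, 40.0, size=(900, 1600))
d75, label = object_depth_descriptor(depth, box=(700, 400, 900, 520))
print(f"object depth ~ {d75:.1f} m ({label})")
```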
These design choices are validated through rigorous ablation, which demonstrates sensitivity to coordinate normalization, prompt structure, and input formatting (e.g., loss of precision with coordinate scaling reduces match scores).
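As a small illustration of the coordinate-handling sensitivity noted above, the helpers below normalize a pixel bounding box to [0, 1] with explicit rounding and format a textual object reference. The reference template is hypothetical, and coarse rounding shows where localization precision can be lost.

```python
def normalize_box(box, img_w=1600, img_h=900, precision=3):
    """Normalize a pixel (x1, y1, x2, y2) box to [0, 1]; nuScenes camera images
    are 1600x900. Low `precision` mimics the quantization loss noted above."""
    x1, y1, x2, y2 = box
    return (
        round(x1 / img_w, precision), round(y1 / img_h, precision),
        round(x2 / img_w, precision), round(y2 / img_h, precision),
    )


def object_reference(tag, camera, cx, cy):
    """Hypothetical textual object reference embedding a camera name and a 2D
    centre point, in the spirit of DriveLM-style object identifiers."""
    return f"<{tag},{camera},{cx:.1f},{cy:.1f}>"


print(normalize_box((700, 400, 900, 520)))
print(object_reference("c1", "CAM_FRONT", 800.0, 460.0))
```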
6. Research Applications and Future Directions
DriveLM-nuScenes serves as a core benchmark for evaluating the interaction between vision-language reasoning, scene comprehension, and embodied planning under realistic urban driving scenarios. Its adoption has yielded:
- Improved Generalization: Models trained on DriveLM-nuScenes often transfer better to different domains or sensor configurations, especially when enhanced via domain-adaptive modules (e.g., Uni3D, multi-stage goal selection for trajectory prediction) (Zhang et al., 2023, Grimm et al., 24 Jul 2025).
- Objective QA Assessment: The structured graph QA format is leveraged in downstream benchmarks such as AutoDrive-QA, which converts open annotations into multiple-choice questions with domain-tuned distractors, allowing robust accuracy-based evaluation (e.g., GPT-4V reaches ≈69.57% overall accuracy, with substantial margin for improvement on prediction/planning tasks) (Khalili et al., 20 Mar 2025).
- Spatial Reasoning Benchmarks: SpatialQA and related efforts recast driving scenes as systematic 3D scene graphs, enabling the diagnosis of strengths and weaknesses in spatial reasoning and geometric understanding (notably, high qualitative but poor quantitative spatial performance in large VLMs) (Tian et al., 4 Apr 2025).
Future research avenues prompted by the DriveLM-nuScenes paradigm include closed-loop evaluation protocols (to counteract metric bias from motion history), neural–symbolic integration (incorporating knowledge graphs and explicit map semantics for trajectory planning), and scaling of interactive reasoning for real-time driving assistant systems.
DriveLM-nuScenes is thus a foundational resource for advancing research at the intersection of language reasoning, visual perception, and robust autonomous decision making in realistic multiview urban driving contexts.