DriveDreamer-2: LLM-Enhanced Driving Video Synthesis

Updated 8 February 2026
  • DriveDreamer-2 is an LLM-augmented driving video generation system that employs a modular pipeline for trajectory synthesis, HDMap creation, and unified multi-view video generation.
  • It uses a fine-tuned GPT model to convert natural language queries into agent trajectories and a ControlNet-style latent diffusion process to produce BEV HDMaps that enforce traffic rules.
  • The unified multi-view video model achieves state-of-the-art FID and FVD scores, demonstrating significant improvements in downstream 3D detection and tracking tasks through synthetic data augmentation.

DriveDreamer-2 is an LLM-augmented world model for multi-view driving video generation, designed to enable user-customized, scenario-diverse, and temporally coherent synthesis of driving scenes. Building on the original DriveDreamer pipeline, it introduces an LLM-based trajectory interface, a ControlNet-style conditional latent diffusion HDMap generator, and a Unified Multi-View Video Model (UniMVM) that generates surround-view video consistent with user-specified scenarios and traffic rules. The system achieves state-of-the-art video generation quality as measured by Fréchet Inception Distance (FID) and Fréchet Video Distance (FVD), and offers measurable benefits as synthetic training data for downstream 3D detection and tracking tasks (Zhao et al., 2024).

1. System Architecture and Pipeline

DriveDreamer-2 follows a strictly modular pipeline characterized by three primary components: an LLM-based trajectory interface, a BEV (bird’s-eye view) HDMap generator, and a unified multi-view video generation model.

  1. LLM-Based Trajectory Interface: Natural-language driving scenario queries q \in \mathcal{Q} (e.g., "on a rainy day a car cuts in") are mapped by a fine-tuned GPT-3.5 model into agent trajectories \mathbf{T} = \{T^{(j)}\}_{j=1}^{M}, where T^{(j)} = \{(x_t^{(j)}, y_t^{(j)}, \theta_t^{(j)})\}_{t=1}^{T} encodes per-agent 2D positions and headings. The LLM is fine-tuned on a text-to-Python-script dataset and calls primitives such as agent.cut_in(), agent.u_turn(), and pedestrian.walk(). This structured approach yields deterministic trajectory outputs matching high-level scene constraints.
  2. BEV HDMap Generation: Given a rasterized BEV trajectory map \mathcal{T}_b \in \mathbb{R}^{3 \times H_b \times W_b}, a conditional latent diffusion network produces a BEV HDMap \mathcal{H}_b \in \mathbb{R}^{3 \times H_b \times W_b} with channels for lane boundaries, dividers, and crossings. The loss is a ControlNet-style denoising score-matching objective:

\min_{\phi} \; \mathbb{E}_{\mathcal{Z}_0, \epsilon, t} \; \Big\| \epsilon - \epsilon_\phi(\mathcal{Z}_t, t, \mathcal{T}_b) \Big\|_2^2

where \mathcal{Z}_t is a latent obtained through forward diffusion. The co-registration of trajectory and map elements in BEV enforces traffic regulations in generated HDMaps.

  3. Unified Multi-View Video Model (UniMVM): A set of per-frame HDMaps \{\mathcal{H}_i\}_{i=1}^{N} and 3D box rasters \{\mathcal{B}_i\}_{i=1}^{N} condition UniMVM, which generates surround-view video \mathbf{V} \in \mathbb{R}^{K \times N \times 3 \times H \times W} for K = 6 synchronized cameras. UniMVM concatenates all K camera frames along the width into a "panorama" \mathbf{x}', ensuring spatial-temporal and cross-view consistency within a single diffusion process. Video generation is non-autoregressive, conditioned on latent embeddings of the HDMap, object boxes, and text prompt.
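
The three-stage dataflow can be sketched with stand-in functions carrying the tensor shapes quoted above (all module internals are hypothetical placeholders, not the paper's code; the agent count M is illustrative):

```python
import numpy as np

# Shape constants from the paper's setup.
M, T_LEN = 3, 8               # agents, trajectory length (illustrative M)
HB = WB = 512                 # BEV HDMap resolution
K, N, H, W = 6, 8, 256, 448   # cameras, frames, frame height/width

def llm_trajectory_interface(query: str) -> np.ndarray:
    """Stage 1 (stand-in): map a language query to agent trajectories
    T[j, t] = (x, y, heading); the real system calls a fine-tuned GPT-3.5."""
    rng = np.random.default_rng(0)
    return rng.standard_normal((M, T_LEN, 3))

def hdmap_generator(traj_raster: np.ndarray) -> np.ndarray:
    """Stage 2 (stand-in): conditional latent diffusion over BEV HDMaps.
    Channels: lane boundaries, dividers, crossings."""
    return np.zeros((3, HB, WB))

def unimvm(hdmaps: np.ndarray, box_rasters: np.ndarray) -> np.ndarray:
    """Stage 3 (stand-in): unified multi-view video, K cameras x N frames."""
    return np.zeros((K, N, 3, H, W))

traj = llm_trajectory_interface("on a rainy day a car cuts in")
hdmap = hdmap_generator(np.zeros((3, HB, WB)))
video = unimvm(np.stack([hdmap] * N), np.zeros((N, 3, HB, WB)))
print(video.shape)  # (6, 8, 3, 256, 448)
```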

2. LLM-Driven Trajectory Generation

DriveDreamer-2's use of an LLM for trajectory synthesis is notable for its direct mapping from user intent to multi-agent motion plans. The system is trained on text-to-script pairs, allowing the LLM to select and compose reusable driving maneuvers (cut-in, U-turn, braking, pedestrian crossing) through interpretable Python primitives. At inference, queries deterministically produce per-agent trajectories \mathbf{T}, enabling faithful reproduction of rare or complex traffic events. This module enables high-level scenario diversity and supports straightforward user customization of scene semantics (Zhao et al., 2024).
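As an illustration of the text-to-script interface, the kind of program the fine-tuned LLM emits can be mimicked with a toy class (the class and its plan list are hypothetical; only the primitive names such as agent.cut_in() and pedestrian.walk() appear in the paper):

```python
class Actor:
    """Toy stand-in for the driving-primitive API the LLM scripts against."""
    def __init__(self, name: str):
        self.name = name
        self.plan = []          # ordered maneuver list (hypothetical)

    def cut_in(self):
        self.plan.append("cut_in")
        return self

    def u_turn(self):
        self.plan.append("u_turn")
        return self

    def walk(self):
        self.plan.append("walk")
        return self

# A query like "a car cuts in, then a pedestrian crosses" could compile to:
agent = Actor("car_0").cut_in()
pedestrian = Actor("pedestrian_0").walk()
print(agent.plan, pedestrian.plan)  # ['cut_in'] ['walk']
```

Because each primitive expands to a concrete trajectory template, composing them keeps the output deterministic for a given query.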

3. BEV HDMap Synthesis

The BEV HDMap generator is realized as a ControlNet-inspired conditional latent diffusion process. Input trajectories are rasterized into three-channel images distinguishing ego, vehicles, and pedestrians by color encoding. Output maps comprise lane boundaries, dividers, and crossings, with the generator trained to preserve correct traffic topology by leveraging the geometric co-registration within the BEV. The map diffusion process operates at 512 × 512 spatial resolution with a batch size of 24, and is optimized for 55,000 iterations via AdamW with a learning rate of 5 × 10⁻⁵ on NVIDIA A800 hardware. No adversarial or perceptual losses are used; semantic fidelity is maintained solely through the diffusion score-matching objective.
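A minimal sketch of the denoising score-matching objective above, with a dummy noise-prediction network standing in for the ControlNet-conditioned U-Net (the function names and the linear noise schedule are assumptions, not taken from the paper):

```python
import numpy as np

def training_step(z0, traj_cond, eps_phi, alpha_bar, rng):
    """One DDPM-style step: sample t and eps, form z_t by forward diffusion,
    and return ||eps - eps_phi(z_t, t, T_b)||^2 averaged over elements."""
    t = int(rng.integers(len(alpha_bar)))
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return float(np.mean((eps - eps_phi(z_t, t, traj_cond)) ** 2))

rng = np.random.default_rng(0)
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed schedule
z0 = rng.standard_normal((4, 64, 64))      # latent of the BEV HDMap
cond = np.zeros((3, 512, 512))             # rasterized trajectory map T_b
loss = training_step(z0, cond, lambda z, t, c: np.zeros_like(z), alpha_bar, rng)
print(round(loss, 3))  # close to 1.0: a zero predictor on unit-variance noise
```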

4. Unified Multi-View Video Generation

The UniMVM synthesizes temporally and spatially coherent surround-view video, addressing key limitations of prior view-wise or back-projection-based approaches. All K = 6 view frames per timestep are concatenated along the width dimension, yielding a multi-camera panorama that is processed by an encoder–decoder U-Net with 3D temporal convolutions and cross-attention to the conditioning vector c. Masked video infilling is supported by factorizing the generative distribution over masked and observed regions:

p(\mathbf{x}' \mid c) = p\bigl(\mathbf{x}' \cdot m \mid c\bigr)\; p\bigl(\mathbf{x}' \cdot (1-m) \mid \mathbf{x}' \cdot m,\, c\bigr)

The video diffusion model is trained for 200,000 iterations at batch size 1, with spatial resolution 256 × 448 and video length N = 8, using only ground-truth pixel supervision and no adversarial or perceptual losses.
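The width-wise panorama and the masked-infilling factorization can be illustrated with plain array operations (the mask layout, observing only the first frame, is an assumption matching the first-frame-conditioning setting):

```python
import numpy as np

K, N, C, H, W = 6, 8, 3, 256, 448
frames = np.zeros((K, N, C, H, W))          # per-camera frame stack

# Concatenate the K views along width: one wide image per timestep, so a
# single diffusion pass sees all cameras and keeps them mutually consistent.
panorama = np.concatenate(list(frames), axis=-1)   # (N, C, H, K*W)

# Masked video infilling: m = 1 marks observed pixels (first frame here);
# the model samples p(x'·(1-m) | x'·m, c) for the remainder.
m = np.zeros((N, 1, 1, 1))
m[0] = 1.0
observed, to_generate = panorama * m, panorama * (1 - m)
print(panorama.shape)  # (8, 3, 256, 2688)
```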

5. Training Protocol and Supervision

DriveDreamer-2 leverages the nuScenes dataset (700 training and 150 validation sequences, six surround cameras at 12 Hz, ~1 million frames). Ground-truth projections, HDMaps, and 3D bounding boxes are rasterized at aligned resolutions. Video frames are downsampled to 4 Hz for training and evaluation, and synthetic video clips of length N = 8 serve as training examples. Training objectives are strictly score-matching for both the HDMap and video generation modules, reflecting the system's reliance on expressive latent diffusion architectures.
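The frame-rate reduction and clip extraction can be sketched as follows (the 12 Hz frame indices and the non-overlapping clip split are assumptions about the data preparation):

```python
# Subsample frames from 12 Hz to 4 Hz (stride 3), then cut
# non-overlapping training clips of N = 8 frames.
STRIDE, N = 3, 8
frame_ids = list(range(240))            # hypothetical 12 Hz frame indices
at_4hz = frame_ids[::STRIDE]            # 80 frames at 4 Hz
clips = [at_4hz[i:i + N] for i in range(0, len(at_4hz) - N + 1, N)]
print(len(clips), clips[0])  # 10 clips; first is [0, 3, 6, ..., 21]
```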

Training Parameters Table

| Component | Iterations | Batch Size | Resolution | Learning Rate |
|---|---|---|---|---|
| HDMap Generator | 55,000 | 24 | 512 × 512 | 5 × 10⁻⁵ |
| Video Generator (UniMVM) | 200,000 | 1 | 256 × 448 × 8 | 5 × 10⁻⁵ |

6. Quantitative and Qualitative Evaluation

Video synthesis quality is assessed via two metrics:

  • Fréchet Inception Distance (FID):

\mathrm{FID}(r,g) = \|\mu_r - \mu_g\|^2 + \mathrm{Tr}\big(C_r + C_g - 2(C_r C_g)^{1/2}\big)

using Inception v3 features to compare real and generated frames.

  • Fréchet Video Distance (FVD):

Analogously defined using I3D video features to capture temporal dynamics.
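
A numerically stable way to evaluate the FID formula above, given Gaussian statistics of real and generated features, uses the symmetric rewriting Tr((C_r C_g)^{1/2}) = Tr((C_g^{1/2} C_r C_g^{1/2})^{1/2}); this is a sketch, with Inception v3 feature extraction omitted:

```python
import numpy as np

def sqrtm_psd(a: np.ndarray) -> np.ndarray:
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def fid(mu_r, cov_r, mu_g, cov_g) -> float:
    """FID(r,g) = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^{1/2})."""
    s = sqrtm_psd(cov_g)
    covmean = sqrtm_psd(s @ cov_r @ s)   # symmetric form of (C_r C_g)^{1/2}
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Identical distributions give FID = 0; shifting the mean by d adds ||d||^2.
mu, cov = np.zeros(4), np.eye(4)
print(round(fid(mu, cov, mu, cov), 6))        # 0.0
print(round(fid(mu, cov, mu + 2.0, cov), 6))  # 16.0
```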

Under first-frame multi-view conditioning, DriveDreamer-2 achieves FID = 11.2 and FVD = 55.7, compared to previous bests of approximately 16.9 (FID) and 139 (FVD), representing relative reductions of 34% and 60%. Without any image conditioning, FID = 25.0 and FVD = 105.1 (relative improvements of 30% and 70% over corresponding unconditional baselines). The generated videos maintain consistent object geometry, appearance, and background across all views and frames, surpassing the temporal and spatial coherence of prior methods (Zhao et al., 2024).

7. Downstream Impact: Data Augmentation in 3D Perception Tasks

Integration of DriveDreamer-2–generated synthetic videos as data augmentation for the StreamPETR 3D detection and tracking pipeline results in measurable improvements:

  • 3D Detection (mAP / NDS):
    • Real only: 31.7 / 43.5.
    • + Synthetic (first-frame conditioned): 32.6 / 45.2 (+3.9% NDS).
    • + Synthetic (unconditioned): 32.9 / 45.4 (+4.4% NDS).
  • Multi-Object Tracking (AMOTA / AMOTP):
    • Real only: 28.9 / 1.419.
    • + Synthetic: 31.3 / 1.387 (+8.3% AMOTA, −2.3% AMOTP).
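
The relative gains quoted above follow directly from the absolute scores; a quick check (pure arithmetic, no assumptions):

```python
def rel_gain(base: float, new: float) -> float:
    """Percent change of a metric relative to the real-data-only baseline."""
    return 100.0 * (new - base) / base

print(round(rel_gain(43.5, 45.4), 1))    # +4.4 (NDS, unconditioned synthetic)
print(round(rel_gain(28.9, 31.3), 1))    # +8.3 (AMOTA)
print(round(rel_gain(1.419, 1.387), 1))  # -2.3 (AMOTP, lower is better)
```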

These findings confirm that synthetic videos generated by DriveDreamer-2 deliver improvements in downstream 3D perception benchmarks, attributed to the high fidelity and scenario diversity enabled by the LLM-augmented pipeline.


For detailed architectural and methodological specifics, as well as comprehensive benchmarks and ablation studies, see "DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation" (Zhao et al., 2024).
