DriveSora: Diffusion Model for Autonomous Driving Videos

Updated 24 November 2025
  • DriveSora is a diffusion-based video generation model that produces spatiotemporally consistent, multi-view driving videos using structured 3D scene layouts.
  • It integrates a Spatial-Temporal Diffusion Transformer with ControlNet to conditionally synthesize realistic training data for addressing long-tail failure modes in autonomous driving.
  • DriveSora enhances safety-critical planning by reducing collision rates and L2 errors, outperforming conventional retrieval- and augmentation-based methods.

DriveSora is a diffusion-based video generation model designed to produce spatiotemporally consistent, multi-view driving videos conditioned on structured 3D scene layouts, with the principal application of addressing long-tail failure modes in autonomous driving end-to-end (E2E) planning systems. It is the generative core of the CorrectAD pipeline, facilitating fully automated, controllable training data synthesis to correct rare, safety-critical errors that conventional retrieval- or augmentation-based methods cannot effectively address (Ma et al., 17 Nov 2025).

1. Motivation and Problem Setting

Data-driven E2E planners for autonomous vehicles, such as UniAD and VAD, exhibit pronounced brittleness in handling rare "long-tail" scenarios, characterized by low visibility, dense traffic, and corner cases, leading to safety violations including ego-vehicle collisions and infractions. These failure cases are formalized as:

$$D^{fail} = \{ (X, Y) \in D^{train} \mid \exists\, t \leq T_{e2e},\ \exists\, j : \| p_{ego}(t) - p_{other}^{j}(t) \| < \epsilon \}$$

with $\epsilon$ as a safety threshold. Conventional retrieval-based approaches (e.g., AIDE) rely solely on existing data, lacking the ability to synthesize novel conditions or exercise fine-grained control. DriveSora addresses this limitation by enabling targeted, high-fidelity video generation precisely aligned with the underrepresented or hazardous scenarios identified by a "PM-Agent."
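
For concreteness, a minimal sketch of this failure-mining criterion is given below; the trajectory dictionary layout, the `horizon` argument, and the default threshold are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def mine_failure_cases(train_set, horizon, eps=0.5):
    """Collect samples whose ego trajectory comes within `eps` meters of any
    other agent inside the planning horizon (the set D^fail above)."""
    failures = []
    for X, Y in train_set:
        # Y["ego"]: (T, 2) ego positions p_ego(t); Y["others"]: (T, J, 2) agent positions.
        ego = np.asarray(Y["ego"])[:horizon]
        others = np.asarray(Y["others"])[:horizon]
        # Distance from the ego vehicle to every agent j at every timestep t.
        dists = np.linalg.norm(others - ego[:, None, :], axis=-1)
        if np.any(dists < eps):            # exists t <= T_e2e and j with ||.|| < eps
            failures.append((X, Y))
    return failures
```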

Within CorrectAD, DriveSora’s role is to synthesize multi-view video data $X^{gen}$, precisely aligned with semantic and geometric requirements $R = (c, e)$ (a scene caption and a structured 3D layout) supplied by the PM-Agent. This enables direct, iterative self-correction of E2E planners by expanding the effective support of the training distribution with diverse, realistic, and annotated examples tailored to observed failure modes (Ma et al., 17 Nov 2025).

2. Model Architecture and Conditional Generation

DriveSora extends the Spatial-Temporal Diffusion Transformer (STDiT) architecture to multi-view, structured-layout-conditioned video synthesis, supporting precise control over scene content at both the semantic and pixel levels.

2.1 Conditioning Pipeline

  • Semantic Input: Scene caption $c$ (from a Vision-LLM) is embedded via a T5 encoder, $E_{text}(c)$.
  • 3D Layout Encoding:
    • Foreground ($e^{fore}$): Instances encoded as tuples $\{ b_n, m_n, u_n, v_n \}_n$, representing bounding boxes, headings, instance IDs, and dense captions, are mapped to embedding vectors via Fourier features and MLPs: $\beta^{box} = \mathrm{MLP}(\mathrm{Fe}(B) + \mathrm{Fe}(M) + \mathrm{Fe}(U) + E_{text}(V))$ (a minimal encoding sketch follows this list).
    • Background ($e^{back}$): Rasterized colored road maps, encoded via a variational autoencoder into $\beta^{road} = E_{image}(e^{back})$.
  • Noise Latents: $z_{in} \sim \mathcal{N}(0, I)$.
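
A minimal sketch of the foreground-layout branch is shown below; the tensor shapes, hidden sizes, and number of Fourier bands are assumptions for illustration, with only the equation $\beta^{box} = \mathrm{MLP}(\mathrm{Fe}(B) + \mathrm{Fe}(M) + \mathrm{Fe}(U) + E_{text}(V))$ taken from the description above.

```python
import torch
import torch.nn as nn

def fourier_features(x, num_bands=8):
    """Map raw scalars to sin/cos features at multiple frequencies (Fe(.))."""
    freqs = 2.0 ** torch.arange(num_bands, dtype=x.dtype, device=x.device) * torch.pi
    angles = x.unsqueeze(-1) * freqs                     # (..., D, num_bands)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(-2)

class ForegroundLayoutEncoder(nn.Module):
    """beta_box = MLP(Fe(B) + Fe(M) + Fe(U) + E_text(V)), one token per instance."""
    def __init__(self, box_dim=7, text_dim=256, hidden=256, num_bands=8):
        super().__init__()
        feat = 2 * num_bands
        self.box_proj = nn.Linear(box_dim * feat, hidden)
        self.heading_proj = nn.Linear(1 * feat, hidden)
        self.id_proj = nn.Linear(1 * feat, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.SiLU(), nn.Linear(hidden, hidden))

    def forward(self, boxes, headings, ids, caption_emb):
        # boxes: (N, box_dim), headings: (N, 1), ids: (N, 1), caption_emb: (N, text_dim)
        x = (self.box_proj(fourier_features(boxes))
             + self.heading_proj(fourier_features(headings))
             + self.id_proj(fourier_features(ids))
             + self.text_proj(caption_emb))
        return self.mlp(x)                               # (N, hidden) box-control tokens
```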

2.2 Core Diffusion and ControlNet Integration

  • STDiT Backbone applies self- and cross-attention over time, modalities, and spatial grids. Cross-attention aggregates text and layout control ($k = [\beta^{box}; \beta^{text}]$).
  • ControlNet Transformer: A trainable copy of early STDiT blocks injects road layout control, with outputs added to the main generative pathway through zero-initialized convolutions, ensuring unbiased gradient flow at initialization.
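
The following sketch illustrates this zero-initialized injection pattern; the block interfaces and the use of linear projections in place of zero convolutions are assumptions made for illustration only.

```python
import copy
import torch.nn as nn

class ControlNetBranch(nn.Module):
    """Trainable copy of the early STDiT blocks whose outputs are added to the
    main pathway through zero-initialized projections, so the branch contributes
    nothing at initialization and does not bias early gradients."""
    def __init__(self, stdit_blocks, hidden_dim):
        super().__init__()
        self.blocks = copy.deepcopy(stdit_blocks)            # trainable copy
        self.zero_proj = nn.ModuleList(
            nn.Linear(hidden_dim, hidden_dim) for _ in stdit_blocks
        )
        for proj in self.zero_proj:                           # zero-init ("zero conv" analogue)
            nn.init.zeros_(proj.weight)
            nn.init.zeros_(proj.bias)

    def forward(self, control_tokens, main_features):
        # control_tokens: encoded road layout; main_features: per-block features of the main path.
        outputs, h = [], control_tokens
        for block, proj, feat in zip(self.blocks, self.zero_proj, main_features):
            h = block(h + feat)                               # condition on main-path features
            outputs.append(proj(h))                           # residual injected into the main path
        return outputs
```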

2.3 Multi-View Spatial Attention

Spatial consistency across camera views is enforced by a parameter-free mechanism: latents are reshaped to $[B \cdot T,\ V \cdot S,\ C]$, enabling attention to operate across all views at each timestamp, mixing spatial information and maintaining global coherence without additional model parameters.
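
A minimal sketch of this view-folding trick, assuming latents laid out as (B, T, V, S, C) and an existing spatial self-attention module `attn`, is:

```python
import torch

def multi_view_spatial_attention(z, attn):
    """Parameter-free cross-view mixing: fold views into the spatial axis so a
    standard spatial attention layer attends over all cameras at each timestep.

    z: latents of shape (B, T, V, S, C); attn: a self-attention module mapping
    (batch, tokens, C) -> (batch, tokens, C)."""
    B, T, V, S, C = z.shape
    z = z.reshape(B * T, V * S, C)      # [B*T, V*S, C]: all views share one token grid
    z = attn(z)                         # spatial attention now spans every camera view
    return z.reshape(B, T, V, S, C)
```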

2.4 Classifier-Free Guidance (CFG)

CFG operates by randomly dropping each condition (box, road, text), or all simultaneously, with $p = 0.05$ during training. At inference, denoised predictions are linearly combined:

$$\begin{aligned}
\hat{G}(z, e^{fore}, e^{back}, c) = G(z, \phi, \phi, \phi)
&+ \lambda_{text}\,[G(z, \phi, \phi, c) - G(z, \phi, \phi, \phi)] \\
&+ \lambda_{back}\,[G(z, \phi, e^{back}, c) - G(z, \phi, \phi, c)] \\
&+ \lambda_{fore}\,[G(z, e^{fore}, e^{back}, c) - G(z, \phi, e^{back}, c)]
\end{aligned}$$

with $(\lambda_{text}, \lambda_{back}, \lambda_{fore}) = (7.0, 2.0, 2.0)$.
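
A minimal sketch of this guided combination, assuming a denoiser callable `G(z, e_fore, e_back, c)` in which `None` stands for a dropped condition $\phi$ (the interface is illustrative, not the released code):

```python
def cfg_denoise(G, z, e_fore, e_back, c, lam_text=7.0, lam_back=2.0, lam_fore=2.0):
    """Compose the guided prediction from four conditional passes of the
    denoiser G, following the nested CFG formula above."""
    uncond = G(z, None, None, None)        # G(z, phi, phi, phi)
    text   = G(z, None, None, c)           # text only
    back   = G(z, None, e_back, c)         # text + road layout
    full   = G(z, e_fore, e_back, c)       # text + road + boxes
    return (uncond
            + lam_text * (text - uncond)
            + lam_back * (back - text)
            + lam_fore * (full - back))
```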

3. Diffusion Process and Training Objective

3.1 Forward and Reverse Processes

  • Forward Noising: At each timestep $t$, noise is added according to:

$$q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1 - \beta_t}\, x_{t-1}, \beta_t I), \qquad q(x_t \mid x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t}\, x_0, (1 - \bar{\alpha}_t) I)$$

with $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.

  • Reverse Denoising: The generative process is parameterized as:

$$p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t, c), \Sigma_t I)$$

3.2 Objective Function

Training minimizes the standard denoising score matching loss:

$$L_{simple} = \mathbb{E}_{x_0, \epsilon, t} \left[ \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\ t,\ c\right) \right\|^2 \right]$$

No additional explicit KL-divergence or layout-specific losses are introduced. Temporal consistency arises from STDiT’s cross-frame attention. Spatial alignment emerges from architectural constraints and conditioning; no explicit $L_{layout}$ term is used.
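
For concreteness, a minimal training-step sketch under these definitions is shown below; the denoiser signature `eps_model(x_t, t, cond)` and the precomputed noise-schedule tensor are assumptions, not the released implementation.

```python
import torch

def diffusion_training_step(eps_model, x0, cond, alphas_cumprod):
    """One step of the denoising objective L_simple: sample t and eps, form the
    noised latent x_t in closed form, and regress the injected noise."""
    B = x0.shape[0]
    t = torch.randint(0, len(alphas_cumprod), (B,), device=x0.device)
    a_bar = alphas_cumprod[t].view(B, *([1] * (x0.dim() - 1)))     # broadcast over latent dims
    eps = torch.randn_like(x0)
    x_t = torch.sqrt(a_bar) * x0 + torch.sqrt(1.0 - a_bar) * eps   # q(x_t | x_0)
    eps_pred = eps_model(x_t, t, cond)                             # eps_theta(x_t, t, c)
    return torch.mean((eps - eps_pred) ** 2)                       # || eps - eps_theta ||^2
```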

4. Training Data, Hyperparameters, and Implementation

  • Datasets:
    • nuScenes: 700 train, 150 val, 6-view, 12 Hz, 20 s clips.
    • In-house: 3M train, 0.6M val, 6-view, 10 Hz, 15 s clips, with 36% lane-changes.
    • All frames resized to $512 \times 512$, using 16-frame clips.
  • Base Model: OpenSora 1.1 checkpoint; single-view fine-tuning (30k steps) followed by multi-view fine-tuning (25k steps); batch size 16; HybridAdam optimizer (learning rate $2 \times 10^{-5}$).
  • Classifier-Free Drop Rates: 5% per condition, 5% all dropped.
  • Inference: Rectified flow sampling, 30 steps, ~4 s/sample on an A800 GPU.
  • Compute: Training on 8 × A800 GPUs for ~72 hours.
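
The reported settings can be summarized in a single configuration sketch; the key names are illustrative, while the values are those listed above.

```python
DRIVESORA_TRAIN_CONFIG = {
    "base_checkpoint": "OpenSora-1.1",
    "resolution": (512, 512),
    "clip_length": 16,                                        # frames per training clip
    "finetune_steps": {"single_view": 30_000, "multi_view": 25_000},
    "batch_size": 16,
    "optimizer": "HybridAdam",
    "learning_rate": 2e-5,
    "cfg_dropout": {"per_condition": 0.05, "all_conditions": 0.05},
    "sampler": {"type": "rectified_flow", "steps": 30},
    "hardware": "8 x A800",
}
```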

5. Integration with Agentic Self-Correction

DriveSora is embedded in the CorrectAD pipeline, forming the generative “Data Department.” The PM-Agent, leveraging GPT-4o/VLM, classifies failures (foreground, background, weather) and generates multimodal requirements $R = \{(c_i, e_i)\}$, which are then passed to DriveSora. Top-$K$ examples matching the textual description are retrieved from $D^{train}$ to refine the conditioning inputs.

DriveSora generates videos $X^{gen}_i$ that contain, by construction, the intended 3D bounding boxes, map layouts, and scene semantics, eliminating the need for auxiliary annotation. These synthetic examples are merged with the training corpus to fine-tune any E2E planner $F$, such as UniAD, VAD, or proprietary models. The CorrectAD loop iteratively detects new failures and generates targeted data, measurably improving planner robustness (Ma et al., 17 Nov 2025).
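
A schematic of one CorrectAD round as it involves DriveSora is sketched below; every helper interface (`pm_agent`, `drivesora.generate`, `trainer.finetune`, `planner.horizon`) is hypothetical and merely stands in for components described above, and `mine_failure_cases` refers to the earlier sketch.

```python
def correctad_iteration(planner, D_train, pm_agent, drivesora, trainer, eps=0.5, top_k=5):
    """One round of agentic self-correction: mine failures, let the PM-Agent draft
    requirements, synthesize targeted videos with DriveSora, and fine-tune the planner."""
    D_fail = mine_failure_cases(D_train, horizon=planner.horizon, eps=eps)

    D_gen = []
    for failure in D_fail:
        # PM-Agent classifies the failure and emits requirements R = (caption, 3D layout).
        requirements = pm_agent.analyze(failure)
        for caption, layout in requirements:
            # Retrieve top-K similar training scenes to refine the conditioning inputs.
            references = pm_agent.retrieve_similar(D_train, caption, k=top_k)
            # Generated videos are annotated by construction (boxes, maps, semantics).
            video = drivesora.generate(caption=caption, layout=layout, references=references)
            D_gen.append((video, layout))

    # Merge synthetic data with the original corpus and fine-tune the E2E planner F.
    return trainer.finetune(planner, D_train + D_gen)
```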

6. Quantitative and Qualitative Performance

6.1 End-to-End Planner Improvements

On nuScenes with UniAD initialization:

  • L2 (Avg): from 1.02 m (AIDE) to 0.98 m (CorrectAD), a 4% reduction
  • Collision (Avg): from 0.28% (AIDE) to 0.19% (CorrectAD), a 32% reduction

On the in-house planner:

  • L2 (Avg): from 0.85 m (baseline) to 0.62 m (CorrectAD), a 27% reduction
  • Hit Rate (Avg): from 0.77 to 0.82 (+5%)

6.2 Video Generation Metrics (nuScenes val set)

Generator        FID ↓    CLIP ↑   FVD ↓    NDS ↑
MagicDrive-v2    20.91    85.25    94.84    35.79
Panacea          16.96    84.23    139.0    32.10
DriveSora        15.08    86.73    94.51    36.58

DriveSora demonstrates lower FID and FVD, and higher CLIP and NDS scores, indicating improved perceptual and compositional quality over state-of-the-art baselines.

6.3 Ablation and Qualitative Analyses

  • Combining PM-Agent and DriveSora yields maximal gains (e.g., L2=0.98 m, Collision=0.19%).
  • Multi-view spatial attention and multimodal prompting are critical for image quality.
  • CFG, with adaptive conditional dropout, is necessary for optimal metric performance.
  • Generator swap experiments confirm DriveSora’s superiority over Panacea for both video quality and downstream planning metrics.
  • Multiple CorrectAD iterations progressively close the performance gap on failure distributions.

Qualitatively, DriveSora achieves superior spatiotemporal consistency, multi-view coherence, and precise instance/weather editing.

7. Limitations and Prospects

Current scope is limited to collision-type failures; planned extensions encompass lane violations and traffic infractions via richer benchmarks such as Bench2Drive and NAVSIM. DriveSora’s model size (1.1B parameters) and inference latency (~4 s per frame) preclude on-demand data generation for some applications. Potential enhancements include lightweight samplers (e.g., SANA) and distillation into student models. Integration with the closed-loop simulator NAVSIM yields additional PDM Score (PDMS) gains (+0.9). The conditional generation scheme is suggestive of broader applicability to other modalities (e.g., LiDAR, radar) within a generalized “OmniGen” framework (Ma et al., 17 Nov 2025).
