DriveSora: Diffusion Model for Autonomous Driving Videos
- DriveSora is a diffusion-based video generation model that produces spatiotemporally consistent, multi-view driving videos using structured 3D scene layouts.
- It integrates a Spatial-Temporal Diffusion Transformer with ControlNet to conditionally synthesize realistic training data for addressing long-tail failure modes in autonomous driving.
- DriveSora enhances safety-critical planning by reducing collision rates and L2 errors, outperforming conventional retrieval- and augmentation-based methods.
DriveSora is a diffusion-based video generation model designed to produce spatiotemporally consistent, multi-view driving videos conditioned on structured 3D scene layouts, with the principal application of addressing long-tail failure modes in autonomous driving end-to-end (E2E) planning systems. It is the generative core of the CorrectAD pipeline, facilitating fully automated, controllable training data synthesis to correct rare, safety-critical errors that conventional retrieval- or augmentation-based methods cannot effectively address (Ma et al., 17 Nov 2025).
1. Motivation and Problem Setting
Data-driven E2E planners for autonomous vehicles, such as UniAD and VAD, exhibit pronounced brittleness in rare “long-tail” scenarios—characterized by low visibility, dense traffic, and corner cases—leading to safety violations including egocentric collisions and infractions. These failure cases are formalized as the set of scenarios whose safety cost exceeds a threshold,

$$\mathcal{F} = \{\, x \mid \mathcal{C}(\pi(x)) > \epsilon \,\},$$

with $\epsilon$ as a safety threshold, $\pi(x)$ the planner’s output for scenario $x$, and $\mathcal{C}$ a safety cost such as a collision indicator or L2 trajectory error. Conventional retrieval-based approaches (e.g., AIDE) rely solely on existing data, lacking the ability to synthesize novel conditions or exercise fine-grained control. DriveSora addresses this limitation by enabling targeted, high-fidelity video generation precisely aligned with the underrepresented or hazardous scenarios identified by a “PM-Agent.”
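In practice, this criterion amounts to a filtering step over planner rollouts. A minimal sketch, with all names (`PlannerOutput`, `mine_failures`, the default threshold) purely illustrative rather than taken from the paper:

```python
# Minimal sketch of failure-case mining: scenarios whose planning error or
# collision indicator exceeds a safety threshold are flagged for correction.
from dataclasses import dataclass

@dataclass
class PlannerOutput:
    scene_id: str
    l2_error: float   # average L2 distance to expert trajectory (m)
    collided: bool    # egocentric collision during rollout

def mine_failures(outputs: list[PlannerOutput], epsilon: float = 1.0) -> list[str]:
    """Return scene IDs where the planner violates the safety criterion."""
    return [o.scene_id for o in outputs if o.collided or o.l2_error > epsilon]
```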
Within CorrectAD, DriveSora’s role is to synthesize multi-view video data $V$ precisely aligned with the semantic and geometric requirements $R = (c, L)$—a scene caption $c$ and structured 3D layout $L$—supplied by the PM-Agent. This enables direct, iterative self-correction of E2E planners by expanding the effective support of the training distribution with diverse, realistic, and annotated examples tailored to observed failure modes (Ma et al., 17 Nov 2025).
2. Model Architecture and Conditional Generation
DriveSora extends the Spatial-Temporal Diffusion Transformer (STDiT) architecture to multi-view, structured-layout-conditioned video synthesis, supporting precise control over scene content at both the semantic and pixel levels.
2.1 Conditioning Pipeline
- Semantic Input: The scene caption $c$ (produced by a Vision-LLM) is embedded via a T5 encoder into text tokens.
- 3D Layout Encoding:
- Foreground: Instances are encoded as tuples of 3D bounding box, heading, instance ID, and dense caption; box geometry is lifted with Fourier features and mapped by MLPs to per-instance embedding vectors (see the encoder sketch after this list).
- Background: Rasterized colored road maps are encoded by a variational autoencoder into a compact road-layout latent.
- Noise Latents: Generation starts from Gaussian noise latents $z_T \sim \mathcal{N}(0, I)$ in the video latent space.
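A minimal sketch of the foreground box encoder described above, assuming 8-corner box coordinates plus a heading angle as input; the dimensions and the omission of instance-ID/caption embeddings are illustrative simplifications:

```python
import torch
import torch.nn as nn

class FourierBoxEncoder(nn.Module):
    """Lift 3D box geometry with Fourier features, then project via an MLP
    into the token space shared with the text embeddings. Instance-ID and
    dense-caption embeddings (also used by DriveSora) are omitted for brevity."""
    def __init__(self, num_freqs: int = 8, in_dim: int = 8 * 3 + 1, d_model: int = 1152):
        super().__init__()
        self.num_freqs = num_freqs
        fourier_dim = in_dim * 2 * num_freqs  # sin and cos per frequency band
        self.mlp = nn.Sequential(
            nn.Linear(fourier_dim, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )

    def fourier(self, x: torch.Tensor) -> torch.Tensor:
        freqs = 2.0 ** torch.arange(self.num_freqs, device=x.device) * torch.pi
        ang = x[..., None] * freqs                 # (..., in_dim, num_freqs)
        return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(-2)

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (N, 25) = 8 corners x 3 coordinates + heading, per instance
        return self.mlp(self.fourier(boxes))
```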
2.2 Core Diffusion and ControlNet Integration
- STDiT Backbone: Applies self- and cross-attention over time, modalities, and spatial grids; cross-attention aggregates the text and layout control tokens.
- ControlNet Transformer: A trainable copy of early STDiT blocks injects road layout control, with outputs added to the main generative pathway through zero-initialized convolutions, ensuring unbiased gradient flow at initialization.
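A minimal sketch of this injection pattern, assuming backbone blocks that map a token tensor to a token tensor (real STDiT blocks also take timestep and text inputs); the zero-initialized linear projection stands in for the zero convolution on token sequences:

```python
import copy
import torch
import torch.nn as nn

def zero_module(m: nn.Module) -> nn.Module:
    """Zero-initialize all parameters so the control branch contributes
    nothing at initialization, leaving the pretrained pathway unbiased."""
    for p in m.parameters():
        nn.init.zeros_(p)
    return m

class ControlBranch(nn.Module):
    """Trainable copy of the early backbone blocks; its output is added to
    the main generative pathway through a zero-initialized projection."""
    def __init__(self, backbone_blocks: nn.ModuleList, d_model: int, n_copy: int = 4):
        super().__init__()
        self.blocks = copy.deepcopy(backbone_blocks[:n_copy])  # trainable copy
        self.zero_proj = zero_module(nn.Linear(d_model, d_model))

    def forward(self, road_latent: torch.Tensor, main_hidden: torch.Tensor) -> torch.Tensor:
        h = road_latent
        for blk in self.blocks:
            h = blk(h)
        return main_hidden + self.zero_proj(h)  # zero contribution at step 0
```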
2.3 Multi-View Spatial Attention
Spatial consistency across camera views is enforced by a parameter-free mechanism: latents from all views at the same timestamp are reshaped so that the view axis folds into the spatial token axis, enabling attention to operate across all views at each timestep, mixing spatial information and maintaining global coherence without additional model parameters.
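Since the mechanism is just a tensor reshape, it can be sketched directly; the latent layout (B, V, T, S, C) is an assumed convention, not necessarily the released code’s:

```python
import torch

def multiview_tokens(z: torch.Tensor) -> torch.Tensor:
    """Parameter-free multi-view mixing: fold the view axis into the spatial
    token axis so self-attention at each timestep sees all cameras at once.
    z: (B, V, T, S, C) -> (B*T, V*S, C)."""
    B, V, T, S, C = z.shape
    z = z.permute(0, 2, 1, 3, 4)        # (B, T, V, S, C)
    return z.reshape(B * T, V * S, C)   # attention now spans V*S tokens
```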
2.4 Classifier-Free Guidance (CFG)
CFG operates by randomly dropping each condition (box, road, text), or all conditions simultaneously, with a fixed probability during training. At inference, the denoised predictions are linearly combined,

$$\hat{\epsilon} = \epsilon_\theta(z_t, \varnothing) + \sum_{i \in \{\text{box},\, \text{road},\, \text{text}\}} w_i \big( \epsilon_\theta(z_t, c_i) - \epsilon_\theta(z_t, \varnothing) \big),$$

with $w_i$ the per-condition guidance scales.
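A minimal sketch of this combination; the guidance-scale values in the usage comment are chosen for illustration only:

```python
import torch

def cfg_combine(eps_uncond: torch.Tensor,
                eps_cond: dict[str, torch.Tensor],
                scales: dict[str, float]) -> torch.Tensor:
    """Multi-condition classifier-free guidance: shift the unconditional
    prediction along each condition's direction (box, road, text)."""
    out = eps_uncond.clone()
    for name, eps_c in eps_cond.items():
        out = out + scales[name] * (eps_c - eps_uncond)
    return out

# Illustrative usage (scales are not the paper's values):
# eps_hat = cfg_combine(eps_u, {"text": e_t, "box": e_b, "road": e_r},
#                       {"text": 7.5, "box": 2.0, "road": 2.0})
```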
3. Diffusion Process and Training Objective
3.1 Forward and Reverse Processes
- Forward Noising: At each timestep $t$, noise is added according to
$$q(z_t \mid z_0) = \mathcal{N}\big(z_t;\ \sqrt{\bar{\alpha}_t}\, z_0,\ (1 - \bar{\alpha}_t) I\big),$$
with $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
- Reverse Denoising: The generative process is parameterized as
$$p_\theta(z_{t-1} \mid z_t, c) = \mathcal{N}\big(z_{t-1};\ \mu_\theta(z_t, t, c),\ \Sigma_t\big),$$
conditioned on the caption and layout controls $c$.
3.2 Objective Function
Training minimizes the standard denoising score-matching loss,
$$\mathcal{L} = \mathbb{E}_{z_0, \epsilon, t, c}\Big[ \big\| \epsilon - \epsilon_\theta(z_t, t, c) \big\|_2^2 \Big].$$
No additional explicit KL-divergence or layout-specific losses are introduced. Temporal consistency arises from STDiT’s cross-frame attention, and spatial alignment emerges from architectural constraints and conditioning; no explicit alignment loss term is used.
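A minimal sketch of one training step under this objective, using standard linear-beta DDPM constants for illustration (the actual noise schedule and the model signature are assumptions):

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, z0: torch.Tensor, cond, T: int = 1000) -> torch.Tensor:
    """One denoising score-matching step: sample t, apply forward noising,
    and regress the conditional denoiser onto the injected noise."""
    betas = torch.linspace(1e-4, 2e-2, T, device=z0.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)

    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    ab = alpha_bar[t].view(-1, *([1] * (z0.dim() - 1)))
    z_t = ab.sqrt() * z0 + (1.0 - ab).sqrt() * eps   # forward noising q(z_t | z_0)

    eps_pred = model(z_t, t, cond)                   # epsilon_theta(z_t, t, c)
    return F.mse_loss(eps_pred, eps)                 # ||eps - eps_theta||^2
```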
4. Training Data, Hyperparameters, and Implementation
- Datasets:
- nuScenes: 700 training / 150 validation scenes, 6 camera views, 12 Hz, 20 s clips.
- In-house: 3M train, 0.6M val, 6-view, 10 Hz, 15 s clips, with 36% lane-changes.
- All frames are resized to a fixed resolution, using 16-frame clips.
- Base Model: OpenSora 1.1 checkpoint; single-view finetuning (30k steps) followed by multi-view finetuning (25k steps), batch size 16, HybridAdam optimizer.
- Classifier-Free Drop Rates: 5% per condition, 5% all dropped.
- Inference: Rectified-flow sampling, 30 steps, ~4 s per sample on an A800 GPU.
- Compute: Training on 8 × A800 GPUs for 72 hours.
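For concreteness, the reported settings can be collected into a configuration sketch; the key names below are hypothetical and do not reflect OpenSora’s actual config schema:

```python
# Hypothetical configuration mirroring the reported hyperparameters.
train_config = {
    "base_checkpoint": "OpenSora-1.1",
    "stages": [
        {"name": "single_view_finetune", "steps": 30_000},
        {"name": "multi_view_finetune", "steps": 25_000},
    ],
    "batch_size": 16,
    "optimizer": "HybridAdam",
    "clip_frames": 16,
    "cfg_drop_rate_per_condition": 0.05,
    "cfg_drop_rate_all": 0.05,
    "inference": {"sampler": "rectified_flow", "steps": 30},
    "hardware": "8 x A800, ~72 h",
}
```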
5. Integration with Agentic Self-Correction
DriveSora is embedded in the CorrectAD pipeline, forming the generative “Data Department.” The PM-Agent, leveraging GPT-4o/VLM, classifies failures (foreground, background, weather) and generates multimodal requirements—a scene caption and structured 3D layout—which are then passed to DriveSora. Top-$k$ examples matching the textual description are retrieved from the training corpus to refine the conditioning inputs.
DriveSora generates videos that contain, by construction, the intended 3D bounding boxes, map layouts, and scene semantics—eliminating the need for auxiliary annotation. These synthetic examples are merged with the training corpus to fine-tune any E2E planner, such as UniAD, VAD, or proprietary models. The CorrectAD loop iteratively detects new failures and generates targeted data, measurably improving planner robustness (Ma et al., 17 Nov 2025).
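A minimal sketch of this loop, with all components as placeholder callables rather than the paper’s actual interfaces:

```python
def correct_ad_loop(planner, pm_agent, drivesora, train_set, val_set, rounds: int = 3):
    """Iterative self-correction: mine failures, turn them into generation
    requirements via the PM-Agent, synthesize targeted videos with DriveSora,
    and fine-tune the planner on the merged corpus."""
    for _ in range(rounds):
        failures = [s for s in val_set if planner.violates_safety(s)]
        if not failures:
            break                                     # no remaining failure modes
        requirements = pm_agent.analyze(failures)     # captions + 3D layouts
        synthetic = drivesora.generate(requirements)  # annotated by construction
        train_set = train_set + synthetic
        planner.finetune(train_set)
    return planner
```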
6. Quantitative and Qualitative Performance
6.1 End-to-End Planner Improvements
On nuScenes with UniAD initialization:
- L2 (Avg): from 1.02 m (AIDE) to 0.98 m (CorrectAD)
- Collision (Avg): from 0.28% (AIDE) to 0.19% (CorrectAD)
On the in-house planner:
- L2 (Avg): from 0.85 m (baseline) to 0.62 m (CorrectAD),
- Hit Rate (Avg): from 0.77 to 0.82 (higher is better)
6.2 Video Generation Metrics (nuScenes val set)
| Generator | FID ↓ | CLIP ↑ | FVD ↓ | NDS ↑ |
|---|---|---|---|---|
| MagicDrive-v2 | 20.91 | 85.25 | 94.84 | 35.79 |
| Panacea | 16.96 | 84.23 | 139.0 | 32.10 |
| DriveSora | 15.08 | 86.73 | 94.51 | 36.58 |
DriveSora demonstrates lower FID and FVD, and higher CLIP and NDS scores, indicating improved perceptual and compositional quality over state-of-the-art baselines.
6.3 Ablation and Qualitative Analyses
- Combining PM-Agent and DriveSora yields maximal gains (e.g., L2=0.98 m, Collision=0.19%).
- Multi-view spatial attention and multimodal prompting are critical for image quality.
- CFG, with adaptive conditional dropout, is necessary for optimal metric performance.
- Generator swap experiments confirm DriveSora’s superiority over Panacea for both video quality and downstream planning metrics.
- Multiple CorrectAD iterations progressively close the performance gap on failure distributions.
Qualitatively, DriveSora achieves superior spatiotemporal consistency, multi-view coherence, and precise instance/weather editing.
7. Limitations and Prospects
Current scope is limited to collision-type failures; planned extensions encompass lane violations and traffic infractions via richer benchmarks such as Bench2Drive and NAVSIM. DriveSora’s model size (1.1B parameters) and inference latency (~4 s per sample) preclude on-demand data generation for some applications. Potential enhancements include lightweight samplers (e.g., SANA) and distillation into student models. Integration with the closed-loop simulator NAVSIM shows additional PDM Score (PDMS) gains (+0.9). The conditional generation scheme is suggestive of broader applicability to other modalities (e.g., LiDAR, radar) within a generalized “OmniGen” framework (Ma et al., 17 Nov 2025).