DriveSora: Diffusion Model for Autonomous Driving Videos
- DriveSora is a diffusion-based video generation model that produces spatiotemporally consistent, multi-view driving videos using structured 3D scene layouts.
- It integrates a Spatial-Temporal Diffusion Transformer with ControlNet to conditionally synthesize realistic training data for addressing long-tail failure modes in autonomous driving.
- DriveSora enhances safety-critical planning by reducing collision rates and L2 errors, outperforming conventional retrieval- and augmentation-based methods.
DriveSora is a diffusion-based video generation model designed to produce spatiotemporally consistent, multi-view driving videos conditioned on structured 3D scene layouts, with the principal application of addressing long-tail failure modes in autonomous driving end-to-end (E2E) planning systems. It is the generative core of the CorrectAD pipeline, facilitating fully automated, controllable training data synthesis to correct rare, safety-critical errors that conventional retrieval- or augmentation-based methods cannot effectively address (Ma et al., 17 Nov 2025).
1. Motivation and Problem Setting
Data-driven E2E planners for autonomous vehicles, such as UniAD and VAD, exhibit pronounced brittleness in rare “long-tail” scenarios—characterized by low visibility, dense traffic, and corner cases—leading to safety violations including egocentric collisions and infractions. These failure cases are formalized as the set of scenarios whose safety cost exceeds a threshold,

$$\mathcal{F} = \{\, x \mid \mathcal{C}(\pi(x)) > \epsilon \,\},$$

with $\epsilon$ as a safety threshold, $\pi(x)$ the planner’s output for scenario $x$, and $\mathcal{C}$ a safety cost such as a collision indicator or L2 trajectory error. Conventional retrieval-based approaches (e.g., AIDE) rely solely on existing data, lacking the ability to synthesize novel conditions or exercise fine-grained control. DriveSora addresses this limitation by enabling targeted, high-fidelity video generation precisely aligned with the underrepresented or hazardous scenarios identified by a “PM-Agent.”
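In practice, this criterion amounts to a filtering step over planner rollouts. A minimal sketch, with all names (`PlannerOutput`, `mine_failures`, the default threshold) purely illustrative rather than taken from the paper:

```python
# Minimal sketch of failure-case mining: scenarios whose planning error or
# collision indicator exceeds a safety threshold are flagged for correction.
from dataclasses import dataclass

@dataclass
class PlannerOutput:
    scene_id: str
    l2_error: float   # average L2 distance to expert trajectory (m)
    collided: bool    # egocentric collision during rollout

def mine_failures(outputs: list[PlannerOutput], epsilon: float = 1.0) -> list[str]:
    """Return scene IDs where the planner violates the safety criterion."""
    return [o.scene_id for o in outputs if o.collided or o.l2_error > epsilon]
```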
Within CorrectAD, DriveSora’s role is to synthesize multi-view video data $V$ precisely aligned with the semantic and geometric requirements $R = (c, L)$—a scene caption $c$ and structured 3D layout $L$—supplied by the PM-Agent. This enables direct, iterative self-correction of E2E planners by expanding the effective support of the training distribution with diverse, realistic, and annotated examples tailored to observed failure modes (Ma et al., 17 Nov 2025).
2. Model Architecture and Conditional Generation
DriveSora extends the Spatial-Temporal Diffusion Transformer (STDiT) architecture to multi-view, structured-layout-conditioned video synthesis, supporting precise control over scene content at both the semantic and pixel levels.
2.1 Conditioning Pipeline
- Semantic Input: The scene caption $c$ (produced by a Vision-LLM) is embedded via a T5 encoder into text tokens.
- 3D Layout Encoding:
- Foreground: Instances are encoded as tuples of 3D bounding box, heading, instance ID, and dense caption; box geometry is lifted with Fourier features and mapped by MLPs to per-instance embedding vectors (see the encoder sketch after this list).
- Background: Rasterized colored road maps are encoded by a variational autoencoder into a compact road-layout latent.
- Noise Latents: Generation starts from Gaussian noise latents $z_T \sim \mathcal{N}(0, I)$ in the video latent space.
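A minimal sketch of the foreground box encoder described above, assuming 8-corner box coordinates plus a heading angle as input; the dimensions and the omission of instance-ID/caption embeddings are illustrative simplifications:

```python
import torch
import torch.nn as nn

class FourierBoxEncoder(nn.Module):
    """Lift 3D box geometry with Fourier features, then project via an MLP
    into the token space shared with the text embeddings. Instance-ID and
    dense-caption embeddings (also used by DriveSora) are omitted for brevity."""
    def __init__(self, num_freqs: int = 8, in_dim: int = 8 * 3 + 1, d_model: int = 1152):
        super().__init__()
        self.num_freqs = num_freqs
        fourier_dim = in_dim * 2 * num_freqs  # sin and cos per frequency band
        self.mlp = nn.Sequential(
            nn.Linear(fourier_dim, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )

    def fourier(self, x: torch.Tensor) -> torch.Tensor:
        freqs = 2.0 ** torch.arange(self.num_freqs, device=x.device) * torch.pi
        ang = x[..., None] * freqs                 # (..., in_dim, num_freqs)
        return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(-2)

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (N, 25) = 8 corners x 3 coordinates + heading, per instance
        return self.mlp(self.fourier(boxes))
```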
2.2 Core Diffusion and ControlNet Integration
- STDiT Backbone: Applies self- and cross-attention over time, modalities, and spatial grids; cross-attention aggregates the text and layout control tokens.
- ControlNet Transformer: A trainable copy of early STDiT blocks injects road layout control, with outputs added to the main generative pathway through zero-initialized convolutions, ensuring unbiased gradient flow at initialization.
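A minimal sketch of this injection pattern, assuming backbone blocks that map a token tensor to a token tensor (real STDiT blocks also take timestep and text inputs); the zero-initialized linear projection stands in for the zero convolution on token sequences:

```python
import copy
import torch
import torch.nn as nn

def zero_module(m: nn.Module) -> nn.Module:
    """Zero-initialize all parameters so the control branch contributes
    nothing at initialization, leaving the pretrained pathway unbiased."""
    for p in m.parameters():
        nn.init.zeros_(p)
    return m

class ControlBranch(nn.Module):
    """Trainable copy of the early backbone blocks; its output is added to
    the main generative pathway through a zero-initialized projection."""
    def __init__(self, backbone_blocks: nn.ModuleList, d_model: int, n_copy: int = 4):
        super().__init__()
        self.blocks = copy.deepcopy(backbone_blocks[:n_copy])  # trainable copy
        self.zero_proj = zero_module(nn.Linear(d_model, d_model))

    def forward(self, road_latent: torch.Tensor, main_hidden: torch.Tensor) -> torch.Tensor:
        h = road_latent
        for blk in self.blocks:
            h = blk(h)
        return main_hidden + self.zero_proj(h)  # zero contribution at step 0
```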
2.3 Multi-View Spatial Attention
Spatial consistency across camera views is enforced by a parameter-free mechanism: latents from all views at the same timestamp are reshaped so that the view axis folds into the spatial token axis, enabling attention to operate across all views at each timestep, mixing spatial information and maintaining global coherence without additional model parameters.
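Since the mechanism is just a tensor reshape, it can be sketched directly; the latent layout (B, V, T, S, C) is an assumed convention, not necessarily the released code’s:

```python
import torch

def multiview_tokens(z: torch.Tensor) -> torch.Tensor:
    """Parameter-free multi-view mixing: fold the view axis into the spatial
    token axis so self-attention at each timestep sees all cameras at once.
    z: (B, V, T, S, C) -> (B*T, V*S, C)."""
    B, V, T, S, C = z.shape
    z = z.permute(0, 2, 1, 3, 4)        # (B, T, V, S, C)
    return z.reshape(B * T, V * S, C)   # attention now spans V*S tokens
```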
2.4 Classifier-Free Guidance (CFG)
CFG operates by randomly dropping each condition (box, road, text), or all conditions simultaneously, with a fixed probability during training. At inference, the denoised predictions are linearly combined,

$$\hat{\epsilon} = \epsilon_\theta(z_t, \varnothing) + \sum_{i \in \{\text{box},\, \text{road},\, \text{text}\}} w_i \big( \epsilon_\theta(z_t, c_i) - \epsilon_\theta(z_t, \varnothing) \big),$$

with $w_i$ the per-condition guidance scales.
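A minimal sketch of this combination; the guidance-scale values in the usage comment are chosen for illustration only:

```python
import torch

def cfg_combine(eps_uncond: torch.Tensor,
                eps_cond: dict[str, torch.Tensor],
                scales: dict[str, float]) -> torch.Tensor:
    """Multi-condition classifier-free guidance: shift the unconditional
    prediction along each condition's direction (box, road, text)."""
    out = eps_uncond.clone()
    for name, eps_c in eps_cond.items():
        out = out + scales[name] * (eps_c - eps_uncond)
    return out

# Illustrative usage (scales are not the paper's values):
# eps_hat = cfg_combine(eps_u, {"text": e_t, "box": e_b, "road": e_r},
#                       {"text": 7.5, "box": 2.0, "road": 2.0})
```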
3. Diffusion Process and Training Objective
3.1 Forward and Reverse Processes
- Forward Noising: At each timestep $t$, noise is added according to
$$q(z_t \mid z_0) = \mathcal{N}\big(z_t;\ \sqrt{\bar{\alpha}_t}\, z_0,\ (1 - \bar{\alpha}_t) I\big),$$
with $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$.
- Reverse Denoising: The generative process is parameterized as
$$p_\theta(z_{t-1} \mid z_t, c) = \mathcal{N}\big(z_{t-1};\ \mu_\theta(z_t, t, c),\ \Sigma_t\big),$$
conditioned on the caption and layout controls $c$.
3.2 Objective Function
Training minimizes the standard denoising score-matching loss,
$$\mathcal{L} = \mathbb{E}_{z_0, \epsilon, t, c}\Big[ \big\| \epsilon - \epsilon_\theta(z_t, t, c) \big\|_2^2 \Big].$$
No additional explicit KL-divergence or layout-specific losses are introduced. Temporal consistency arises from STDiT’s cross-frame attention, and spatial alignment emerges from architectural constraints and conditioning; no explicit alignment loss term is used.
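A minimal sketch of one training step under this objective, using standard linear-beta DDPM constants for illustration (the actual noise schedule and the model signature are assumptions):

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(model, z0: torch.Tensor, cond, T: int = 1000) -> torch.Tensor:
    """One denoising score-matching step: sample t, apply forward noising,
    and regress the conditional denoiser onto the injected noise."""
    betas = torch.linspace(1e-4, 2e-2, T, device=z0.device)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)

    t = torch.randint(0, T, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    ab = alpha_bar[t].view(-1, *([1] * (z0.dim() - 1)))
    z_t = ab.sqrt() * z0 + (1.0 - ab).sqrt() * eps   # forward noising q(z_t | z_0)

    eps_pred = model(z_t, t, cond)                   # epsilon_theta(z_t, t, c)
    return F.mse_loss(eps_pred, eps)                 # ||eps - eps_theta||^2
```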
4. Training Data, Hyperparameters, and Implementation
- Datasets:
- nuScenes: 700 training / 150 validation scenes, 6 camera views, 12 Hz, 20 s clips.
- In-house: 3M train, 0.6M val, 6-view, 10 Hz, 15 s clips, with 36% lane-changes.
- All frames are resized to a fixed resolution, using 16-frame clips.
- Base Model: OpenSora 1.1 checkpoint; single-view finetuning (30k steps) followed by multi-view finetuning (25k steps), batch size 16, HybridAdam optimizer.
- Classifier-Free Drop Rates: 5% per condition, 5% all dropped.
- Inference: Rectified-flow sampling, 30 steps, ~4 s per sample on an A800 GPU.
- Compute: Training on 8 × A800 GPUs for 72 hours.
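For concreteness, the reported settings can be collected into a configuration sketch; the key names below are hypothetical and do not reflect OpenSora’s actual config schema:

```python
# Hypothetical configuration mirroring the reported hyperparameters.
train_config = {
    "base_checkpoint": "OpenSora-1.1",
    "stages": [
        {"name": "single_view_finetune", "steps": 30_000},
        {"name": "multi_view_finetune", "steps": 25_000},
    ],
    "batch_size": 16,
    "optimizer": "HybridAdam",
    "clip_frames": 16,
    "cfg_drop_rate_per_condition": 0.05,
    "cfg_drop_rate_all": 0.05,
    "inference": {"sampler": "rectified_flow", "steps": 30},
    "hardware": "8 x A800, ~72 h",
}
```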
5. Integration with Agentic Self-Correction
DriveSora is embedded in the CorrectAD pipeline, forming the generative “Data Department.” The PM-Agent, leveraging GPT-4o/VLM, classifies failures (foreground, background, weather) and generates multimodal requirements—a scene caption and structured 3D layout—which are then passed to DriveSora. Top-$k$ examples matching the textual description are retrieved from the training corpus to refine the conditioning inputs.
DriveSora generates videos that contain, by construction, the intended 3D bounding boxes, map layouts, and scene semantics—eliminating the need for auxiliary annotation. These synthetic examples are merged with the training corpus to fine-tune any E2E planner, such as UniAD, VAD, or proprietary models. The CorrectAD loop iteratively detects new failures and generates targeted data, measurably improving planner robustness (Ma et al., 17 Nov 2025).
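A minimal sketch of this loop, with all components as placeholder callables rather than the paper’s actual interfaces:

```python
def correct_ad_loop(planner, pm_agent, drivesora, train_set, val_set, rounds: int = 3):
    """Iterative self-correction: mine failures, turn them into generation
    requirements via the PM-Agent, synthesize targeted videos with DriveSora,
    and fine-tune the planner on the merged corpus."""
    for _ in range(rounds):
        failures = [s for s in val_set if planner.violates_safety(s)]
        if not failures:
            break                                     # no remaining failure modes
        requirements = pm_agent.analyze(failures)     # captions + 3D layouts
        synthetic = drivesora.generate(requirements)  # annotated by construction
        train_set = train_set + synthetic
        planner.finetune(train_set)
    return planner
```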
6. Quantitative and Qualitative Performance
6.1 End-to-End Planner Improvements
On nuScenes with UniAD initialization:
- L2 (Avg): from 1.02 m (AIDE) to 0.98 m (CorrectAD)
- Collision (Avg): from 0.28% (AIDE) to 0.19% (CorrectAD)
On the in-house planner:
- L2 (Avg): from 0.85 m (baseline) to 0.62 m (CorrectAD),
- Hit Rate (Avg): from 0.77 to 0.82 (higher is better)
6.2 Video Generation Metrics (nuScenes val set)
| Generator | FID ↓ | CLIP ↑ | FVD ↓ | NDS ↑ |
|---|---|---|---|---|
| MagicDrive-v2 | 20.91 | 85.25 | 94.84 | 35.79 |
| Panacea | 16.96 | 84.23 | 139.0 | 32.10 |
| DriveSora | 15.08 | 86.73 | 94.51 | 36.58 |
DriveSora demonstrates lower FID and FVD, and higher CLIP and NDS scores, indicating improved perceptual and compositional quality over state-of-the-art baselines.
6.3 Ablation and Qualitative Analyses
- Combining PM-Agent and DriveSora yields maximal gains (e.g., L2=0.98 m, Collision=0.19%).
- Multi-view spatial attention and multimodal prompting are critical for image quality.
- CFG, with adaptive conditional dropout, is necessary for optimal metric performance.
- Generator swap experiments confirm DriveSora’s superiority over Panacea for both video quality and downstream planning metrics.
- Multiple CorrectAD iterations progressively close the performance gap on failure distributions.
Qualitatively, DriveSora achieves superior spatiotemporal consistency, multi-view coherence, and precise instance/weather editing.
7. Limitations and Prospects
Current scope is limited to collision-type failures; planned extensions encompass lane violations and traffic infractions via richer benchmarks such as Bench2Drive and NAVSIM. DriveSora’s model size (1.1B parameters) and inference latency (~4 s per sample) preclude on-demand data generation for some applications. Potential enhancements include lightweight samplers (e.g., SANA) and distillation into student models. Integration with the closed-loop simulator NAVSIM shows additional PDM Score (PDMS) gains (+0.9). The conditional generation scheme is suggestive of broader applicability to other modalities (e.g., LiDAR, radar) within a generalized “OmniGen” framework (Ma et al., 17 Nov 2025).