DiffusionDrive: Diffusion Models for Driving
- DiffusionDrive is a diffusion model–based framework featuring truncated, anchor-guided policies for multi-modal trajectory planning in autonomous driving.
- It employs a cascade transformer decoder integrated with a deep visual backbone to fuse camera and BEV LiDAR data, enabling real-time inference at 45 FPS.
- Recent enhancements in DiffusionDriveV2 leverage reinforcement learning to balance diversity and quality, achieving state-of-the-art performance on planning benchmarks.
DiffusionDrive refers to a family of diffusion model–based frameworks and algorithms for autonomous driving, encompassing both trajectory planning and driving scene generation. In end-to-end driving, DiffusionDrive approaches are notable for modeling the diverse, multi-modal distribution over action or trajectory spaces while enabling real-time, high-quality planning. The term most prominently designates the class of truncated and anchored diffusion models for trajectory generation (Liao et al., 2024), but is also used in scene synthesis (Pronovost et al., 2023) and in diffusion-powered reinforcement learning frameworks (Song et al., 5 Jul 2025). Recent variants, including DiffusionDriveV2 (Zou et al., 8 Dec 2025), extend the approach with reinforcement learning to better balance diversity and quality, achieving state-of-the-art performance on planning-oriented benchmarks.
1. Truncated Diffusion Policies With Multi-Modal Anchors
The core of DiffusionDrive is a truncated diffusion trajectory policy in which the forward process generates noise-corrupted trajectories from a Gaussian mixture centered on anchoring trajectories, each representing a prototypical driving intent (e.g., going straight, turning). Each anchor is learned via clustering on expert demonstrations. The forward noising process for anchor $\tau^i_{\text{anchor}}$ at truncated step $t$ is

$$\tau^i_t = \sqrt{\bar{\alpha}_t}\,\tau^i_{\text{anchor}} + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),$$

where $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$ is the cumulative product of scheduling parameters up to step $t$. The reverse (denoising) process is implemented using a cascade transformer-based decoder and is accomplished in as few as 2 steps (orders of magnitude fewer than vanilla diffusion) by leveraging the high-quality inductive bias of anchor-based initialization (Liao et al., 2024).
By drawing anchor candidates and running short reverse diffusion chains, the policy can sample a diverse set of plausible trajectories consistent with both the expert data manifold and real-time latency constraints.
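As a concrete illustration, the following is a minimal sketch of this truncated, anchor-initialized sampling loop, assuming a generic `denoiser(traj, t, context)` network that predicts the clean trajectory and hypothetical tensor shapes; it is not the reference implementation.

```python
import torch

def truncated_sample(anchors, denoiser, context, alphas_cumprod, t_start=50, steps=2):
    """Noise anchor trajectories to a truncated step, then denoise in a few steps.

    anchors:        (N_anchor, T, 2) clustered expert-intent trajectories
    denoiser:       network predicting the clean trajectory from (noisy_traj, t, context)
    alphas_cumprod: (T_diff,) cumulative products of the noise schedule (bar alpha_t)
    t_start:        truncated starting step, far below the full schedule length
    steps:          number of reverse steps (as few as 2 in DiffusionDrive)
    """
    # Forward noising from the anchors: tau_t = sqrt(a_bar) * anchor + sqrt(1 - a_bar) * eps
    a_bar = alphas_cumprod[t_start]
    traj = a_bar.sqrt() * anchors + (1.0 - a_bar).sqrt() * torch.randn_like(anchors)

    # Short reverse chain with x0-prediction parameterization
    timesteps = torch.linspace(t_start, 1, steps).long()
    for i, t in enumerate(timesteps):
        pred_clean = denoiser(traj, t, context)   # estimate of the clean trajectory
        if i + 1 < steps:
            # re-noise the estimate to the next (smaller) timestep and continue
            a_bar_next = alphas_cumprod[timesteps[i + 1]]
            traj = a_bar_next.sqrt() * pred_clean + (1.0 - a_bar_next).sqrt() * torch.randn_like(traj)
        else:
            traj = pred_clean
    return traj
```

Because sampling starts from the anchor mixture rather than pure Gaussian noise, only a small fraction of the schedule needs to be traversed, which is what permits the 2-step budget.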
2. Model Architecture and Cascade Decoding
The canonical DiffusionDrive architecture employs a deep visual backbone (e.g., ResNet-34) that encodes fused camera and rasterized BEV LiDAR inputs. The context features are injected into a lightweight cascade transformer decoder, which forms the core of the diffusion model. Each cascade layer executes:
- Deformable cross-attention between trajectory waypoints and BEV features.
- Cross-attention across agent and map queries for scene-level context fusion.
- Feed-forward refinement followed by timestep modulation.
- An MLP head outputs both an incremental trajectory offset and a confidence score.
All parameters are shared across diffusion steps to maximize parameter efficiency and facilitate real-time inference (e.g., 7.6 ms total planning time for 2 steps, 45 FPS on an NVIDIA RTX 4090) (Liao et al., 2024).
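A schematic of one such cascade layer is sketched below, substituting standard multi-head cross-attention for the deformable cross-attention over BEV features; module names and dimensions are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class CascadeLayer(nn.Module):
    """Sketch of one cascade decoder layer as described above."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.bev_attn = nn.MultiheadAttention(dim, heads, batch_first=True)       # stand-in for deformable attention
        self.agent_map_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.time_mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, 2 * dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.offset_head = nn.Linear(dim, 2)   # per-waypoint (x, y) refinement
        self.score_head = nn.Linear(dim, 1)    # one confidence logit per trajectory candidate

    def forward(self, traj_q, bev_feats, agent_map_q, t_emb):
        # (1) cross-attention between trajectory waypoint queries and BEV features
        q = self.norm1(traj_q + self.bev_attn(traj_q, bev_feats, bev_feats)[0])
        # (2) cross-attention with agent and map queries for scene-level context fusion
        q = self.norm2(q + self.agent_map_attn(q, agent_map_q, agent_map_q)[0])
        # (3) feed-forward refinement with timestep modulation (scale/shift from t_emb)
        scale, shift = self.time_mlp(t_emb).chunk(2, dim=-1)        # t_emb: (B, dim)
        q = self.norm3(q + self.ffn(q * (1 + scale[:, None]) + shift[:, None]))
        # (4) MLP heads: incremental trajectory offsets and a confidence score
        return self.offset_head(q), self.score_head(q.mean(dim=1)), q
```

Stacking a few such layers with shared weights across diffusion steps gives the lightweight decoder that makes the 2-step budget practical.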
3. Training Objectives, Loss Functions, and Anchoring
Training follows an imitation-learning paradigm in which ground-truth trajectories are clustered to create the anchors. Each training sample involves:
- Forward noising of all anchor prototypes.
- The decoder predicts, for each anchor, (a) the reconstruction of the true trajectory, and (b) a confidence that this anchor matches the ground truth.
- The loss is a weighted sum of L1 regression on the closest anchor plus binary cross-entropy for anchor selection:

$$\mathcal{L} = \sum_i \Big[\, y_i\,\|\hat{\tau}_i - \tau_{\text{gt}}\|_1 + \lambda\,\mathrm{BCE}(\hat{p}_i, y_i)\Big],$$

where $y_i = 1$ for the anchor nearest to $\tau_{\text{gt}}$ and $y_i = 0$ otherwise, $\hat{\tau}_i$ is the reconstruction from anchor $i$, and $\hat{p}_i$ its predicted confidence. Notably, training makes no use of KL or adversarial terms, relying on the diffusion backbone for expressivity.
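The objective above can be sketched as follows; the weight `lambda_cls` and all tensor shapes are assumptions for illustration rather than the paper's exact hyperparameters.

```python
import torch
import torch.nn.functional as F

def diffusiondrive_loss(pred_trajs, pred_logits, gt_traj, lambda_cls=1.0):
    """Anchor-matched training loss (sketch): L1 regression on the anchor closest
    to the ground truth plus BCE over per-anchor confidence logits.

    pred_trajs:  (B, N_anchor, T, 2) reconstructed trajectories, one per anchor
    pred_logits: (B, N_anchor)       confidence logits
    gt_traj:     (B, T, 2)           ground-truth trajectory
    """
    # distance of every anchor's reconstruction to the ground truth
    dists = (pred_trajs - gt_traj[:, None]).abs().sum(dim=(-1, -2))   # (B, N_anchor)
    nearest = dists.argmin(dim=1)                                     # (B,)

    # L1 regression only on the nearest anchor
    batch_idx = torch.arange(pred_trajs.size(0), device=pred_trajs.device)
    reg_loss = F.l1_loss(pred_trajs[batch_idx, nearest], gt_traj)

    # binary cross-entropy: y_i = 1 for the nearest anchor, 0 otherwise
    targets = F.one_hot(nearest, pred_logits.size(1)).float()
    cls_loss = F.binary_cross_entropy_with_logits(pred_logits, targets)

    return reg_loss + lambda_cls * cls_loss
```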
4. Reinforcement Learning Extensions and DiffusionDriveV2
The second-generation DiffusionDriveV2 (Zou et al., 8 Dec 2025) incorporates reinforcement learning to address the "diversity–quality dilemma" present in pure imitation anchored diffusion. Two technical innovations drive V2:
- Scale-Adaptive Multiplicative Noise: Rather than standard additive exploration, trajectory perturbations are multiplicative, preserving motion smoothness across spatial scales.
- GRPO (Group Relative Policy Optimization): The policy gradient is computed separately within each anchor (intra-anchor), avoiding invalid comparisons across qualitatively different anchors (e.g., turning vs. straight modes). Inter-anchor truncated GRPO further penalizes globally unsafe candidates by assigning them strongly negative rewards (a minimal sketch of both mechanisms appears at the end of this section).
This design preserves both safety and multi-modality while mitigating mode collapse.
Imitation loss is added with small weight to anchor the policy near demonstrated behavior, but the reinforcement loss dominates policy improvement.
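The two mechanisms can be sketched as follows; the noise level `sigma` and the normalization details are illustrative assumptions, not values taken from DiffusionDriveV2.

```python
import torch

def scale_adaptive_perturb(traj, sigma=0.05):
    """Multiplicative exploration noise (sketch): waypoints are scaled rather than
    shifted, so perturbations stay proportional to the trajectory's spatial extent
    and short, slow maneuvers are not drowned in large additive noise."""
    return traj * (1.0 + sigma * torch.randn_like(traj))

def intra_anchor_advantages(rewards, anchor_ids):
    """Group-relative advantages computed separately within each anchor (sketch):
    rollouts are compared only against rollouts sharing the same driving intent.

    rewards:    (N_rollout,) scalar rewards
    anchor_ids: (N_rollout,) index of the anchor each rollout was sampled from
    """
    adv = torch.zeros_like(rewards)
    for a in anchor_ids.unique():
        mask = anchor_ids == a
        group = rewards[mask]
        adv[mask] = (group - group.mean()) / (group.std(unbiased=False) + 1e-6)
    return adv
```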
5. Performance Evaluation and Diversity Metrics
DiffusionDrive and its variants are benchmarked using closed-loop planning scores such as PDMS/EPDMS on the NAVSIM datasets and scenario-based metrics (no-at-fault collisions, drivable area compliance, time-to-collision, comfort, ego progress). Key quantitative results include:
| Model | PDMS (NAVSIM v1) | EPDMS (NAVSIM v2) | Diversity | FPS |
|---|---|---|---|---|
| DiffusionDrive | 88.1 | – | 74% | 45 |
| DiffusionDriveV2 | 91.2 | 85.5 | comparable | 45 |
| Hydra-MDP | 86.5 | – | – | – |
| TransDiffuser | 94.85* | – | 70% (mean IoU) | >20 |
(*TransDiffuser is not anchor-based and employs a decorrelation mechanism (Jiang et al., 14 May 2025).)
DiffusionDrive outperforms anchor-based baselines with orders-of-magnitude fewer anchor modes and far fewer denoising steps (Liao et al., 2024, Zou et al., 8 Dec 2025). Average diversity (e.g., mean pairwise distance normalized by trajectory scale) rises sharply relative to direct-regression or single-path diffusion competitors (Song et al., 5 Jul 2025). Diversity is further quantified using normalized pairwise spread metrics at each prediction time step, with DiffusionDriveV2 and other RL-constrained extensions attaining high diversity scores and low collision rates even in adversarial or corrupted scenarios.
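A normalized pairwise-spread metric of this kind can be sketched as follows; the exact normalization used in the cited papers may differ.

```python
import torch

def trajectory_diversity(trajs):
    """Mean pairwise L2 distance between sampled trajectories for one scene,
    normalized by the average trajectory scale (illustrative sketch).

    trajs: (N, T, 2) trajectories sampled for a single scene, N > 1
    """
    flat = trajs.reshape(trajs.size(0), -1)        # (N, T*2)
    pairwise = torch.cdist(flat, flat)             # (N, N) pairwise L2 distances
    n = flat.size(0)
    mean_dist = pairwise.sum() / (n * (n - 1))     # exclude the zero diagonal
    scale = trajs.norm(dim=-1).mean() + 1e-6       # average waypoint distance from origin
    return (mean_dist / scale).item()
```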
6. Applications, Extensions, and Other Variants
DiffusionDrive refers primarily to trajectory planners, but conditional diffusion generative models have also been applied to driving scene synthesis (Pronovost et al., 2023) and to enhancing simulator visual realism (Bu et al., 2024). These variants leverage similar noise-injection and denoising procedures and benefit from regional/contextual conditioning or plug-and-play visual adapters.
Within the planning domain, extensions include:
- Flexible Classifier Guidance: Score modifications at inference time enforce safety (collision avoidance, lane-keeping), speed, or comfort constraints by injecting differentiable cost gradients into the denoising process (Zheng et al., 26 Jan 2025); see the sketch after this list.
- Multi-Head Diffusion and LLM Integration: Strategy-level planning with multi-head decoders enables user-guided, instruction-driven style switching, integrated with LLM prompts for real-time policy selection (Ding et al., 23 Aug 2025).
- Reinforced Diffusion (DIVER): Group Relative Policy Optimization (GRPO) and reward-based diffusion training improve diversity and robustness on NAVSIM and nuScenes, reducing mode collapse (Song et al., 5 Jul 2025).
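The classifier-guidance idea amounts to nudging each denoising step along the negative gradient of a differentiable cost; a minimal sketch, with `denoiser` and `cost_fn` as placeholder assumptions for the planner's network and cost model:

```python
import torch

def guided_denoise_step(traj, denoiser, t, context, cost_fn, guidance_scale=1.0):
    """One denoising step with classifier-style guidance (sketch): the gradient of a
    differentiable cost (e.g., collision or lane-departure penalty) nudges the
    predicted clean trajectory before the chain continues."""
    pred_clean = denoiser(traj, t, context)

    with torch.enable_grad():
        guided = pred_clean.detach().requires_grad_(True)
        cost = cost_fn(guided).sum()
        grad = torch.autograd.grad(cost, guided)[0]

    # descend the cost surface: push the prediction away from high-cost regions
    return pred_clean - guidance_scale * grad
```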
7. Limitations and Future Research Directions
Current limitations of DiffusionDrive-family planners include:
- Fixed Anchor Set: Anchors are typically obtained by offline clustering (e.g., k-means) and are not adapted online or during RL finetuning.
- Discrete Mode Design: Larger or continuous anchor sets, or multi-heads parameterized by continuous intent vectors, could provide richer behavioral coverage.
- Inference Latency: While truncated step schedules achieve real-time performance, further speed gains could be realized via consistency models or ODE-based solvers.
- LLM/Instruction Dependence: LLM integration for style selection introduces dependence on prompt quality and LLM reliability.
- Safety and Exploration Trade-offs: Tuning RL rewards and exploration perturbations is non-trivial in long-horizon, high-dimensional trajectory spaces.
Possible research extensions include online anchor adaptation, joint perception–planning training, end-to-end LLM-to-diffusion fine-tuning, improved fast-consistency samplers, and tighter integration of world model–based prediction for complex multi-agent and open-world settings (Zou et al., 8 Dec 2025, Ding et al., 23 Aug 2025).
References
- (Liao et al., 2024) (DiffusionDrive: Truncated Diffusion Model for End-to-End Autonomous Driving)
- (Zou et al., 8 Dec 2025) (DiffusionDriveV2: Reinforcement Learning-Constrained Truncated Diffusion Modeling in End-to-End Autonomous Driving)
- (Song et al., 5 Jul 2025) (DIVER)
- (Zheng et al., 26 Jan 2025) (Diffusion-Based Planning for Autonomous Driving with Flexible Guidance)
- (Ding et al., 23 Aug 2025) (Drive As You Like: Strategy-Level Motion Planning Based on A Multi-Head Diffusion Model)
- (Jiang et al., 14 May 2025) (TransDiffuser)
- (Pronovost et al., 2023) (Scene Diffusion)
- (Bu et al., 2024) (DRIVE: Diffusion-based Realism Improvement for Virtual Environments)