DiffusionDriveV2 Autonomous Driving Framework
- DiffusionDriveV2 is a reinforcement learning-constrained truncated diffusion framework that uses Gaussian mixture modeling and anchored intent representations for multimodal trajectory generation.
- It employs scale-adaptive multiplicative noise to foster robust exploration, ensuring diverse and high-quality trajectories in complex driving scenarios.
- The framework integrates intra-anchor and inter-anchor GRPO, achieving state-of-the-art closed-loop performance on NAVSIM benchmarks while ensuring safety.
DiffusionDriveV2 is a reinforcement learning-constrained truncated diffusion modeling framework for end-to-end autonomous driving, designed to resolve the persistent “diversity–quality dilemma” that arises when leveraging generative diffusion planners. It accomplishes this via a combination of Gaussian mixture modeling, anchored intent representations, scale-adaptive multiplicative exploration noise, and a dual-level reinforcement learning objective, achieving state-of-the-art closed-loop driving performance while preserving trajectory multimodality (Zou et al., 8 Dec 2025).
1. End-to-End Trajectory Generation with Anchored Diffusion
The autonomous driving policy in DiffusionDriveV2 is formulated as a mapping

$$\pi_\theta : o \mapsto \tau = (w_1, \dots, w_H),$$

where $o$ represents processed sensor inputs and $\tau$ is a sequence of future waypoints over a fixed planning horizon $H$. Traditional imitation learning approaches produce single-mode outputs, failing to capture real-world multimodal intent. Vanilla diffusion models, though multimodal, suffer from mode collapse, generating conservative, mean-like behaviors in diverse driving scenarios.
DiffusionDrive introduced discrete intent anchors to partition the action space—e.g., “turn left,” “go straight”—with a truncated diffusion decoder acting over a Gaussian Mixture Model (GMM) prior. Supervision, however, was limited to the anchor closest to the ground truth, leaving other modes unconstrained and yielding low-quality, sometimes invalid trajectories.
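To make the "supervise only the nearest anchor" limitation concrete, here is a minimal NumPy sketch of how such a scheme selects the single anchor that receives imitation supervision; the function name and distance measure are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def nearest_anchor_index(anchors, gt_traj):
    """Pick the anchor whose waypoints are closest to the ground-truth
    trajectory. In DiffusionDrive-style training only this one mode
    receives imitation supervision, leaving the others unconstrained.

    anchors : (K, H, 2) fixed anchor waypoint sequences
    gt_traj : (H, 2) ground-truth waypoints
    """
    # Mean per-waypoint Euclidean distance to the ground truth, per anchor.
    d = np.linalg.norm(anchors - gt_traj[None], axis=-1).mean(axis=-1)  # (K,)
    return int(d.argmin())
```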
DiffusionDriveV2 extends this strategy by incorporating reinforcement learning (RL) based constraints, both to penalize unsafe or low-quality modes and to encourage exploration toward higher-reward behaviors. The framework’s core advancements are its use of scale-adaptive multiplicative exploration noise, Intra-Anchor Group Relative Policy Optimization (GRPO), and Inter-Anchor Truncated GRPO.
2. Gaussian Mixture Modeling and Anchored Trajectory Diffusion
The conditional distribution over future trajectories, given perception features $o$, is modeled as a Gaussian mixture:

$$p(\tau \mid o) = \sum_{k=1}^{K} \pi_k(o)\, \mathcal{N}\!\left(\tau;\, a^k + \mu_k(o),\, \Sigma_k(o)\right),$$

with $\pi_k(o)$ as softmax-weighted intent probabilities over the anchors $a^k$, $\mu_k(o)$ as scene-dependent offsets, and $\Sigma_k(o)$ as the offsets' covariances. At inference, each anchor initiates a truncated diffusion chain, producing a pool of candidate trajectories that reflect distinct high-level driving intents.
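A minimal NumPy sketch of sampling candidate trajectories from this anchored mixture: the function name, the isotropic per-anchor scales, and the sample counts are illustrative assumptions (the paper refines samples through a truncated diffusion chain rather than drawing them directly).

```python
import numpy as np

def sample_anchor_candidates(anchor_logits, base_anchors, anchor_offsets,
                             anchor_scales, n_per_anchor=2, rng=None):
    """Draw candidate trajectories from each anchor's Gaussian component.

    anchor_logits : (K,) unnormalized intent scores (softmax -> mixture weights)
    base_anchors  : (K, H, 2) fixed anchor waypoints a^k
    anchor_offsets: (K, H, 2) scene-dependent mean offsets mu_k(o)
    anchor_scales : (K,) per-anchor isotropic std devs (diagonal covariance)
    """
    rng = rng or np.random.default_rng(0)
    # Softmax intent probabilities pi_k(o).
    w = np.exp(anchor_logits - anchor_logits.max())
    w /= w.sum()
    K, H, D = base_anchors.shape
    cands = np.empty((K, n_per_anchor, H, D))
    for k in range(K):
        mean = base_anchors[k] + anchor_offsets[k]
        cands[k] = mean + anchor_scales[k] * rng.standard_normal((n_per_anchor, H, D))
    return w, cands
```

Each anchor contributes its own sample group, so distinct high-level intents stay represented in the candidate pool.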
3. Scale-Adaptive Multiplicative Noise for Exploration
Exploration in continuous control is a central challenge. DiffusionDriveV2 replaces standard additive Gaussian noise with two-degree-of-freedom multiplicative noise, tailored for vehicle trajectory planning: each trajectory is rescaled by independent longitudinal and lateral factors,

$$\tau' = \tau \odot (s_{\text{lon}}, s_{\text{lat}}), \qquad s_{\text{lon}} \sim \mathcal{N}(1, \sigma_{\text{lon}}^2),\; s_{\text{lat}} \sim \mathcal{N}(1, \sigma_{\text{lat}}^2).$$
This scheme injects controlled scaling in both longitudinal and lateral directions. During RL-based training (DDIM/DDPM noise scale $\eta = 1$), a minimum standard deviation constraint is enforced for both dimensions, preventing entropy collapse and fostering robust exploration. Deterministic sampling ($\eta = 0$) is used during inference.
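The scale-adaptive noise with its minimum-std clamp can be sketched as follows; the function name and the exact sampling of the per-axis factors are assumptions, with only the 0.04 floor taken from the paper's implementation details.

```python
import numpy as np

SIGMA_MIN = 0.04  # minimum per-axis std, from the paper's implementation details

def apply_multiplicative_noise(traj, sigma_lon, sigma_lat, rng=None):
    """Scale-adaptive multiplicative exploration noise (sketch).

    traj: (H, 2) waypoints in the ego frame; axis 0 = longitudinal,
    axis 1 = lateral. One scale factor per axis is drawn around 1, with
    the stds clamped from below to prevent entropy collapse during RL.
    """
    rng = rng or np.random.default_rng()
    s_lon = 1.0 + max(sigma_lon, SIGMA_MIN) * rng.standard_normal()
    s_lat = 1.0 + max(sigma_lat, SIGMA_MIN) * rng.standard_normal()
    # Broadcasting scales the two coordinate axes independently.
    return traj * np.array([s_lon, s_lat])
```

Because the factors multiply the waypoints, the perturbation automatically grows with trajectory extent, which is what makes the noise "scale-adaptive".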
4. Reinforcement Learning Constraints: Intra- and Inter-Anchor GRPO
Trajectories for each anchor are generated by applying a Gaussian denoising policy via the truncated diffusion chain:

$$\pi_\theta(\tau_{t-1} \mid \tau_t, o) = \mathcal{N}\!\left(\tau_{t-1};\, \mu_\theta(\tau_t, t, o),\, \sigma_t^2 I\right).$$
Intra-Anchor Group Relative Policy Optimization (GRPO)
For each anchor $k$, $G$ sample trajectories are generated. Relative advantages are calculated within the anchor group:

$$A^{k,i} = r^{k,i} - \bar{r}^k,$$

with $\bar{r}^k = \frac{1}{G}\sum_{i=1}^{G} r^{k,i}$ the mean reward across group samples. The RL gradient for training is

$$\nabla_\theta J = \mathbb{E}\!\left[\sum_{t} \gamma^{t-1}\, \nabla_\theta \log \pi_\theta\!\left(\tau_{t-1}^{k,i} \mid \tau_t^{k,i}, o\right) A^{k,i}\right],$$

and the intra-anchor loss is

$$\mathcal{L}_{\text{intra}} = -\sum_{k,i,t} \gamma^{t-1} \log \pi_\theta\!\left(\tau_{t-1}^{k,i} \mid \tau_t^{k,i}, o\right) A^{k,i}.$$
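The group-relative advantage computation reduces to subtracting each anchor's mean reward from its samples, as this minimal sketch shows (the function name is illustrative):

```python
import numpy as np

def intra_anchor_advantages(rewards):
    """Group-relative advantages within each anchor group.

    rewards: (K, G) simulator rewards for G samples from each of K anchors.
    Returns (K, G) advantages A^{k,i} = r^{k,i} - mean_i r^{k,i}, so each
    anchor's samples compete only against siblings of the same intent.
    """
    return rewards - rewards.mean(axis=1, keepdims=True)
```

Because the baseline is computed per anchor rather than globally, a low-reward intent (e.g., a sharp turn) is not suppressed merely for scoring below an easier intent's samples.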
Inter-Anchor Truncated GRPO
To provide a global reward signal without mode collapse, negative intra-anchor advantages are truncated to zero,

$$A_{\text{trunc}}^{k,i} = \max\!\left(A^{k,i},\, 0\right),$$

while colliding trajectories are assigned a fixed hard penalty in place of their advantage. These truncated advantages are substituted into the RL loss.
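A minimal sketch of the truncation step; the penalty value of -1.0 is a placeholder, since the exact penalty is not given here.

```python
import numpy as np

def inter_anchor_truncated_advantages(advantages, collided, collision_penalty=-1.0):
    """Truncate negative intra-anchor advantages to zero, then overwrite
    colliding samples with a hard negative penalty.

    advantages: (K, G) intra-anchor advantages A^{k,i}
    collided  : (K, G) boolean collision mask from the simulator
    """
    a = np.maximum(advantages, 0.0)   # keep only positive (improving) signal
    a[collided] = collision_penalty   # hard override for unsafe trajectories
    return a
```

Truncation means weaker modes are never pushed down relative to the globally best one (avoiding collapse onto a single intent), while the collision override still delivers an unconditional safety signal.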
5. Learning Algorithm and Objective Structure
The overall training objective sums the RL loss and a regularized imitation learning (IL) loss computed over the denoising-reconstruction and anchor-classification binary cross-entropy (BCE):

$$\mathcal{L} = \mathcal{L}_{\text{RL}} + \lambda\, \mathcal{L}_{\text{IL}}.$$

High-level learning proceeds as follows:
```
for epoch in 1..E_rl:
  for batch in dataset:
    features ← PerceptionNet(batch.sensors)
    for each anchor k:
      for i = 1..G:
        τ_T ← noisy_anchor(a^k; features)
        τ_0 ← RunTruncatedDiffusion(τ_T; θ, η=1, multiplicative_noise)
        r[k,i] ← SimulatorReward(τ_0)
    for each anchor k:
      compute A^{k,i} and then A_trunc^{k,i}
    L_RL ← sum_{k,i,t} [ -γ^{t-1} log π_θ(·) A_trunc^{k,i} ]
    L_IL ← imitation-learning loss on GT anchor
    θ ← θ − AdamW(∇θ[L_RL + λ L_IL])
```
Following RL-stage training, a two-stage mode selector is trained on frozen generator outputs for 20 epochs, using BCE and margin-rank losses.
6. Experimental Protocol and Metrics
Key experimental details include:
- Datasets: NAVSIM v1 (1,192 train, 136 test scenes) and NAVSIM v2 (with extended closed-loop metrics).
- Inputs: ResNet-34 backbone aligned on BEV-LiDAR plus three front cameras (1024×256).
- PDMS is the main closed-loop score, $\text{PDMS} = \text{NC} \times \text{DAC} \times \frac{5\,\text{EP} + 5\,\text{TTC} + 2\,\text{C}}{12}$, where NC: no-at-fault collisions, DAC: drivable-area compliance, EP: ego progress, TTC: time to collision, C: comfort.
- EPDMS, the extended score on v2, augments PDMS with additional compliance and comfort subscores.
- Diversity: average normalized pairwise waypoint distance over the $N$ candidate trajectories, $\mathrm{Div} = \frac{2}{N(N-1)} \sum_{i<j} \overline{\lVert \tau_i - \tau_j \rVert}$.
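The two metrics above can be sketched in a few lines of NumPy. The 5/5/2 weighting in `pdm_score` follows the standard NAVSIM PDM-score aggregation (assumed here, since the formula was not given verbatim), and the normalization constant in `trajectory_diversity` is left as a parameter because it is unspecified.

```python
import numpy as np

def pdm_score(nc, dac, ep, ttc, comfort):
    """Closed-loop PDM score: hard gates (NC, DAC) multiply a weighted
    average of the soft subscores (assumed NAVSIM weighting 5/5/2)."""
    return nc * dac * (5.0 * ep + 5.0 * ttc + 2.0 * comfort) / 12.0

def trajectory_diversity(trajs, norm=1.0):
    """Average pairwise waypoint distance across N candidate trajectories,
    divided by a normalization constant `norm` (placeholder assumption).

    trajs: (N, H, 2) candidate waypoint sequences.
    """
    n = len(trajs)
    dists = [np.linalg.norm(trajs[i] - trajs[j], axis=-1).mean()
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists)) / norm
```

Note how a single collision (NC = 0) or drivable-area violation (DAC = 0) zeroes the whole score, which is what makes PDMS a safety-gated metric.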
7. Performance Evaluation and Ablation Studies
DiffusionDriveV2 achieves state-of-the-art performance on both NAVSIM v1 and v2. Key results:
| Method | Div. | PDMS@1 | PDMS@5 | PDMS@10 |
|---|---|---|---|---|
| Transfuser | 0.1 | 85.7 | — | — |
| DiffusionDrive | 42.3 | — | — | 75.3 |
| DiffusionDriveV2 | 30.3 | 94.9 | 91.1 | 84.4 |
DiffusionDriveV2 attains 91.2 PDMS on NAVSIM v1 (+3.1 above DiffusionDrive, +2.9 above RL-based DIVER), and 85.5 EPDMS on NAVSIM v2.
Ablations on NAVSIM v1 (RL stage only) indicate:
- Additive vs. multiplicative exploration noise: PDMS 89.7 → 90.1
- Without Intra-Anchor GRPO: PDMS 89.2 → 90.1
- Without Inter-Anchor truncation: PDMS 89.5 → 90.1
This suggests that each RL component contributes to the final closed-loop safety and multimodal intent retention.
8. Implementation and Reproducibility Details
Reinforcement learning is conducted for 10 epochs with batch size 512, using the AdamW optimizer with a 10% warmup and cosine learning-rate decay. The minimum multiplicative noise standard deviation is set to 0.04. The selector stage uses the same optimizer configuration, 20 epochs, and data augmentation with multiplicative noise. Inference employs two denoising steps, as in DiffusionDrive. Code and models are publicly available at https://github.com/hustvl/DiffusionDriveV2.
In conclusion, DiffusionDriveV2 enforces RL constraints at both intra-anchor and inter-anchor levels with scale-adaptive noise and a two-stage selector, achieving an optimal trade-off between diversity and consistent trajectory quality. It sets new state-of-the-art scores on two closed-loop NAVSIM benchmarks with strong multimodality and rigorous safety (Zou et al., 8 Dec 2025).