
DiffusionDriveV2 Autonomous Driving Framework

Updated 16 December 2025
  • DiffusionDriveV2 is a reinforcement learning-constrained truncated diffusion framework that uses Gaussian mixture modeling and anchored intent representations for multimodal trajectory generation.
  • It employs scale-adaptive multiplicative noise to foster robust exploration, ensuring diverse and high-quality trajectories in complex driving scenarios.
  • The framework integrates intra-anchor and inter-anchor GRPO, achieving state-of-the-art closed-loop performance on NAVSIM benchmarks while ensuring safety.

DiffusionDriveV2 is a reinforcement learning-constrained truncated diffusion modeling framework for end-to-end autonomous driving, designed to resolve the persistent “diversity–quality dilemma” that arises when leveraging generative diffusion planners. It accomplishes this via a combination of Gaussian mixture modeling, anchored intent representations, scale-adaptive multiplicative exploration noise, and a dual-level reinforcement learning objective, achieving state-of-the-art closed-loop driving performance while preserving trajectory multimodality (Zou et al., 8 Dec 2025).

1. End-to-End Trajectory Generation with Anchored Diffusion

The autonomous driving policy in DiffusionDriveV2 is formulated as a mapping

$$\pi_\theta: z \mapsto \tau = \{(x_n, y_n)\}_{n=1}^{N_f}$$

where $z$ represents processed sensor inputs and $\tau$ is a sequence of future waypoints over a fixed planning horizon $N_f$. Traditional imitation learning approaches produce single-mode outputs, failing to capture real-world multimodal intent. Vanilla diffusion models, though multimodal, suffer from mode collapse, generating conservative, mean-like behaviors in diverse driving scenarios.

DiffusionDrive introduced $N_{\text{anchor}}$ discrete intent anchors $\{a^k\}$ to partition the action space (e.g., "turn left," "go straight"), with a truncated diffusion decoder acting over a Gaussian Mixture Model (GMM) prior. Supervision, however, was limited to the anchor closest to the ground truth, leaving other modes unconstrained and yielding low-quality, sometimes invalid trajectories.

DiffusionDriveV2 extends this strategy by incorporating reinforcement learning (RL) based constraints, both to penalize unsafe or low-quality modes and to encourage exploration toward higher-reward behaviors. The framework’s core advancements are its use of scale-adaptive multiplicative exploration noise, Intra-Anchor Group Relative Policy Optimization (GRPO), and Inter-Anchor Truncated GRPO.

2. Gaussian Mixture Modeling and Anchored Trajectory Diffusion

The conditional distribution over future trajectories, given perception $z$, is modeled as a Gaussian mixture:

$$p(\tau \mid z) = \sum_{k=1}^{N_{\text{anchor}}} s(a^k \mid z)\, \mathcal{N}\big(\tau;\ a^k + \mu^k(z),\ \Sigma^k(z)\big)$$

with $s(a^k \mid z)$ the softmax-weighted intent probabilities, $\mu^k(z)$ the scene-dependent offsets, and $\Sigma^k(z)$ the offsets' covariances. At inference, each anchor $a^k$ initiates a truncated diffusion chain:

$$\tau_t^k = \sqrt{\bar\alpha_t}\, a^k + \sqrt{1-\bar\alpha_t}\, \epsilon,\quad \epsilon \sim \mathcal{N}(0, I),\quad t = 1 \ldots T_{\text{trunc}}$$

producing a pool of candidate trajectories that reflect distinct high-level driving intents.
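The anchor-seeded chain initialization above can be sketched in NumPy; `noisy_anchor_init` is a hypothetical helper name (not from the released code), illustrating only the forward-noising step:

```python
import numpy as np

def noisy_anchor_init(anchors, alpha_bar_t, rng=None):
    """Seed one truncated-diffusion chain per intent anchor.

    anchors: (N_anchor, N_f, 2) array of anchor waypoints a^k.
    alpha_bar_t: cumulative noise-schedule value at the truncation step.
    Returns tau_t^k = sqrt(abar_t) a^k + sqrt(1 - abar_t) eps.
    """
    rng = rng or np.random.default_rng(0)
    eps = rng.standard_normal(anchors.shape)  # eps ~ N(0, I)
    return np.sqrt(alpha_bar_t) * anchors + np.sqrt(1.0 - alpha_bar_t) * eps
```

At `alpha_bar_t = 1` the chain starts exactly at the anchor; smaller values inject more Gaussian perturbation around each intent.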

3. Scale-Adaptive Multiplicative Noise for Exploration

Exploration in continuous control is a central challenge. DiffusionDriveV2 replaces standard additive Gaussian noise with two-degree-of-freedom multiplicative noise tailored for vehicle trajectory planning:

$$\tau' = (I + \epsilon_{\text{mul}})\, \tau$$

where

$$\epsilon_{\text{mul}} = \mathrm{diag}(\epsilon_{\text{long}}, \epsilon_{\text{lat}}),\quad \epsilon_{\text{long}}, \epsilon_{\text{lat}} \sim \mathcal{N}(0, \sigma^2)$$

This scheme injects controlled scaling in both the longitudinal and lateral directions. During RL-based training (DDIM/DDPM noise scale $\eta = 1$), a minimum standard deviation constraint $\sigma_{\min} \ge 0.04$ is enforced in both dimensions, preventing entropy collapse and fostering robust exploration. Deterministic sampling ($\eta = 0$) is used during inference.
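A minimal sketch of the noise scheme, assuming one scale sample per trajectory and ego-frame waypoints with columns (longitudinal, lateral); the helper name and sampling granularity are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def apply_multiplicative_noise(traj, sigma, sigma_min=0.04, rng=None):
    """Two-DOF multiplicative exploration noise: tau' = (I + eps_mul) tau.

    traj: (N_f, 2) waypoints, columns = (longitudinal, lateral).
    sigma is floored at sigma_min to prevent entropy collapse during RL.
    """
    rng = rng or np.random.default_rng(0)
    sigma = max(sigma, sigma_min)                  # enforce sigma_min >= 0.04
    eps_long, eps_lat = rng.normal(0.0, sigma, size=2)
    scale = np.array([1.0 + eps_long, 1.0 + eps_lat])  # diag(I + eps_mul)
    return traj * scale                            # broadcasts over waypoints
```

Because the noise multiplies the waypoints, its absolute magnitude scales with the trajectory itself, so short, slow maneuvers and long, fast ones are perturbed proportionally.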

4. Reinforcement Learning Constraints: Intra- and Inter-Anchor GRPO

Trajectories for each anchor are generated by applying a Gaussian denoising policy via the truncated diffusion chain:

$$\pi_\theta(\tau_{t-1}^k \mid \tau_t^k, z, a^k) = \mathcal{N}\big(\tau_{t-1}^k;\ \mu_\theta(\tau_t^k, t, z, a^k),\ \eta(1-\alpha_t) I\big)$$
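For intuition, the per-step log-probability that the GRPO objectives weight can be written out directly for this isotropic Gaussian; `mu` stands in for the network output $\mu_\theta$, so this is a simplified illustration rather than the released code:

```python
import numpy as np

def denoise_logprob(tau_prev, mu, alpha_t, eta=1.0):
    """Log-density of the Gaussian denoising policy at tau_{t-1}.

    Per-dimension variance is eta * (1 - alpha_t), matching the
    N(mu_theta, eta(1 - alpha_t) I) policy above.
    """
    var = eta * (1.0 - alpha_t)
    diff = np.ravel(tau_prev) - np.ravel(mu)
    d = diff.size
    return -0.5 * (d * np.log(2.0 * np.pi * var) + diff @ diff / var)
```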

Intra-Anchor Group Relative Policy Optimization (GRPO)

For each anchor $k$, $G$ sample trajectories $\{\tau_0^{k,i}\}_{i=1}^G$ are generated. Relative advantages are computed within the anchor group:

$$A^{k,i} = \frac{r(\tau_0^{k,i}) - \overline{r_k}}{\mathrm{std}\big(\{r(\tau_0^{k,1}), \ldots, r(\tau_0^{k,G})\}\big)}$$

with $\overline{r_k}$ the mean reward across group samples. The RL gradient for training is

$$\nabla_\theta J = \mathbb{E}\Big[\sum_{t=1}^{T_{\text{trunc}}} \gamma^{t-1}\, \nabla_\theta \log \pi_\theta(\tau_{t-1}^{k,i} \mid \tau_t^{k,i})\, A^{k,i}\Big]$$

and the intra-anchor loss is

$$L_{RL} = -\frac{1}{N_{\text{anchor}}} \sum_{k=1}^{N_{\text{anchor}}} \frac{1}{G} \sum_{i=1}^{G} \frac{1}{T_{\text{trunc}}} \sum_{t=1}^{T_{\text{trunc}}} \gamma^{t-1} \log \pi_\theta(\tau_{t-1}^{k,i} \mid \tau_t^{k,i})\, A^{k,i}$$
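The group-relative advantage and the per-anchor loss term can be sketched as below, assuming the per-step log-probabilities have already been computed; function names are illustrative:

```python
import numpy as np

def intra_anchor_advantages(rewards, eps=1e-8):
    """Group-relative advantages A^{k,i} within one anchor's G samples.

    rewards: (G,) simulator rewards r(tau_0^{k,i}).
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def intra_anchor_loss(logps, advantages, gamma=0.8):
    """REINFORCE-style loss contribution for one anchor group.

    logps: (G, T_trunc) per-step log pi_theta(tau_{t-1} | tau_t).
    advantages: (G,) group-relative advantages.
    """
    G, T = logps.shape
    discounts = gamma ** np.arange(T)            # gamma^{t-1}, t = 1..T_trunc
    weighted = (logps * discounts).mean(axis=1)  # 1/T_trunc sum_t ...
    return -(weighted * advantages).mean()       # 1/G sum_i, negated
```

Averaging over anchors then gives the full $L_{RL}$; normalizing within each group means every anchor contributes a zero-mean learning signal regardless of how good its mode is in absolute terms.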

Inter-Anchor Truncated GRPO

To provide a global reward signal without inducing mode collapse, negative intra-anchor advantages are truncated to zero, and colliding trajectories are assigned a hard penalty of $-1$:

$$A_{\text{trunc}}^{k,i} = \begin{cases} -1, & \text{if } \tau_0^{k,i} \text{ collides} \\ \max(0, A^{k,i}), & \text{otherwise} \end{cases}$$

These truncated advantages are substituted into the RL loss.
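The truncation rule is a one-liner over the intra-anchor advantages; this sketch assumes a boolean collision mask per sample:

```python
import numpy as np

def truncated_advantages(adv, collided):
    """Inter-anchor truncation: clip negatives to 0, hard -1 on collision.

    adv: (N_anchor, G) intra-anchor advantages A^{k,i}.
    collided: boolean mask of the same shape.
    """
    out = np.maximum(0.0, np.asarray(adv, dtype=float))
    out[np.asarray(collided, dtype=bool)] = -1.0  # hard safety penalty
    return out
```

Clipping negatives keeps weaker-but-valid modes from being pushed toward the globally best one (preserving multimodality), while the $-1$ penalty still actively suppresses colliding trajectories.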

5. Learning Algorithm and Objective Structure

The overall training objective sums the RL loss and a regularized imitation learning (IL) loss, computed over the denoising reconstruction and anchor-classification binary cross-entropy (BCE):

$$L_{\text{total}} = L_{RL} + \lambda L_{IL},\quad \lambda \in (0, 1)$$

High-level learning proceeds as follows:

for epoch in 1..E_rl:
    for batch in dataset:
        features ← PerceptionNet(batch.sensors)
        for each anchor k:
            for i = 1..G:
                τ_T ← noisy_anchor(a^k; features)
                τ_0 ← RunTruncatedDiffusion(τ_T; θ, η=1, multiplicative_noise)
                r[k,i] ← SimulatorReward(τ_0)
        for each anchor k:
            compute A^{k,i} and then A_trunc^{k,i}
        L_RL ← sum_{k,i,t} [ -γ^{t-1} log π_θ(·) A_trunc^{k,i} ]
        L_IL ← imitation-learning loss on GT anchor
        θ ← θ − AdamW(∇_θ[L_RL + λ L_IL])

Following RL-stage training, a two-stage mode selector is trained on frozen generator outputs for 20 epochs, using BCE and margin-rank losses.

6. Experimental Protocol and Metrics

Key experimental details include:

  • Datasets: NAVSIM v1 (1,192 train, 136 test scenes) and NAVSIM v2 (with extended closed-loop metrics).
  • Inputs: ResNet-34 backbone aligned on BEV-LiDAR plus three front cameras (1024×256).
  • PDMS is the main closed-loop score: $\mathrm{PDMS} = \mathrm{NC} \times \mathrm{DAC} \times \frac{5\,\mathrm{EP} + 5\,\mathrm{TTC} + 2\,C}{12}$, where NC: no-at-fault collisions, DAC: drivable-area compliance, EP: ego progress, TTC: time to collision, C: comfort.
  • Extended score on v2, EPDMS: $\mathrm{EPDMS} = \mathrm{NC} \times \mathrm{DAC} \times \mathrm{DDC} \times \mathrm{TL} \times \frac{5\,\mathrm{TTC} + 2\,C + 5\,\mathrm{EP} + 5\,\mathrm{LK} + 5\,\mathrm{EC}}{22}$
  • Diversity: average normalized pairwise waypoint distance: $Div_{\text{raw}}^n = \frac{2}{M(M-1)}\sum_{i<j}\|p_n^i - p_n^j\|_2$, $Div^n = \min\Big(1,\ \frac{Div_{\text{raw}}^n}{\epsilon + \frac{1}{M}\sum_m \|p_n^m\|_2}\Big)$
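The two scoring formulas above translate directly into code; this is a minimal sketch assuming scalar subscores in $[0, 1]$ and `(M, 2)` candidate waypoints at one horizon step:

```python
import numpy as np

def pdms(nc, dac, ep, ttc, comfort):
    """Closed-loop PDMS per the NAVSIM v1 formula above."""
    return nc * dac * (5 * ep + 5 * ttc + 2 * comfort) / 12.0

def diversity(waypoints, eps=1e-6):
    """Normalized pairwise waypoint diversity Div^n at one horizon step.

    waypoints: (M, 2) positions p_n^m of the M candidate trajectories.
    """
    M = len(waypoints)
    diffs = waypoints[:, None, :] - waypoints[None, :, :]
    pair = np.linalg.norm(diffs, axis=-1)
    # mean over i < j equals 2 / (M(M-1)) * sum_{i<j} ||p_i - p_j||
    div_raw = pair[np.triu_indices(M, k=1)].mean()
    norm = np.linalg.norm(waypoints, axis=-1).mean()
    return min(1.0, div_raw / (eps + norm))
```

A planner with perfect subscores gets PDMS 1.0, and any multiplicative term (collision, drivable area) hitting zero zeroes the whole score, which is what makes PDMS a safety-gated metric.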

7. Performance Evaluation and Ablation Studies

DiffusionDriveV2 achieves state-of-the-art performance on both NAVSIM v1 and v2. Key results:

| Method | Div. | PDMS@1 | PDMS@5 | PDMS@10 |
|---|---|---|---|---|
| Transfuser$_{TD}$ | 0.1 | 85.7 | | |
| DiffusionDrive | 42.3 | 75.3 | | |
| DiffusionDriveV2 | 30.3 | 94.9 | 91.1 | 84.4 |

DiffusionDriveV2 attains 91.2 PDMS on NAVSIM v1 (+3.1 above DiffusionDrive, +2.9 above RL-based DIVER), and 85.5 EPDMS on NAVSIM v2.

Ablations on NAVSIM v1 (RL stage only) indicate:

  • Switching from additive to multiplicative exploration noise: PDMS 89.7 → 90.1
  • Adding Intra-Anchor GRPO: PDMS 89.2 → 90.1
  • Adding Inter-Anchor truncation: PDMS 89.5 → 90.1

This suggests that each RL component contributes to the final closed-loop safety and multimodal intent retention.

8. Implementation and Reproducibility Details

Reinforcement learning is conducted for 10 epochs with batch size 512, the AdamW optimizer (learning rate $2 \times 10^{-4}$, weight decay $1 \times 10^{-4}$), 10% warmup, cosine decay, and $\gamma = 0.8$. The minimum multiplicative noise standard deviation is set to 0.04. The selector stage uses the same optimizer configuration, 20 epochs, and data augmentation with multiplicative noise sampled in $[0.1, 0.2]$. Inference employs two denoising steps, as in DiffusionDrive. Code and models are publicly available at https://github.com/hustvl/DiffusionDriveV2.
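The warmup-plus-cosine learning-rate schedule from the configuration above can be reproduced as a small pure-Python function; the exact step granularity (per-step vs. per-epoch) is an assumption here:

```python
import math

def lr_at(step, total_steps, base_lr=2e-4, warmup_frac=0.1):
    """Linear warmup over 10% of steps, then cosine decay to zero."""
    warmup = max(1, int(warmup_frac * total_steps))
    if step < warmup:
        return base_lr * (step + 1) / warmup          # linear ramp-up
    progress = (step - warmup) / max(1, total_steps - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```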

In conclusion, DiffusionDriveV2 enforces RL constraints at both intra-anchor and inter-anchor levels with scale-adaptive noise and a two-stage selector, achieving an optimal trade-off between diversity and consistent trajectory quality. It sets new state-of-the-art scores on two closed-loop NAVSIM benchmarks with strong multimodality and rigorous safety (Zou et al., 8 Dec 2025).
