DiffusionDriveV2 Autonomous Driving Framework
- DiffusionDriveV2 is a reinforcement learning-constrained truncated diffusion framework that uses Gaussian mixture modeling and anchored intent representations for multimodal trajectory generation.
- It employs scale-adaptive multiplicative noise to foster robust exploration, ensuring diverse and high-quality trajectories in complex driving scenarios.
- The framework integrates intra-anchor and inter-anchor GRPO, achieving state-of-the-art closed-loop performance on NAVSIM benchmarks while ensuring safety.
DiffusionDriveV2 is a reinforcement learning-constrained truncated diffusion modeling framework for end-to-end autonomous driving, designed to resolve the persistent “diversity–quality dilemma” that arises when leveraging generative diffusion planners. It accomplishes this via a combination of Gaussian mixture modeling, anchored intent representations, scale-adaptive multiplicative exploration noise, and a dual-level reinforcement learning objective, achieving state-of-the-art closed-loop driving performance while preserving trajectory multimodality (Zou et al., 8 Dec 2025).
1. End-to-End Trajectory Generation with Anchored Diffusion
The autonomous driving policy in DiffusionDriveV2 is formulated as a mapping

$$\pi_\theta : o \mapsto \tau = (w_1, \dots, w_H),$$

where $o$ represents processed sensor inputs and $\tau$ is a sequence of future waypoints over a fixed planning horizon $H$. Traditional imitation learning approaches produce single-mode outputs, failing to capture real-world multimodal intent. Vanilla diffusion models, though multimodal, suffer from mode collapse, generating conservative, mean-like behaviors in diverse driving scenarios.
DiffusionDrive introduced discrete intent anchors to partition the action space—e.g., “turn left,” “go straight”—with a truncated diffusion decoder acting over a Gaussian Mixture Model (GMM) prior. Supervision, however, was limited to the anchor closest to the ground truth, leaving other modes unconstrained and yielding low-quality, sometimes invalid trajectories.
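To make the "supervise only the nearest anchor" limitation concrete, here is a minimal NumPy sketch of how such a scheme selects the single anchor that receives imitation supervision; the function name and distance measure are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def nearest_anchor_index(anchors, gt_traj):
    """Pick the anchor whose waypoints are closest to the ground-truth
    trajectory. In DiffusionDrive-style training only this one mode
    receives imitation supervision, leaving the others unconstrained.

    anchors : (K, H, 2) fixed anchor waypoint sequences
    gt_traj : (H, 2) ground-truth waypoints
    """
    # Mean per-waypoint Euclidean distance to the ground truth, per anchor.
    d = np.linalg.norm(anchors - gt_traj[None], axis=-1).mean(axis=-1)  # (K,)
    return int(d.argmin())
```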
DiffusionDriveV2 extends this strategy by incorporating reinforcement learning (RL) based constraints, both to penalize unsafe or low-quality modes and to encourage exploration toward higher-reward behaviors. The framework’s core advancements are its use of scale-adaptive multiplicative exploration noise, Intra-Anchor Group Relative Policy Optimization (GRPO), and Inter-Anchor Truncated GRPO.
2. Gaussian Mixture Modeling and Anchored Trajectory Diffusion
The conditional distribution over future trajectories, given perception features $o$, is modeled as a Gaussian mixture:

$$p(\tau \mid o) = \sum_{k=1}^{K} \pi_k(o)\, \mathcal{N}\!\left(\tau;\, a^k + \mu_k(o),\, \Sigma_k(o)\right),$$

with $\pi_k(o)$ as softmax-weighted intent probabilities over the anchors $a^k$, $\mu_k(o)$ as scene-dependent offsets, and $\Sigma_k(o)$ as the offsets' covariances. At inference, each anchor initiates a truncated diffusion chain, producing a pool of candidate trajectories that reflect distinct high-level driving intents.
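A minimal NumPy sketch of sampling candidate trajectories from this anchored mixture: the function name, the isotropic per-anchor scales, and the sample counts are illustrative assumptions (the paper refines samples through a truncated diffusion chain rather than drawing them directly).

```python
import numpy as np

def sample_anchor_candidates(anchor_logits, base_anchors, anchor_offsets,
                             anchor_scales, n_per_anchor=2, rng=None):
    """Draw candidate trajectories from each anchor's Gaussian component.

    anchor_logits : (K,) unnormalized intent scores (softmax -> mixture weights)
    base_anchors  : (K, H, 2) fixed anchor waypoints a^k
    anchor_offsets: (K, H, 2) scene-dependent mean offsets mu_k(o)
    anchor_scales : (K,) per-anchor isotropic std devs (diagonal covariance)
    """
    rng = rng or np.random.default_rng(0)
    # Softmax intent probabilities pi_k(o).
    w = np.exp(anchor_logits - anchor_logits.max())
    w /= w.sum()
    K, H, D = base_anchors.shape
    cands = np.empty((K, n_per_anchor, H, D))
    for k in range(K):
        mean = base_anchors[k] + anchor_offsets[k]
        cands[k] = mean + anchor_scales[k] * rng.standard_normal((n_per_anchor, H, D))
    return w, cands
```

Each anchor contributes its own sample group, so distinct high-level intents stay represented in the candidate pool.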
3. Scale-Adaptive Multiplicative Noise for Exploration
Exploration in continuous control is a central challenge. DiffusionDriveV2 replaces standard additive Gaussian noise with two-degree-of-freedom multiplicative noise, tailored for vehicle trajectory planning: each trajectory is rescaled by independent longitudinal and lateral factors,

$$\tau' = \tau \odot (s_{\text{lon}}, s_{\text{lat}}), \qquad s_{\text{lon}} \sim \mathcal{N}(1, \sigma_{\text{lon}}^2),\; s_{\text{lat}} \sim \mathcal{N}(1, \sigma_{\text{lat}}^2).$$
This scheme injects controlled scaling in both longitudinal and lateral directions. During RL-based training (DDIM/DDPM noise scale $\eta = 1$), a minimum standard deviation constraint is enforced for both dimensions, preventing entropy collapse and fostering robust exploration. Deterministic sampling ($\eta = 0$) is used during inference.
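The scale-adaptive noise with its minimum-std clamp can be sketched as follows; the function name and the exact sampling of the per-axis factors are assumptions, with only the 0.04 floor taken from the paper's implementation details.

```python
import numpy as np

SIGMA_MIN = 0.04  # minimum per-axis std, from the paper's implementation details

def apply_multiplicative_noise(traj, sigma_lon, sigma_lat, rng=None):
    """Scale-adaptive multiplicative exploration noise (sketch).

    traj: (H, 2) waypoints in the ego frame; axis 0 = longitudinal,
    axis 1 = lateral. One scale factor per axis is drawn around 1, with
    the stds clamped from below to prevent entropy collapse during RL.
    """
    rng = rng or np.random.default_rng()
    s_lon = 1.0 + max(sigma_lon, SIGMA_MIN) * rng.standard_normal()
    s_lat = 1.0 + max(sigma_lat, SIGMA_MIN) * rng.standard_normal()
    # Broadcasting scales the two coordinate axes independently.
    return traj * np.array([s_lon, s_lat])
```

Because the factors multiply the waypoints, the perturbation automatically grows with trajectory extent, which is what makes the noise "scale-adaptive".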
4. Reinforcement Learning Constraints: Intra- and Inter-Anchor GRPO
Trajectories for each anchor are generated by applying a Gaussian denoising policy via the truncated diffusion chain:

$$\pi_\theta(\tau_{t-1} \mid \tau_t, o) = \mathcal{N}\!\left(\tau_{t-1};\, \mu_\theta(\tau_t, t, o),\, \sigma_t^2 I\right).$$
Intra-Anchor Group Relative Policy Optimization (GRPO)
For each anchor $k$, $G$ sample trajectories are generated. Relative advantages are calculated within the anchor group:

$$A^{k,i} = r^{k,i} - \bar{r}^k,$$

with $\bar{r}^k = \frac{1}{G}\sum_{i=1}^{G} r^{k,i}$ the mean reward across group samples. The RL gradient for training is

$$\nabla_\theta J = \mathbb{E}\!\left[\sum_{t} \gamma^{t-1}\, \nabla_\theta \log \pi_\theta\!\left(\tau_{t-1}^{k,i} \mid \tau_t^{k,i}, o\right) A^{k,i}\right],$$

and the intra-anchor loss is

$$\mathcal{L}_{\text{intra}} = -\sum_{k,i,t} \gamma^{t-1} \log \pi_\theta\!\left(\tau_{t-1}^{k,i} \mid \tau_t^{k,i}, o\right) A^{k,i}.$$
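The group-relative advantage computation reduces to subtracting each anchor's mean reward from its samples, as this minimal sketch shows (the function name is illustrative):

```python
import numpy as np

def intra_anchor_advantages(rewards):
    """Group-relative advantages within each anchor group.

    rewards: (K, G) simulator rewards for G samples from each of K anchors.
    Returns (K, G) advantages A^{k,i} = r^{k,i} - mean_i r^{k,i}, so each
    anchor's samples compete only against siblings of the same intent.
    """
    return rewards - rewards.mean(axis=1, keepdims=True)
```

Because the baseline is computed per anchor rather than globally, a low-reward intent (e.g., a sharp turn) is not suppressed merely for scoring below an easier intent's samples.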
Inter-Anchor Truncated GRPO
To provide a global reward signal without mode collapse, negative intra-anchor advantages are truncated to zero,

$$A_{\text{trunc}}^{k,i} = \max\!\left(A^{k,i},\, 0\right),$$

while colliding trajectories are assigned a fixed hard penalty in place of their advantage. These truncated advantages are substituted into the RL loss.
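A minimal sketch of the truncation step; the penalty value of -1.0 is a placeholder, since the exact penalty is not given here.

```python
import numpy as np

def inter_anchor_truncated_advantages(advantages, collided, collision_penalty=-1.0):
    """Truncate negative intra-anchor advantages to zero, then overwrite
    colliding samples with a hard negative penalty.

    advantages: (K, G) intra-anchor advantages A^{k,i}
    collided  : (K, G) boolean collision mask from the simulator
    """
    a = np.maximum(advantages, 0.0)   # keep only positive (improving) signal
    a[collided] = collision_penalty   # hard override for unsafe trajectories
    return a
```

Truncation means weaker modes are never pushed down relative to the globally best one (avoiding collapse onto a single intent), while the collision override still delivers an unconditional safety signal.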
5. Learning Algorithm and Objective Structure
The overall training objective sums the RL loss and a regularized imitation learning (IL) loss computed over the denoising-reconstruction and anchor-classification binary cross-entropy (BCE):

$$\mathcal{L} = \mathcal{L}_{\text{RL}} + \lambda\, \mathcal{L}_{\text{IL}}.$$

High-level learning proceeds as follows:
```
for epoch in 1..E_rl:
  for batch in dataset:
    features ← PerceptionNet(batch.sensors)
    for each anchor k:
      for i = 1..G:
        τ_T ← noisy_anchor(a^k; features)
        τ_0 ← RunTruncatedDiffusion(τ_T; θ, η=1, multiplicative_noise)
        r[k,i] ← SimulatorReward(τ_0)
    for each anchor k:
      compute A^{k,i} and then A_trunc^{k,i}
    L_RL ← sum_{k,i,t} [ -γ^{t-1} log π_θ(·) A_trunc^{k,i} ]
    L_IL ← imitation-learning loss on GT anchor
    θ ← θ − AdamW(∇θ[L_RL + λ L_IL])
```
Following RL-stage training, a two-stage mode selector is trained on frozen generator outputs for 20 epochs, using BCE and margin-rank losses.
6. Experimental Protocol and Metrics
Key experimental details include:
- Datasets: NAVSIM v1 (1,192 train, 136 test scenes) and NAVSIM v2 (with extended closed-loop metrics).
- Inputs: ResNet-34 backbone aligned on BEV-LiDAR plus three front cameras (1024×256).
- PDMS is the main closed-loop score, $\text{PDMS} = \text{NC} \times \text{DAC} \times \frac{5\,\text{EP} + 5\,\text{TTC} + 2\,\text{C}}{12}$, where NC: no-at-fault collisions, DAC: drivable-area compliance, EP: ego progress, TTC: time to collision, C: comfort.
- EPDMS, the extended score on v2, augments PDMS with additional compliance and comfort subscores.
- Diversity: average normalized pairwise waypoint distance over the $N$ candidate trajectories, $\mathrm{Div} = \frac{2}{N(N-1)} \sum_{i<j} \overline{\lVert \tau_i - \tau_j \rVert}$.
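The two metrics above can be sketched in a few lines of NumPy. The 5/5/2 weighting in `pdm_score` follows the standard NAVSIM PDM-score aggregation (assumed here, since the formula was not given verbatim), and the normalization constant in `trajectory_diversity` is left as a parameter because it is unspecified.

```python
import numpy as np

def pdm_score(nc, dac, ep, ttc, comfort):
    """Closed-loop PDM score: hard gates (NC, DAC) multiply a weighted
    average of the soft subscores (assumed NAVSIM weighting 5/5/2)."""
    return nc * dac * (5.0 * ep + 5.0 * ttc + 2.0 * comfort) / 12.0

def trajectory_diversity(trajs, norm=1.0):
    """Average pairwise waypoint distance across N candidate trajectories,
    divided by a normalization constant `norm` (placeholder assumption).

    trajs: (N, H, 2) candidate waypoint sequences.
    """
    n = len(trajs)
    dists = [np.linalg.norm(trajs[i] - trajs[j], axis=-1).mean()
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists)) / norm
```

Note how a single collision (NC = 0) or drivable-area violation (DAC = 0) zeroes the whole score, which is what makes PDMS a safety-gated metric.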
7. Performance Evaluation and Ablation Studies
DiffusionDriveV2 achieves state-of-the-art performance on both NAVSIM v1 and v2. Key results:
| Method | Div. | PDMS@1 | PDMS@5 | PDMS@10 |
|---|---|---|---|---|
| Transfuser | 0.1 | 85.7 | — | — |
| DiffusionDrive | 42.3 | — | — | 75.3 |
| DiffusionDriveV2 | 30.3 | 94.9 | 91.1 | 84.4 |
DiffusionDriveV2 attains 91.2 PDMS on NAVSIM v1 (+3.1 above DiffusionDrive, +2.9 above RL-based DIVER), and 85.5 EPDMS on NAVSIM v2.
Ablations on NAVSIM v1 (RL stage only) indicate:
- Additive vs. multiplicative exploration noise: PDMS 89.7 → 90.1
- Without Intra-Anchor GRPO: PDMS 89.2 → 90.1
- Without Inter-Anchor truncation: PDMS 89.5 → 90.1
This suggests that each RL component contributes to the final closed-loop safety and multimodal intent retention.
8. Implementation and Reproducibility Details
Reinforcement learning is conducted for 10 epochs with batch size 512, using the AdamW optimizer with a 10% warmup and cosine learning-rate decay. The minimum multiplicative noise standard deviation is set to 0.04. The selector stage uses the same optimizer configuration, 20 epochs, and data augmentation with multiplicative noise. Inference employs two denoising steps, as in DiffusionDrive. Code and models are publicly available at https://github.com/hustvl/DiffusionDriveV2.
In conclusion, DiffusionDriveV2 enforces RL constraints at both intra-anchor and inter-anchor levels with scale-adaptive noise and a two-stage selector, achieving an optimal trade-off between diversity and consistent trajectory quality. It sets new state-of-the-art scores on two closed-loop NAVSIM benchmarks with strong multimodality and rigorous safety (Zou et al., 8 Dec 2025).