nuPlan Benchmark: Autonomous Driving Evaluation
- The nuPlan Benchmark is a standardized closed-loop evaluation suite used to test motion planning algorithms in autonomous driving with realistic scenarios and composite metrics.
- It draws on over 1,200 hours of diverse real-world driving data from four major metropolitan areas, featuring multi-modal sensor recordings and detailed maneuver annotations.
- The framework supports various planning approaches, integrates innovative reactive agent modeling, and utilizes rigorous metrics such as collision rate, time-to-collision, and composite scores to advance research.
The nuPlan Benchmark is a large-scale, standardized closed-loop motion planning and evaluation suite for autonomous driving, designed to test planners on real-world data with realistic agent interactions, diverse scenarios, and rigorous composite metrics. It provides a framework for apples-to-apples comparison of planning algorithms, enables diagnostic insight into planner robustness, and has catalyzed research into imitation learning, reinforcement learning, and multi-agent traffic simulation. Initially built around real logs from four metropolitan regions, nuPlan has evolved to incorporate increasingly interactive and realistic agent modeling, culminating in recent extensions with generative simulation and learned traffic agents.
1. Dataset Composition, Scenario Taxonomy, and Annotation Protocol
nuPlan comprises 1282–1500 hours of human driving data collected in four major cities (Las Vegas, Boston, Pittsburgh, Singapore) under varying conditions (urban cores, intersections, left- and right-hand traffic, pick-up/drop-off zones (PUDOs), night/day, dry/wet). The raw dataset is composed of multi-modal sensor recordings (LiDAR 10–20 Hz, multi-camera 10–20 Hz, GPS/IMU 100 Hz, vehicle CAN), together with precise object tracks (autolabeled bounding boxes, kinematic states) for six classes, high-fidelity HD maps, and traffic-light state per lane connector (Karnchanachari et al., 7 Mar 2024, Caesar et al., 2021).
Scenarios are categorized using a taxonomy of 73–75 distinct types, mined via SQL-like filters and atomic primitives from logged data (e.g., lane changes, merges, unprotected turns, stationary-in-traffic, near multiple vehicles). These are stratified to construct evaluation splits—most notably “Val14” and “Test14-Hard”—for holding out diverse and rare maneuvers (Karnchanachari et al., 7 Mar 2024, Zhang et al., 9 Apr 2025). Each scenario is annotated with binary indicators, a novelty/rarity score (η_i = −log f_{s(i)}, where f_{s(i)} is the empirical frequency of scenario i's type), and a risk score (e.g., minimum time-to-collision to any other participant).
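The rarity score η_i = −log f_{s(i)} can be computed directly from scenario-type frequencies. A minimal sketch (function and variable names are illustrative, not part of the nuPlan devkit):

```python
import math
from collections import Counter

def rarity_scores(scenario_types):
    """Compute eta_i = -log f_{s(i)} for each scenario, where f_{s(i)} is
    the empirical frequency of scenario i's type in the dataset."""
    counts = Counter(scenario_types)
    total = len(scenario_types)
    return [-math.log(counts[t] / total) for t in scenario_types]

# Rare maneuvers score higher than common ones:
types = ["stationary"] * 8 + ["unprotected_turn"] * 2
scores = rarity_scores(types)
# scores[0]  = -log(0.8) ~ 0.22 (common "stationary")
# scores[-1] = -log(0.2) ~ 1.61 (rare "unprotected_turn")
```

Stratified splits such as Test14-Hard can then oversample high-η_i scenarios.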
Annotation quality is quantified via detection metrics (F1 score, AMOTA, ID-switch rate) and global box refinement, with nuPlan’s autolabeling outperforming baseline trackers (Karnchanachari et al., 7 Mar 2024).
2. Closed-Loop Simulator Architecture, Agent Modeling, and Planning Protocol
The nuPlan simulator converts static driving logs into a reactive, closed-loop planning environment. At each 0.1 s tick, the planner ingests a history buffer, map context, and current traffic agent states, outputs a future ego trajectory (typically 8 s horizon, 10 Hz), which is tracked by an LQR-based controller and forward-integrated via a kinematic bicycle model. Other agents are simulated either by log replay (non-reactive), rule-based heuristics (IDM car-following), or learned models (Caesar et al., 2021, Cheng et al., 2023, Karnchanachari et al., 7 Mar 2024).
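The ego-update step described above can be sketched as a forward-Euler kinematic bicycle model advancing at the 0.1 s tick; the wheelbase value and state layout here are illustrative assumptions, not nuPlan's exact vehicle parameters:

```python
import math
from dataclasses import dataclass

@dataclass
class EgoState:
    x: float        # rear-axle position [m]
    y: float        # [m]
    heading: float  # yaw [rad]
    velocity: float # longitudinal speed [m/s]

def bicycle_step(state, accel, steering_angle, wheelbase=3.1, dt=0.1):
    """One 0.1 s forward-Euler step of a kinematic bicycle model.
    The controller's (accel, steering_angle) commands come from tracking
    the planner's trajectory; `wheelbase` is an illustrative value."""
    x = state.x + state.velocity * math.cos(state.heading) * dt
    y = state.y + state.velocity * math.sin(state.heading) * dt
    heading = state.heading + (state.velocity / wheelbase) * math.tan(steering_angle) * dt
    velocity = state.velocity + accel * dt
    return EgoState(x, y, heading, velocity)

# Roll out an 8 s horizon (80 ticks) of straight driving at constant speed.
s = EgoState(0.0, 0.0, 0.0, 10.0)
for _ in range(80):
    s = bicycle_step(s, accel=0.0, steering_angle=0.0)
```

In the benchmark the commands come from an LQR tracker rather than being set directly, but the integration step is the same shape.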
Key agent models:
- IDM (Intelligent Driver Model): Default reactive background traffic (longitudinal car-following). Its limitations include lack of lateral negotiation, insensitivity to cut-in/cut-out, and unrealistic yielding/braking, causing overestimation of planner safety/comfort (Hagedorn et al., 16 Oct 2025, Peng et al., 13 Nov 2025).
- SMART: Transformer-based learned traffic agent, discretizes history/future into tokens, predicts next-action tokens using causal self-attention, trained with token-noise augmentation and Waymo/nuPlan data. Integration as drop-in background agent renders simulation human-like and more challenging (Hagedorn et al., 16 Oct 2025).
- Diffusion-based multi-agent simulation (nuPlan-R): Noise-decoupled, tokenized agents using Nexus Diffusion Transformer; interaction-aware agent selection computes agent relevance, restricting forward simulation to top-k interacting vehicles—agents outside this set use log-replay (Peng et al., 13 Nov 2025).
- Bench2Drive-R: Latent diffusion renderer for sensor-level simulation, decoupling world-model from image generation, ensuring temporal-spatial coherence via retrieval and ControlNet (You et al., 11 Dec 2024).
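The IDM car-following rule that drives the default reactive background is a closed-form acceleration law. A standard-form sketch, with illustrative parameter values rather than nuPlan's configured defaults:

```python
import math

def idm_accel(v, v_lead, gap, v0=15.0, T=1.5, a_max=1.5, b=2.0, s0=2.0, delta=4.0):
    """Standard Intelligent Driver Model: acceleration of a follower at speed
    v, behind a leader at speed v_lead with bumper-to-bumper gap [m].
    v0: desired speed, T: time headway, a_max/b: accel/comfortable decel,
    s0: jam distance. All parameter values here are illustrative."""
    dv = v - v_lead  # closing speed
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)

# Free road: near-maximal acceleration; closing fast on a tight gap: braking.
a_free = idm_accel(v=5.0, v_lead=5.0, gap=1e9)
a_tight = idm_accel(v=14.0, v_lead=5.0, gap=5.0)
```

The purely longitudinal form makes the limitations noted above concrete: nothing in the law reacts to lateral motion, so cut-ins only register once they change the leader and gap.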
The simulation loop enforces feedback between ego and background agents, with state updates and agent rollouts reflecting planner decisions.
3. Evaluation Metrics: Composite Scores and Sub-Metrics
nuPlan establishes a layered metric protocol for quantitative assessment:
- General metrics:
- Collision rate: average at-fault collisions per scenario
- Off-road rate: fraction of time ego leaves drivable polygon
- Time-To-Collision (TTC): minimum time margin to other agents
- Drivable-area, speed-limit, and driving-direction compliance
- Comfort: penalizing acceleration, jerk, and yaw rates beyond fixed kinematic thresholds
- Route progress: normalized distance along expert route (Karnchanachari et al., 7 Mar 2024, Zhang et al., 9 Apr 2025, Caesar et al., 2021)
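The TTC metric above asks how soon the ego would reach another agent at current speeds. A simplified one-dimensional stand-in (nuPlan projects trajectories rather than using raw gaps; names here are illustrative):

```python
def time_to_collision(gap, closing_speed):
    """1-D time-to-collision: time until the gap closes at the current
    closing speed; infinite if the agents are separating."""
    if closing_speed <= 0.0:
        return float("inf")
    return gap / closing_speed

def min_ttc(ego_speed, agents):
    """Minimum TTC over surrounding agents, each given as (gap_m, speed_mps)
    along the ego's path -- a simplified stand-in for nuPlan's projected TTC."""
    return min(time_to_collision(gap, v_agent)
               for gap, v_agent in ((g, s) for g, s in agents)) if False else \
           min(time_to_collision(gap, ego_speed - v) for gap, v in agents)

ttc = min_ttc(ego_speed=12.0, agents=[(30.0, 10.0), (20.0, 4.0), (50.0, 15.0)])
# The 20 m gap closing at 8 m/s dominates: min TTC = 2.5 s.
```

A scenario passes the TTC check when this minimum stays above a threshold throughout the rollout.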
Aggregate closed-loop score (CLS) for a scenario combines multiplier penalties with a weighted sum: CLS = (∏_{m∈H} p_m) · (∑_{w∈S} λ_w s_w) / (∑_{w∈S} λ_w), where H is the set of hard multipliers (no collisions, drivable area, direction, progress) with penalties p_m, and S is the set of soft metrics (TTC, route completion, speed, comfort) with weights λ_w and scores s_w (Jaeger et al., 24 Apr 2025, Cheng et al., 22 Apr 2024).
- Scenario-specific metrics: Lane-change min-gap, intersection right-of-way agreement rate, pedestrian/cyclist interaction velocity.
- Extended metrics (nuPlan-R): Success Rate (SR), All-Core Pass Rate (PR), interaction fidelity (trajectory distance measures such as Fréchet distance), diversity score (behavioral mode clustering) (Peng et al., 13 Nov 2025).
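The layered protocol above reduces to a simple computation per scenario: hard multipliers gate a weighted average of soft metrics. A sketch with illustrative weights (not nuPlan's official values):

```python
def closed_loop_score(hard, soft, weights):
    """Composite score: product of hard multipliers (gates such as
    no-at-fault-collision and drivable-area compliance) times a weighted
    average of soft metrics in [0, 1] (TTC, route completion, speed, comfort).
    Weight values used below are illustrative, not nuPlan's official ones."""
    gate = 1.0
    for value in hard.values():
        gate *= value
    total_w = sum(weights[k] for k in soft)
    weighted = sum(weights[k] * soft[k] for k in soft) / total_w
    return 100.0 * gate * weighted

hard = {"no_collision": 1.0, "drivable_area": 1.0, "direction": 1.0, "progress": 1.0}
soft = {"ttc": 1.0, "route_completion": 0.9, "speed": 1.0, "comfort": 0.8}
weights = {"ttc": 5.0, "route_completion": 5.0, "speed": 4.0, "comfort": 2.0}
score = closed_loop_score(hard, soft, weights)
# (5*1.0 + 5*0.9 + 4*1.0 + 2*0.8) / 16 = 0.94375 -> score 94.375
```

Because the hard terms multiply, a single at-fault collision zeroes the scenario score regardless of how comfortable or progressive the drive was.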
4. Planner Types, Algorithmic Advances, and Benchmark Results
nuPlan supports rule-based, hybrid, imitation-learning, and reinforcement-learning planners.
- Rule-based (IDM, PDM-Closed): Benchmarks for comfort and safety; PDM-Closed optimized for CLS (92–93, Val14) (Cheng et al., 22 Apr 2024, Jaeger et al., 24 Apr 2025, Yang et al., 9 Feb 2024).
- Imitation learning (PlanTF, PLUTO, CAFE-AD): PlanTF uses minimal history and State Dropout Encoder (SDE); PLUTO employs contrastive losses and post-processing; CAFE-AD integrates adaptive pruning and cross-scenario interpolation to boost robustness in rare scenarios (Cheng et al., 2023, Cheng et al., 22 Apr 2024, Zhang et al., 9 Apr 2025).
- Reinforcement learning (CaRL): Proximal Policy Optimization (PPO) with single-term route-completion reward, achieving 91.3 NR–CLS/90.6 R–CLS, matching rule-based baselines; distributed training scales to 500M samples (Jaeger et al., 24 Apr 2025).
- Test-time optimization (Diffusion-ES): Evolutionary search sampling trajectories from a diffusion model, mutation via truncated denoising, scoring with non-differentiable nuPlan reward; matches PDM-Closed on Driving Score (92 Val14) and excels at instruction-shaped tasks (Yang et al., 9 Feb 2024).
- MPC with adaptive world models (AdaptiveDriver): A GCNN-based BehaviorNet predicts city- and log-specific IDM parameters for agents, which are rolled out inside the MPC cost minimization, enabling scenario-conditioned planning and cross-city generalization (Vasudevan et al., 15 Jun 2024).
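Several of the planners above (PDM-Closed, Diffusion-ES, MPC variants) share a sample-and-score pattern: generate candidate trajectories, reject those violating hard constraints, and keep the best-scoring survivor. A hedged sketch in which `hard_checks` and `soft_score` are assumed callables, not nuPlan devkit APIs:

```python
def select_trajectory(candidates, hard_checks, soft_score):
    """Sample-and-score selection: discard candidates failing any hard
    check (e.g., predicted collision, off-road), then return the feasible
    candidate with the best soft score. Returns None if nothing survives;
    real planners would fall back to an emergency maneuver."""
    feasible = [c for c in candidates if all(check(c) for check in hard_checks)]
    if not feasible:
        return None
    return max(feasible, key=soft_score)

# Toy example: candidates encoded as (progress_m, min_clearance_m) tuples.
candidates = [(50.0, 0.2), (40.0, 2.5), (45.0, 1.8)]
hard_checks = [lambda c: c[1] > 0.5]  # reject near-collision candidates
best = select_trajectory(candidates, hard_checks, soft_score=lambda c: c[0])
# The highest-progress candidate (50 m) is infeasible, so (45.0, 1.8) wins.
```

Diffusion-ES differs mainly in how `candidates` are produced (mutating survivors via truncated denoising across iterations), while PDM-Closed enumerates IDM policy variants.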
Selected Val14 Closed-Loop Performance (NR CLS = non-reactive closed-loop score, R CLS = reactive closed-loop score; SR/PR reported under the nuPlan-R protocol)
| Planner | NR CLS | R CLS | SR (%) | PR (%) | Time (ms) |
|---|---|---|---|---|---|
| PDM-Closed | 92.8 | 92.1 | 98.1 | 90.3 | 104 |
| CaRL | 91.3 | 90.6 | 97.4 | 91.3 | 14 |
| PLUTO | 93.2 | 92.1 | 98.3 | 93.7 | 237 |
| Diffusion-ES | 92.0 | 92.0 | — | — | — |
| PlanTF | 86.5 | 80.6 | — | — | — |
Rule-based and hybrid planners see performance degradation when exposed to learned, reactive backgrounds (nuPlan-R, SMART), while learning-based planners show greater generalization and adaptability under these harder simulation conditions (Peng et al., 13 Nov 2025, Hagedorn et al., 16 Oct 2025).
5. Extensions: Multi-Agent Reactivity and Sensor-Level Simulation
nuPlan-R and Bench2Drive-R mark a shift from passive or simplistic agent modeling to fully interactive multi-agent simulations:
- nuPlan-R: Diffusion-based, noise-decoupled tokens enable adaptive agent rollouts; interaction-aware agent selection optimizes compute, focusing on high-impact vehicles. Evaluation includes SR/PR, fidelity, and diversity. Rule-based planners’ performance drops; learning-based planners display improved robustness in edge-cases (Peng et al., 13 Nov 2025).
- Bench2Drive-R: Latent diffusion ControlNet renderer generates synthetic, temporally/spatially consistent sensor streams, allowing closed-loop evaluation with visual perception integration. Integrates with nuPlan’s behavioral controller; raises R-CLS by ≈2 points over baseline naive renderers (You et al., 11 Dec 2024).
6. Limitations, Diagnostic Insights, and Future Directions
Shortcomings of early nuPlan challenges include passive IDM backgrounds, metric dependence, and long-tail scenario imbalance (≈45% “stationary” frames vs. <5% dynamic interactions) (Zhang et al., 9 Apr 2025, Hagedorn et al., 16 Oct 2025).
Recent diagnostic insights:
- Compounding errors from history-based imitation models; SDE and minimal pose input mitigate shortcut learning (Cheng et al., 2023, Cheng et al., 22 Apr 2024).
- Hybrid planners excel at open-loop metrics, but degrade in closed-loop realism (sim-to-real gap with SMART/nuPlan-R agents) (Hagedorn et al., 16 Oct 2025, Peng et al., 13 Nov 2025).
- Scenario augmentation and cross-scenario adaptation are critical to overcome overfitting in rare edge-cases (Zhang et al., 9 Apr 2025).
- Closed-loop training, reward alignment with benchmark CLS, and multi-agent sensitivity offer substantial gains (Jaeger et al., 24 Apr 2025, Vasudevan et al., 15 Jun 2024).
Forward-looking research includes adversarial scenario generation, LLM-guided agent modeling, online adaptation to new behavioral distributions, and richer multi-modal sensor simulation (Peng et al., 13 Nov 2025, You et al., 11 Dec 2024).
7. Impact and Standardization in Autonomous Driving Research
nuPlan has established itself as the canonical planning benchmark for academic and industry research in autonomous driving, precisely because of its:
- Large-scale, scenario-comprehensive data and annotation (Caesar et al., 2021, Karnchanachari et al., 7 Mar 2024).
- Closed-loop evaluation enforcing feedback between ego and background agents.
- Rigorous, composite metric protocol accounting for safety, comfort, legality, and interactive robustness (Cheng et al., 22 Apr 2024, Peng et al., 13 Nov 2025).
- Open-source ecosystem stimulating planner comparison, diagnostic critique, and advancement of learning-based, hybrid, and generative agent modeling (Hagedorn et al., 16 Oct 2025, Peng et al., 13 Nov 2025, You et al., 11 Dec 2024).
Recent extensions (nuPlan-R, SMART, Bench2Drive-R) are shifting nuPlan from simulation with rule-based agents toward high-fidelity, multi-agent, reactive, and sensor-aware simulation, improving benchmarking integrity and accelerating progress toward deployable autonomous planning systems.