
nuPlan Benchmark: Autonomous Driving Evaluation

Updated 6 December 2025
  • The nuPlan Benchmark is a standardized closed-loop evaluation suite used to test motion planning algorithms in autonomous driving with realistic scenarios and composite metrics.
  • It harnesses over 1200 hours of diverse real-world driving data from major metropolitan areas, featuring multi-modal sensor recordings and detailed maneuver annotations.
  • The framework supports various planning approaches, integrates innovative reactive agent modeling, and utilizes rigorous metrics such as collision rate, time-to-collision, and composite scores to advance research.

The nuPlan Benchmark is a large-scale, standardized closed-loop motion planning and evaluation suite for autonomous driving, designed to test planners on real-world data with realistic agent interactions, diverse scenarios, and rigorous composite metrics. It provides a framework for apples-to-apples comparison of planning algorithms, enables diagnostic insight into planner robustness, and has catalyzed research into imitation learning, reinforcement learning, and multi-agent traffic simulation. Initially built around real logs from four metropolitan regions, nuPlan has evolved to incorporate increasingly interactive and realistic agent modeling, culminating in recent extensions with generative simulation and learned traffic agents.

1. Dataset Composition, Scenario Taxonomy, and Annotation Protocol

nuPlan comprises 1282–1500 hours of human driving data collected in four major cities (Las Vegas, Boston, Pittsburgh, Singapore) under varying conditions (urban cores, intersections, left/right traffic, PUDOs, night/day, dry/wet). The raw dataset is composed of multi-modal sensor recordings (LiDAR 10–20 Hz, multi-camera 10–20 Hz, GPS/IMU 100 Hz, vehicle CAN), together with precise object tracks (autolabeled bounding boxes, kinematic states) for six classes, high-fidelity HD maps, and traffic-light state per lane connector (Karnchanachari et al., 7 Mar 2024, Caesar et al., 2021).

Scenarios are categorized using a taxonomy of 73–75 distinct types, mined via SQL-like filters and atomic primitives from logged data (e.g., lane changes, merges, unprotected turns, stationary-in-traffic, near multiple vehicles). These are stratified to construct evaluation splits, most notably “Val14” and “Test14-Hard”, for holding out diverse and rare maneuvers (Karnchanachari et al., 7 Mar 2024, Zhang et al., 9 Apr 2025). Each scenario is annotated with binary indicators, a novelty/rarity score ($\eta_i = -\log f_{s(i)}$, where $f_{s(i)}$ is the frequency of the scenario type $s(i)$ of scenario $i$), and a risk score (e.g., minimum time-to-collision to any other participant).
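The rarity score above can be illustrated with a minimal sketch. It assumes that $f_{s(i)}$ is the empirical frequency of a scenario's type within the evaluated set; the function name and the example labels are hypothetical and are not part of the nuPlan devkit.

```python
import math
from collections import Counter


def rarity_scores(scenario_types):
    """Return eta_i = -log f_{s(i)} for every scenario i.

    f_s is taken here to be the empirical frequency of scenario type s
    in the given list (an assumption made for illustration).
    """
    counts = Counter(scenario_types)
    total = len(scenario_types)
    return [-math.log(counts[s] / total) for s in scenario_types]


# Hypothetical type labels: 6 common, 3 less common, 1 rare scenario.
types = (["stationary_in_traffic"] * 6
         + ["lane_change"] * 3
         + ["unprotected_left_turn"])
scores = rarity_scores(types)

# Rarer scenario types receive larger scores.
rare, medium, common = scores[-1], scores[6], scores[0]
```

Because the score is a negative log-frequency, a type that makes up 10% of the set scores higher than one that makes up 60%, which is what makes it usable for stratifying splits toward rare maneuvers.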

Annotation quality is quantified via detection metrics (F1 score, AMOTA, ID-switch rate) and global box refinement, with nuPlan’s autolabeling outperforming baseline trackers (Karnchanachari et al., 7 Mar 2024).

2. Closed-Loop Simulator Architecture, Agent Modeling, and Planning Protocol

The nuPlan simulator converts static driving logs into a reactive, closed-loop planning environment. At each 0.1 s tick, the planner ingests a history buffer, map context, and current traffic agent states, and outputs a future ego trajectory (typically an 8 s horizon at 10 Hz). The trajectory is tracked by an LQR-based controller and forward-integrated via a kinematic bicycle model. Other agents are simulated either by log replay (non-reactive), rule-based heuristics (IDM car-following), or learned models (Caesar et al., 2021, Cheng et al., 2023, Karnchanachari et al., 7 Mar 2024).
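As a rough illustration of the forward-integration step, the sketch below advances an ego state by one 0.1 s tick with a rear-axle kinematic bicycle model and explicit Euler integration. The `EgoState` class, the `bicycle_step` function, and the 3.0 m wheelbase are assumptions chosen for the example; they do not reproduce the nuPlan devkit's classes or vehicle parameters, and the LQR tracker that would supply `accel` and `steer` is omitted.

```python
import math
from dataclasses import dataclass


@dataclass
class EgoState:
    x: float        # rear-axle x position [m]
    y: float        # rear-axle y position [m]
    heading: float  # yaw angle [rad]
    speed: float    # longitudinal speed [m/s]


WHEELBASE_M = 3.0  # illustrative value, not taken from nuPlan
DT_S = 0.1         # simulation tick from the text


def bicycle_step(s, accel, steer, dt=DT_S, wheelbase=WHEELBASE_M):
    """One explicit-Euler step of the kinematic bicycle model."""
    return EgoState(
        x=s.x + s.speed * math.cos(s.heading) * dt,
        y=s.y + s.speed * math.sin(s.heading) * dt,
        heading=s.heading + s.speed * math.tan(steer) / wheelbase * dt,
        speed=max(0.0, s.speed + accel * dt),  # no reversing in this sketch
    )


# Drive straight at 10 m/s for 1 s (10 ticks).
state = EgoState(0.0, 0.0, 0.0, 10.0)
for _ in range(10):
    state = bicycle_step(state, accel=0.0, steer=0.0)
```

In a closed-loop run this step would be called once per tick with the controller's commands, so planner errors accumulate in the ego state rather than being reset to the logged pose.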

Key agent models:

  • IDM (Intelligent Driver Model): Default reactive background traffic (longitudinal car-following). Its limitations include lack of lateral negotiation, insensitivity to cut-in/cut-out, and unrealistic yielding/braking, causing overestimation of planner safety/comfort (Hagedorn et al., 16 Oct 2025, Peng et al., 13 Nov 2025).
  • SMART: Transformer-based learned traffic agent, discretizes history/future into tokens, predicts next-action tokens using causal self-attention, trained with token-noise augmentation and Waymo/nuPlan data. Integration as drop-in background agent renders simulation human-like and more challenging (Hagedorn et al., 16 Oct 2025).
  • Diffusion-based multi-agent simulation (nuPlan-R): Noise-decoupled, tokenized agents using Nexus Diffusion Transformer; interaction-aware agent selection computes agent relevance, restricting forward simulation to top-k interacting vehicles—agents outside this set use log-replay (Peng et al., 13 Nov 2025).
  • Bench2Drive-R: Latent diffusion renderer for sensor-level simulation, decoupling world-model from image generation, ensuring temporal-spatial coherence via retrieval and ControlNet (You et al., 11 Dec 2024).
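To make the IDM limitation concrete, here is a minimal sketch of the standard Intelligent Driver Model acceleration law. The parameter values are illustrative defaults chosen for the example, not the settings used by nuPlan, and `idm_acceleration` is a hypothetical helper name.

```python
import math


def idm_acceleration(v, gap, dv,
                     v0=15.0,    # desired speed [m/s] (illustrative)
                     T=1.5,      # desired time headway [s] (illustrative)
                     a_max=1.0,  # maximum acceleration [m/s^2] (illustrative)
                     b=2.0,      # comfortable deceleration [m/s^2] (illustrative)
                     s0=2.0,     # minimum standstill gap [m] (illustrative)
                     delta=4.0):
    """Standard IDM longitudinal acceleration.

    v   : own speed [m/s]
    gap : bumper-to-bumper distance to the leader [m], must be > 0
    dv  : approach rate, own speed minus leader speed [m/s]
    """
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)


free_road = idm_acceleration(v=0.0, gap=1e9, dv=0.0)   # close to a_max
closing_in = idm_acceleration(v=10.0, gap=10.0, dv=10.0)  # strong braking
```

The inputs are only the follower's speed, the gap, and the approach rate to a single leader in the same lane. Nothing in the formula reacts to lateral motion, which is consistent with the cut-in and negotiation weaknesses listed above.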

The simulation loop enforces feedback between ego and background agents, with state updates and agent rollouts reflecting planner decisions.
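The idea of restricting reactive simulation to the top-k interacting vehicles can be sketched as follows. The relevance measure used by nuPlan-R is not specified in this article, so the example substitutes plain Euclidean distance to the ego vehicle as a stand-in; `select_reactive_agents` and the agent identifiers are hypothetical.

```python
import math


def select_reactive_agents(ego_xy, agents, k):
    """Split agents into a reactive set and a log-replay set.

    ego_xy : (x, y) position of the ego vehicle
    agents : dict mapping agent id -> (x, y) position
    k      : number of agents to simulate reactively

    Relevance is approximated by distance to ego (an assumption made
    for this sketch, not the nuPlan-R criterion).
    """
    ranked = sorted(agents, key=lambda aid: math.dist(ego_xy, agents[aid]))
    return ranked[:k], ranked[k:]


agents = {"a": (5.0, 0.0), "b": (50.0, 0.0), "c": (1.0, 1.0), "d": (20.0, 5.0)}
reactive_ids, replay_ids = select_reactive_agents((0.0, 0.0), agents, k=2)
# reactive_ids -> ["c", "a"]; replay_ids -> ["d", "b"]
```

Agents in the first list would be rolled out by the learned model each tick, while the rest follow their logged trajectories, which bounds the per-tick compute.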

3. Evaluation Metrics: Composite Scores and Sub-Metrics

nuPlan establishes a layered metric protocol for quantitative assessment:

  • General metrics:
    • Collision rate: average at-fault collisions per scenario
    • Off-road rate: fraction of time ego leaves drivable polygon
    • Time-To-Collision (TTC): minimum time margin to other agents
    • Drivable-area, speed-limit, and driving-direction compliance
    • Comfort: penalizing excessive acceleration/jerk (e.g., $\mathrm{Comfort} = 1 - \sqrt{\frac{1}{T}\sum_{t=1}^{T} a_t^2}$)
    • Route progress: normalized distance along expert route (Karnchanachari et al., 7 Mar 2024, Zhang et al., 9 Apr 2025, Caesar et al., 2021)
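Two of the sub-metrics above can be sketched in a few lines. The comfort function implements the example formula from the list as written, and the time-to-collision function assumes a simplified one-dimensional, constant-velocity setting (gap divided by closing speed). Both helper names are hypothetical, and neither is claimed to match the devkit's implementation.

```python
import math


def comfort_score(accels):
    """Comfort = 1 - sqrt(mean(a_t^2)) over a trajectory of accelerations.

    Follows the example formula in the text; the value can become
    negative when the RMS acceleration exceeds 1.
    """
    rms = math.sqrt(sum(a * a for a in accels) / len(accels))
    return 1.0 - rms


def min_ttc(gaps, closing_speeds):
    """Minimum time-to-collision over a sequence of timesteps.

    gaps           : distance to the other agent at each step [m]
    closing_speeds : rate at which the gap shrinks at each step [m/s]
    Steps where the gap is not shrinking are ignored; returns math.inf
    if the agents never approach each other.
    """
    ttcs = [g / c for g, c in zip(gaps, closing_speeds) if c > 0]
    return min(ttcs) if ttcs else math.inf


smooth = comfort_score([0.3, 0.3, 0.3, 0.3])        # about 0.7
worst_ttc = min_ttc([20.0, 10.0, 5.0], [5.0, 5.0, -1.0])  # 2.0 s
```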

The aggregate closed-loop score (CLS) for a scenario is computed via weighted sums and multiplier penalties:

$$\mathrm{CLS} = \left(\prod_{m \in M} \mathrm{score}_m\right) \times \frac{\sum_{w \in W} \mathrm{weight}_w \, \mathrm{score}_w}{\sum_{w \in W} \mathrm{weight}_w}$$

where $M$ is the set of hard multipliers (no collisions, drivable area, direction, progress) and $W$ is the set of soft metrics (TTC, route completion, speed, comfort) (Jaeger et al., 24 Apr 2025, Cheng et al., 22 Apr 2024).
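A minimal sketch of this aggregation is given below. The metric names follow the sets $M$ and $W$ described above, while the numeric weights and scores are illustrative assumptions rather than values quoted from the benchmark; `closed_loop_score` is a hypothetical helper.

```python
def closed_loop_score(multipliers, weighted):
    """Combine hard multipliers and weighted soft metrics into one score.

    multipliers : dict name -> score in [0, 1]; all are multiplied together
    weighted    : dict name -> (weight, score in [0, 1])
    """
    product = 1.0
    for score in multipliers.values():
        product *= score
    total_weight = sum(w for w, _ in weighted.values())
    weighted_avg = sum(w * s for w, s in weighted.values()) / total_weight
    return product * weighted_avg


# Illustrative weights and scores (assumptions for this example).
soft = {
    "ttc": (5.0, 1.0),
    "route_completion": (5.0, 0.8),
    "speed": (4.0, 1.0),
    "comfort": (2.0, 1.0),
}
clean = {"no_collision": 1.0, "drivable_area": 1.0,
         "direction": 1.0, "progress": 1.0}
crashed = dict(clean, no_collision=0.0)

cls_clean = closed_loop_score(clean, soft)      # 15 / 16 = 0.9375
cls_crashed = closed_loop_score(crashed, soft)  # 0.0
```

The multiplicative term is what makes the hard constraints dominant: a single at-fault collision zeroes the scenario score regardless of how well the soft metrics are satisfied.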

  • Scenario-specific metrics: Lane-change min-gap, intersection right-of-way agreement rate, pedestrian/cyclist interaction velocity.
  • Extended metrics (nuPlan-R): Success Rate (SR), All-Core Pass Rate (PR), interaction fidelity (trajectory $\ell_2$/Fréchet distance), diversity score (behavioral mode clustering) (Peng et al., 13 Nov 2025).
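The two trajectory distances named under interaction fidelity can be sketched as follows. This shows a mean pointwise $\ell_2$ distance and the standard discrete Fréchet distance between two 2-D polylines; how nuPlan-R aggregates these into its fidelity metric is not described here, so the example only illustrates the underlying distances. Function names are hypothetical.

```python
import math


def mean_l2(p, q):
    """Mean pointwise Euclidean distance between equally sampled paths."""
    return sum(math.dist(a, b) for a, b in zip(p, q)) / len(p)


def discrete_frechet(p, q):
    """Discrete Frechet distance via dynamic programming.

    ca[i][j] is the smallest achievable 'leash length' for traversing
    p[0..i] and q[0..j] monotonically.
    """
    n, m = len(p), len(q)
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = math.dist(p[i], q[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)
            else:
                best_prev = min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[-1][-1]


# A simulated path running parallel to the logged path, offset by 1 m.
logged = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
simulated = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
l2 = mean_l2(logged, simulated)            # 1.0
fd = discrete_frechet(logged, simulated)   # 1.0
```

Unlike the pointwise mean, the Fréchet distance does not require the two paths to be sampled at matching timestamps, which matters when a reactive agent follows the same route at a different speed than in the log.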

4. Planner Types, Algorithmic Advances, and Benchmark Results

nuPlan supports rule-based, hybrid, imitation-learning, and reinforcement-learning planners.

Selected Val14 Closed-Loop Performance Table

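| Planner      | NR-CLS | R-CLS | SR (%) | PR (%) | Time (ms) |
|--------------|--------|-------|--------|--------|-----------|
| PDM-Closed   | 92.8   | 92.1  | 98.1   | 90.3   | 104       |
| CaRL         | 91.3   | 90.6  | 97.4   | 91.3   | 14        |
| PLUTO        | 93.2   | 92.1  | 98.3   | 93.7   | 237       |
| Diffusion-ES | 92.0   | 92.0  | -      | -      | -         |
| PlanTF       | 86.5   | 80.6  | -      | -      | -         |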

Rule-based and hybrid planners degrade when exposed to learned, reactive background agents (nuPlan-R, SMART), whereas learning-based planners generalize and adapt better under this harder simulation (Peng et al., 13 Nov 2025, Hagedorn et al., 16 Oct 2025).

5. Extensions: Multi-Agent Reactivity and Sensor-Level Simulation

nuPlan-R and Bench2Drive-R mark a shift from passive or simplistic agent modeling to fully interactive multi-agent simulations:

  • nuPlan-R: Diffusion-based, noise-decoupled tokens enable adaptive agent rollouts; interaction-aware agent selection optimizes compute, focusing on high-impact vehicles. Evaluation includes SR/PR, fidelity, and diversity. Rule-based planners’ performance drops; learning-based planners display improved robustness in edge-cases (Peng et al., 13 Nov 2025).
  • Bench2Drive-R: Latent diffusion ControlNet renderer generates synthetic, temporally/spatially consistent sensor streams, allowing closed-loop evaluation with visual perception integration. Integrates with nuPlan’s behavioral controller; raises R-CLS by ≈2 points over baseline naive renderers (You et al., 11 Dec 2024).

6. Limitations, Diagnostic Insights, and Future Directions

Shortcomings of early nuPlan challenges include passive IDM backgrounds, metric dependence, and long-tail scenario imbalance (≈45% “stationary” frames vs. <5% dynamic interactions) (Zhang et al., 9 Apr 2025, Hagedorn et al., 16 Oct 2025).

Forward-looking research includes adversarial scenario generation, LLM-guided agent modeling, online adaptation to new behavioral distributions, and richer multi-modal sensor simulation (Peng et al., 13 Nov 2025, You et al., 11 Dec 2024).

7. Impact and Standardization in Autonomous Driving Research

nuPlan has established itself as the canonical planning benchmark for academic and industry research in autonomous driving.

Recent extensions (nuPlan-R, SMART, Bench2Drive-R) are shifting nuPlan from simulation with rule-based agents toward high-fidelity, multi-agent, reactive, and sensor-aware simulation, improving benchmarking integrity and accelerating progress toward deployable autonomous planning systems.
