NuPlan: Autonomous Driving Closed-Loop Benchmark
- nuPlan is a closed-loop planning benchmark and dataset featuring long-horizon planning and reactive simulation of complex urban driving scenarios.
- It incorporates 1,300+ hours of multi-city driving logs, diverse sensor data, and a detailed scenario taxonomy for comprehensive performance assessment.
- The benchmark enables rigorous evaluation and comparative analysis of rule-based, hybrid, and learning-based planners by simulating realistic multi-agent interactions.
nuPlan is a large-scale, closed-loop planning benchmark and dataset for autonomous driving that enables rigorous evaluation of motion planning algorithms under realistic, reactive simulation of urban traffic environments. In contrast to earlier motion forecasting datasets focused on short-term, open-loop prediction, nuPlan supports long-horizon planning, agent interaction, and aggregated scenario-based metrics designed to reveal both the strengths and failure modes of machine learning–based, rule-based, and hybrid planners. Its simulation and scoring protocols have driven a research shift toward generalization, robustness, and real-world deployment of planning systems.
1. Benchmark Composition and Scenario Taxonomy
nuPlan comprises 1,300+ hours of ego-centric driving logs collected across four cities (Las Vegas, Boston, Pittsburgh, Singapore), covering diverse urban geometries and local driving cultures (Karnchanachari et al., 7 Mar 2024, Caesar et al., 2021). Its sensor suite includes five LiDARs (20 Hz), eight surround-view cameras (10 Hz), GNSS/IMU (20 Hz/100 Hz), and high-definition semantic maps with vector-layer representations (lanes, connectors, drivable areas, crosswalks, etc.). Maps and object tracks are exhaustively labeled for six agent classes (vehicles, pedestrians, cyclists, barriers, cones, generic), and object detection/tracking annotations are produced primarily via auto-labeling pipelines based on MVF++ detection and Kalman-filter tracking.
Scenarios are algorithmically mined using atomic event primitives (e.g., intersection entry, high lateral acceleration, dense traffic vicinity) and composed into 73 scenario types, spanning both common and rare events (lane changes, left turns, following, near-miss, pick-up/drop-offs (PUDOs), etc.) (Karnchanachari et al., 7 Mar 2024). Dedicated splits (Train/Val/Test14, including “Test14-Hard” for rare/long-tail events) are provided with strict temporal and spatial separation to assess both within-domain and cross-domain generalization (Karnchanachari et al., 7 Mar 2024, Sun et al., 21 Oct 2024).
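The mining approach can be illustrated with a toy sketch in which atomic primitives are evaluated per log window and composed into scenario tags; the primitive functions, thresholds, and composite tag name below are hypothetical illustrations, not nuPlan's official taxonomy:

```python
import numpy as np

# Hypothetical atomic event primitives: each maps a short log window to a bool.
def high_lateral_accel(window, threshold=2.0):
    """Flag windows whose peak lateral acceleration exceeds `threshold` (m/s^2)."""
    return np.max(np.abs(window["lat_accel"])) > threshold

def dense_traffic(window, min_agents=8):
    """Flag windows with many nearby agents."""
    return window["num_nearby_agents"].max() >= min_agents

PRIMITIVES = {"high_lateral_accel": high_lateral_accel, "dense_traffic": dense_traffic}

def mine_scenarios(windows):
    """Compose primitives into scenario tags for each log window."""
    tags = []
    for w in windows:
        active = {name for name, fn in PRIMITIVES.items() if fn(w)}
        # Composite scenario types are conjunctions of primitives.
        if {"high_lateral_accel", "dense_traffic"} <= active:
            active.add("aggressive_maneuver_in_traffic")
        tags.append(sorted(active))
    return tags

window = {"lat_accel": np.array([0.5, 2.5, 1.0]),
          "num_nearby_agents": np.array([3, 9, 7])}
print(mine_scenarios([window]))
```

In this spirit, rare composite types can be surfaced at scale without manual labeling, which is what makes the long-tail splits such as Test14-Hard feasible.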
2. Closed-Loop Simulation and Reactive Agent Modeling
At the core of nuPlan is its closed-loop simulation apparatus. Rather than static, open-loop evaluation (where predicted trajectories are scored but never executed), nuPlan “puts the planner in the loop” by integrating proposed ego trajectories through a kinematic or dynamic bicycle model, with optional PID/LQR trajectory tracking (Caesar et al., 2021, Karnchanachari et al., 7 Mar 2024). Non-ego agents can operate in one of three modes:
- Log-Replay: Background agents exactly replay their logged trajectories.
- Reactive IDM: The Intelligent Driver Model (IDM) governs longitudinal acceleration, supplemented with simple MOBIL-style heuristics for lane-changing (Hagedorn et al., 16 Oct 2025, Karnchanachari et al., 7 Mar 2024).
- Learned Reactive Agents: More recent extensions (nuPlan-R; SMART) replace IDM with diffusion-based (Peng et al., 13 Nov 2025) or transformer-based (Hagedorn et al., 16 Oct 2025) multi-agent policies, which can capture lane changes, adversarial maneuvers, and multi-modal interactions unavailable to rule-based models.
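The IDM referenced above has a standard closed form for longitudinal acceleration; a minimal sketch follows, with typical textbook parameter values rather than nuPlan's exact configuration (in practice the output would also be clamped to physical limits):

```python
import math

def idm_acceleration(v, v_lead, gap,
                     v0=15.0,    # desired speed (m/s)
                     s0=2.0,     # minimum standstill gap (m)
                     T=1.5,      # desired time headway (s)
                     a_max=1.5,  # maximum acceleration (m/s^2)
                     b=2.0,      # comfortable deceleration (m/s^2)
                     delta=4.0): # free-flow acceleration exponent
    """Intelligent Driver Model: longitudinal acceleration given a lead vehicle."""
    dv = v - v_lead  # closing speed toward the lead vehicle
    # Desired dynamic gap: grows with speed and with the closing rate.
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)

# Free road (huge gap): accelerates toward the desired speed.
print(idm_acceleration(v=10.0, v_lead=10.0, gap=1e6))
# Tailgating a slower lead vehicle: brakes hard.
print(idm_acceleration(v=12.0, v_lead=6.0, gap=8.0))
```

The purely longitudinal nature of this model is exactly why IDM agents need bolt-on lane-change heuristics, and why they cannot reproduce the richer interactive behaviors of the learned agents above.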
Simulation operates at 10 Hz: all agent states are updated in response to the simulated ego, and all metrics are recomputed from the resulting traces. This enables direct quantification of planner safety, comfort, efficiency, and resilience in dynamic, interactive traffic, addressing limitations of open-loop datasets such as nuScenes (Hallgarten et al., 11 Apr 2024).
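The ego-propagation step can be sketched with a standard kinematic bicycle model; the wheelbase and control values below are illustrative, not nuPlan's defaults:

```python
import math
from dataclasses import dataclass

@dataclass
class EgoState:
    x: float        # position (m)
    y: float        # position (m)
    heading: float  # yaw (rad)
    v: float        # speed (m/s)

def step_bicycle(state, accel, steering, wheelbase=3.0, dt=0.1):
    """Propagate the ego one 10 Hz tick with a kinematic bicycle model."""
    x = state.x + state.v * math.cos(state.heading) * dt
    y = state.y + state.v * math.sin(state.heading) * dt
    heading = state.heading + state.v * math.tan(steering) / wheelbase * dt
    v = max(0.0, state.v + accel * dt)
    return EgoState(x, y, heading, v)

# Closed-loop rollout: the planner's commands are actually executed,
# so errors compound instead of being reset to the logged pose each tick.
state = EgoState(0.0, 0.0, 0.0, 10.0)
for _ in range(10):  # 1 s of simulation at 10 Hz
    state = step_bicycle(state, accel=0.5, steering=0.0)
print(state)
```

The key difference from open-loop evaluation is visible in the loop: the next observation depends on the previously executed command, so a planner is scored on the states its own decisions produce.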
3. Evaluation Metrics and Scoring Protocols
nuPlan defines comprehensive, scenario-based scoring metrics designed to stress different planning objectives (Karnchanachari et al., 7 Mar 2024, Dauner et al., 2023, Sun et al., 21 Oct 2024):
- Open-Loop Score (OLS): Average and final displacement error (ADE, FDE) over 8 s, heading errors (AHE/FHE), and miss rate (MR), relative to expert human trajectories. Used primarily for imitation-learning assessment.
- Closed-Loop Scores (CLS): Weighted averages of progress (fraction of route completed), safety-related violations (collision, off-road, wrong direction), comfort (acceleration/jerk), speed-limit adherence, and minimum time-to-collision (TTC). “Multiplier” penalties gate the sub-scores: any collision, off-road departure, or failure to make progress zeros the score for that scenario (Karnchanachari et al., 7 Mar 2024, Hagedorn et al., 16 Oct 2025).
- Extended Metrics (nuPlan-R): Success Rate (SR; fraction of scenarios without catastrophic failure) and All-Core Pass Rate (PR; proportion of scenarios scoring above 0.5 on every submetric) for robust, balanced score reporting (Peng et al., 13 Nov 2025).
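The displacement components of OLS can be sketched as follows; the 2 m miss threshold used here is a common convention and may differ from the official implementation:

```python
import numpy as np

def open_loop_errors(pred, expert, miss_threshold=2.0):
    """ADE, FDE, and miss flag for a predicted trajectory vs. the expert log.

    pred, expert: (T, 2) arrays of (x, y) waypoints over the horizon.
    """
    dists = np.linalg.norm(pred - expert, axis=1)  # per-step displacement
    ade = float(dists.mean())                      # average displacement error
    fde = float(dists[-1])                         # final displacement error
    missed = fde > miss_threshold                  # miss if the endpoint is too far off
    return ade, fde, missed

# Expert drives straight for 8 s at 10 m/s (sampled at 10 Hz);
# the prediction carries a constant 0.5 m lateral offset.
expert = np.stack([np.linspace(0, 80, 81), np.zeros(81)], axis=1)
pred = expert + np.array([0.0, 0.5])
ade, fde, missed = open_loop_errors(pred, expert)
print(ade, fde, missed)  # 0.5 0.5 False
```

Averaging the miss flag over many scenarios yields the miss rate (MR); heading errors (AHE/FHE) follow the same average/final pattern applied to yaw instead of position.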
The official evaluation protocol cycles planners through open-loop, non-reactive closed-loop (log replay), and reactive closed-loop (IDM/nuPlan-R/SMART) tests, with “Test14” and “Test14-Hard” evaluating domain and tail-case generalization, respectively (Sun et al., 21 Oct 2024, Zhang et al., 9 Apr 2025).
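The multiplier-gated CLS aggregation can be sketched as follows; only the gating rule (collision, off-road, failure to make progress) comes from the protocol description, while the submetric names and weights are illustrative stand-ins for the official constants:

```python
def closed_loop_score(submetrics):
    """Weighted CLS for one scenario, with hard multiplier gates.

    submetrics: dict of per-scenario scores in [0, 1] plus boolean gate flags.
    Any triggered gate zeros the scenario score outright.
    """
    gates = ("collision", "off_road", "no_progress")
    if any(submetrics[g] for g in gates):
        return 0.0
    # Illustrative weights; the official scorer uses its own weighting.
    weights = {"progress": 5.0, "ttc": 5.0, "speed_limit": 4.0, "comfort": 2.0}
    total = sum(weights.values())
    return sum(w * submetrics[k] for k, w in weights.items()) / total

clean = {"collision": False, "off_road": False, "no_progress": False,
         "progress": 0.9, "ttc": 1.0, "speed_limit": 1.0, "comfort": 0.8}
crashed = dict(clean, collision=True)
print(closed_loop_score(clean))    # high weighted average
print(closed_loop_score(crashed))  # 0.0 (gated by the collision)
```

The gating is what gives CLS its bite: a planner that is comfortable and fast on average but collides once in a scenario earns nothing for that scenario.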
4. Planner Paradigms and Empirical Benchmarks
nuPlan supports a spectrum of planner designs and has driven benchmark comparisons across them:
- Rule-Based: IDM-based planners with HD map–centerline following set a robust baseline, dominating safety and compliance scores on simple and non-reactive scenarios (Dauner et al., 2023, Sun et al., 21 Oct 2024).
- Hybrid: PDM-Closed/PDM-Hybrid fuse rule-based proposal generation with lightweight neural refinement, securing state-of-the-art closed-loop performance on Val14 and Test14-Hard (CLS-R ≈ 92, OLS ≈ 84) (Dauner et al., 2023, Sun et al., 21 Oct 2024).
- Imitation/Reinforcement Learning: UrbanDriver, PlanCNN, and STR2 (MoE Transformer) achieve low open-loop errors but often struggle to generalize in closed-loop, interaction-heavy settings unless explicitly trained for reactivity (Sun et al., 21 Oct 2024, Zhang et al., 9 Apr 2025).
- Generative/Prediction-Driven: Diffusion-ES and recent diffusion planners conduct reward-guided (gradient-free) search for trajectories backed by learned diffusion models, matching or exceeding rule-based SOTA, and uniquely supporting black-box, non-differentiable, or instruction-based reward shaping (Yang et al., 9 Feb 2024, Steiner et al., 3 Dec 2025).
- Cognitive/Tokenized: Configurations such as belief–intent token models (TIWM) explore minimal, semantically rich representations that enable strong open-loop planning with as few as 16 tokens (Sang, 30 Oct 2025).
The competitive landscape has shown that rule-based and hybrid planners are extremely difficult to surpass in core nuPlan metrics, but as simulation realism and scenario diversity increase (e.g., via nuPlan-R or SMART agents), learning-based and closed-loop–trained planners recover much of the gap, especially in non-trivial, highly interactive edge cases (Hagedorn et al., 16 Oct 2025, Peng et al., 13 Nov 2025, Hallgarten et al., 11 Apr 2024).
5. Impact of Reactive Agent Realism
Emerging research demonstrates that evaluation with simplistic rule-based agents (IDM) substantially overestimates planner robustness and conceals failure modes, particularly for learning-based approaches that do not reason about rich multi-agent interactions (Peng et al., 13 Nov 2025, Hagedorn et al., 16 Oct 2025). Integrating learned reactive agents, whether diffusion-based (nuPlan-R) or transformer-based (SMART), into nuPlan's closed-loop simulation yields:
- Reduced CLS for rule-based and hybrid planners, reflecting the increased need for negotiation, yielding, and cut-in response.
- Markedly improved discrimination for learned and generative planners, emphasizing closed-loop training and interaction modeling as critical for real-world deployment (Peng et al., 13 Nov 2025, Hagedorn et al., 16 Oct 2025).
- New standard metrics (SR, PR) that expose “spiky” planner performance and penalize solutions that focus only on progress or safety at the expense of holistic driving quality.
This trend motivates the community to retire IDM as the default reactive agent in favor of learned multi-agent simulation for fair and realistic evaluation.
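The nuPlan-R style robustness metrics can be sketched similarly; the 0.5 threshold follows the PR definition above, while the submetric names and scenario records are illustrative:

```python
def success_and_pass_rates(scenarios):
    """SR: fraction of scenarios without catastrophic failure.
    PR: fraction scoring above 0.5 on *every* submetric, which penalizes
    'spiky' planners that excel on one axis while collapsing on another.
    """
    n = len(scenarios)
    sr = sum(not s["catastrophic"] for s in scenarios) / n
    pr = sum(all(v > 0.5 for v in s["submetrics"].values()) for s in scenarios) / n
    return sr, pr

scenarios = [
    {"catastrophic": False, "submetrics": {"progress": 0.9, "safety": 0.8}},
    {"catastrophic": False, "submetrics": {"progress": 1.0, "safety": 0.4}},  # spiky
    {"catastrophic": True,  "submetrics": {"progress": 0.2, "safety": 0.0}},
]
print(success_and_pass_rates(scenarios))  # SR = 2/3, PR = 1/3
```

The middle scenario illustrates the distinction: it counts as a success (no catastrophe) yet fails the all-core pass because one submetric falls below threshold.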
6. Generalization, Long-Tail, and Dataset Limitations
Despite its scope, nuPlan’s original scenario set is biased toward “basic” lane-following and progress tasks, with limited coverage of rare or adversarial events (emergent obstacles, high-frequency merging, pedestrian deviation) (Hallgarten et al., 11 Apr 2024). Even the Test14-Hard split, while more challenging, captures only a small slice of the real-world long-tailed distribution. As such, planners scoring near-perfectly on Val14 or closed-loop CLS may still fail under rare conditions or when evaluated via “interPlan” (purpose-built edge-case) splits, where all methods see substantial drops (e.g., PDM-Closed: 92 → 42) (Hallgarten et al., 11 Apr 2024, Sun et al., 21 Oct 2024).
Recent work addresses these deficiencies via:
- Cross-Scenario Adaptive modules (CAFE-AD) that prune or interpolate features to avoid overfitting dominant scenarios and explicitly improve robustness on “Test14-Hard” (Zhang et al., 9 Apr 2025).
- Large-scale scaling of model/data (“STR2”; testing on 1B+ scenes) to demonstrate that performance continues to improve given broader data and larger capacity, though high-fidelity simulation and annotation remain rate-limiting (Sun et al., 21 Oct 2024).
- Occupancy-centric extensions (Nuplan-Occ) to enable 4D generative modeling of scene semantics, supporting downstream perception and planning in multi-modal domains (Li et al., 27 Oct 2025).
A plausible implication is that future iterations and external benchmarks such as interPlan will be required to further stress generalization, rare-event response, and dynamic agent diversity.
7. Software Ecosystem, Extensibility, and Downstream Applications
The nuPlan-devkit provides Python APIs for scenario loading, sensor data access, simulation, and metrics computation, supporting both offline planning research and closed-loop, batch-evaluated benchmarks (Karnchanachari et al., 7 Mar 2024). All major innovation directions—reactive agent upgrades, scenario tagging, metric expansion, scene-generation pipelines (UniScenev2), and training support—are maintained as extensible submodules.
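A minimal planner skeleton in the spirit of the devkit's planner abstraction is shown below; the class and method names are a simplified sketch for illustration, not the exact nuplan-devkit API:

```python
from abc import ABC, abstractmethod

class Planner(ABC):
    """Simplified stand-in for the devkit's abstract planner interface."""

    @abstractmethod
    def compute_trajectory(self, observation):
        """Return a list of (x, y, heading) waypoints over the planning horizon."""

class ConstantVelocityPlanner(Planner):
    """Trivial baseline: continue straight at the current speed."""

    def __init__(self, dt=0.1, horizon_steps=80):
        self.dt = dt                      # 10 Hz waypoint spacing
        self.horizon_steps = horizon_steps  # 8 s horizon

    def compute_trajectory(self, observation):
        x, y, heading, v = observation["ego"]
        return [(x + v * self.dt * k, y, heading)
                for k in range(1, self.horizon_steps + 1)]

planner = ConstantVelocityPlanner()
traj = planner.compute_trajectory({"ego": (0.0, 0.0, 0.0, 10.0)})
print(len(traj), traj[0], traj[-1])  # 80 waypoints over an 8 s horizon
```

Because the simulator, agent models, and metrics consume only the planner's trajectory output, swapping in a rule-based, learned, or generative planner requires implementing just this one interface.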
Downstream, nuPlan and its extensions have catalyzed:
- End-to-end driving stack prototyping (sampling-based, imitation, generative, or RL training pipelines).
- Perception, occupancy, and video generation tasks (via Nuplan-Occ, UniScenev2), providing synthetic data for benchmarking and algorithmic development (Li et al., 27 Oct 2025).
- Continual open-source development and community challenges (e.g., annual NuPlan Challenge benchmark, released SMART and nuPlan-R agents).
The consensus is that nuPlan serves as the de facto reference for real-world, closed-loop motion planning evaluation, setting the baseline for both algorithmic innovation and cross-dataset/system generalization studies (Karnchanachari et al., 7 Mar 2024, Dauner et al., 2023, Sun et al., 21 Oct 2024, Peng et al., 13 Nov 2025, Zhang et al., 9 Apr 2025, Hagedorn et al., 16 Oct 2025).