NuPlan Benchmark for AV Planning
- NuPlan Benchmark is a comprehensive closed-loop evaluation platform that integrates high-fidelity, multi-city driving data, modular simulation, and scenario-based metrics.
- It enables fair and standardized assessment of ML-based planning algorithms by simulating interactive, long-horizon driving scenarios.
- Its metrics focus on traffic rule compliance, human-likeness, and vehicle dynamics, driving research into robust and safe autonomous vehicle decision-making.
NuPlan Benchmark is a closed-loop, machine-learning–based evaluation platform for autonomous vehicles (AVs) that integrates a large-scale, high-diversity real-world driving dataset, a modular simulation environment with support for agent interactions, and scenario-based, multi-faceted assessment metrics. It is designed to catalyze progress in long-term planning for AVs by addressing fundamental limitations of prior, open-loop–centric evaluation paradigms.
1. Motivation and Positioning
NuPlan is constructed to overcome the inadequacies of conventional motion prediction benchmarks, which predominantly rely on short-term, open-loop trajectory forecasting and L2-based metrics. Such metrics ignore the sequential, interactive nature of planning and fail to quantify goal achievement, rule compliance, and system-level robustness. NuPlan’s primary objectives are:
- To drive development and fair comparison of ML-based planning algorithms by providing a unified simulation and evaluation framework.
- To simulate the full feedback loop of ego-action and world evolution, including the impact of the ego's decisions on surrounding agents and overall scenario outcome.
- To foster a standard platform for evaluating both general and scenario-specific behavioral competencies over diverse and long driving episodes.
In contrast to existing motion forecasting datasets, NuPlan centers on closed-loop planning, emphasizes goal-directed evaluation, and expands assessment from simple L2 metrics toward holistic planning-relevant criteria (Caesar et al., 2021).
2. Dataset Composition and Features
The NuPlan dataset consists of 1500 hours of high-fidelity driving data spanning four cities (Las Vegas, Boston, Pittsburgh, Singapore). Coverage includes:
- City-scale geographic and traffic pattern diversity: e.g., Las Vegas with dense, multi-lane casino districts, Boston's double parking and turn precedence, Singapore’s left-hand traffic flow.
- Detailed sensory and map data: high-resolution lidar, camera imagery, localization, steering signals, semantic lane maps, and traffic light statuses inferred from agent trajectory analysis.
- Auto-labeled, globally refined 3D object tracks for multiple dynamic agents (vehicles, pedestrians, cyclists).
- Scenario tags for complex events such as merges, pedestrian crossings, and unprotected turns.
Uniquely among benchmarks, only a stratified subset of the sensor data (~128 hours) is released alongside the full 1500 hours of structured object tracks and semantic map data, because the raw sensor volume exceeds 200 TB (Karnchanachari et al., 7 Mar 2024). Scenario mining and a formal scenario taxonomy enable fine-grained, scenario-specific planning diagnostics.
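To make the scenario taxonomy concrete, the following minimal sketch counts mined scenarios per tag; the ScenarioRecord structure and the example tags are illustrative assumptions, not the actual nuPlan devkit schema.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class ScenarioRecord:
    """Illustrative stand-in for a mined nuPlan scenario entry (not the devkit schema)."""
    token: str                      # unique scenario identifier
    city: str                       # e.g., "las_vegas", "boston", "pittsburgh", "singapore"
    tags: frozenset = frozenset()   # e.g., frozenset({"unprotected_left_turn"})

def tag_histogram(records):
    """Count scenarios per tag, e.g., to check coverage of rare scenario types."""
    return Counter(tag for r in records for tag in r.tags)

# Example with two illustrative records.
records = [
    ScenarioRecord("a1", "boston", frozenset({"pedestrian_crossing"})),
    ScenarioRecord("b2", "las_vegas", frozenset({"unprotected_left_turn", "merge"})),
]
print(tag_histogram(records))
```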
3. Simulation and Evaluation Framework
NuPlan’s closed-loop simulator operates as a modular platform in which a submitted planning module interacts via standardized interfaces to propose ego trajectories at each simulation step. Key elements include:
- Closed-loop feedback: At each time step, the planner's proposed trajectory is tracked by a predefined controller (e.g., LQR) and the ego is propagated with a kinematic bicycle motion model; the resulting world state (updated ego pose, agents, and traffic light conditions) is fed back to the planner at the next step.
- Reactive and non-reactive agent support: Background vehicles can either replay their observed (logged) trajectories or act as reactive agents governed by rule-based policies such as the Intelligent Driver Model (IDM), emulating more realistic, interaction-rich traffic (Caesar et al., 2021).
- Interaction-aware rollout: The system records the full simulation evolution, permitting introspection of compounding errors, feedback-induced distribution shift, and planner-induced agent adaptation.
This structure enables rigorous assessment of long-term, interactive planning performance beyond the limitations of open-loop evaluation (Karnchanachari et al., 7 Mar 2024).
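To illustrate the closed-loop structure described above, the sketch below propagates the ego with a kinematic bicycle model after a tracker converts the planned trajectory into controls; the planner, tracker, and world objects are placeholders rather than nuPlan devkit interfaces, and the parameter values are assumptions.

```python
import math
from dataclasses import dataclass

@dataclass
class EgoState:
    x: float
    y: float
    heading: float
    velocity: float

def kinematic_bicycle_step(state, accel, steering_angle, wheelbase=3.1, dt=0.1):
    """Propagate the ego one step with a kinematic bicycle motion model."""
    x = state.x + state.velocity * math.cos(state.heading) * dt
    y = state.y + state.velocity * math.sin(state.heading) * dt
    heading = state.heading + (state.velocity / wheelbase) * math.tan(steering_angle) * dt
    velocity = max(0.0, state.velocity + accel * dt)
    return EgoState(x, y, heading, velocity)

def closed_loop_rollout(planner, tracker, world, ego, n_steps=150, dt=0.1):
    """Closed-loop rollout: plan -> track -> propagate -> let the world react -> repeat."""
    history = [ego]
    for _ in range(n_steps):
        trajectory = planner.plan(ego, world)          # planner proposes an ego trajectory
        accel, steer = tracker.track(ego, trajectory)  # controller (e.g., LQR) yields controls
        ego = kinematic_bicycle_step(ego, accel, steer, dt=dt)
        world = world.step(ego, dt)                    # agents and traffic lights update given the new ego
        history.append(ego)
    return history
```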
4. Metrics and Scoring Methodology
NuPlan employs a two-tiered hierarchy of evaluation metrics, each aligned to core planning competencies:
Core Metric Categories
- Traffic Rule Compliance: Collision rates, off-road incidents, minimum time gap, time to collision (TTC), and velocity profiles for overtaking or yielding.
- Human-Likeness and Trajectory Fidelity: L2 final goal error, velocity profile similarity, stop/lateral error, and comfort-related measures (acceleration, jerk, steering rate).
- Vehicle Dynamics and Comfort: Monitoring for oscillations, violation of dynamic constraints, and ride smoothness.
- Scenario-based Assessment: Fine-grained metrics for events such as lane changes (TTC to agents in target lane), unprotected turns (cross-traffic time windows), and multimodal human-agent interactions.
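As a hedged illustration of how a time-to-collision style metric can be computed, the sketch below assumes straight-line, constant-velocity motion of the ego and a lead agent; the actual nuPlan metric implementation may differ in geometry and edge-case handling.

```python
def time_to_collision(gap_m, ego_speed_mps, lead_speed_mps):
    """Constant-velocity TTC to a lead agent directly ahead of the ego.

    gap_m: bumper-to-bumper distance along the ego's path, in metres.
    Returns float('inf') if the ego is not closing the gap.
    """
    closing_speed = ego_speed_mps - lead_speed_mps
    if closing_speed <= 0.0:
        return float("inf")
    return gap_m / closing_speed

# Example: ego at 15 m/s, lead vehicle 30 m ahead at 10 m/s -> TTC = 6.0 s.
print(time_to_collision(30.0, 15.0, 10.0))
```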
Metric Aggregation
- Scenario-level scores are computed by combining “multiplier metrics” (e.g., binary pass/failure for collision avoidance, drivable area compliance) with weighted averages of continuous “average metrics”. The canonical formula is:
  scenario_score = (∏_i score_i) × (Σ_j weight_j · score_j) / (Σ_j weight_j)
  where i indexes binary/hard-constraint (multiplier) metrics and j indexes continuous comfort/progress terms (Karnchanachari et al., 7 Mar 2024); a minimal implementation sketch appears after this list.
- The final planner score is then obtained by averaging scenario scores over test splits—e.g., Val14, Test14, Test14–hard—with special emphasis on challenging, rare (“long-tail”) scenario generalization (Karnchanachari et al., 7 Mar 2024).
- As a concrete example of a trajectory-fidelity metric, the goal L2 error is the Euclidean distance between the planner's final position and the expert driver's final position, ‖p_planner(T) − p_expert(T)‖₂.
Aggregation weights and thresholds are community-driven and periodically refined (Caesar et al., 2021).
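The following is a minimal sketch of the scenario-score aggregation above, assuming multiplier metrics scored in {0, 1} and a dictionary of (weight, score) pairs for the average metrics; the weight values in the example are illustrative assumptions, not the official nuPlan configuration.

```python
def scenario_score(multiplier_scores, weighted_scores):
    """Combine hard-constraint (multiplier) metrics with a weighted average of soft metrics.

    multiplier_scores: iterable of scores in {0, 1} (e.g., no-collision, drivable-area compliance).
    weighted_scores:   dict mapping metric name -> (weight, score in [0, 1]).
    """
    multiplier = 1.0
    for s in multiplier_scores:
        multiplier *= s
    total_weight = sum(w for w, _ in weighted_scores.values())
    weighted_avg = sum(w * s for w, s in weighted_scores.values()) / total_weight
    return multiplier * weighted_avg

# Example with illustrative weights; a violated hard constraint (score 0) would zero the result.
score = scenario_score(
    multiplier_scores=[1, 1],                   # no collision, stayed in the drivable area
    weighted_scores={"progress": (5.0, 0.9),
                     "ttc_compliance": (5.0, 1.0),
                     "comfort": (2.0, 0.8)},
)
print(round(score, 3))  # 0.925
```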
5. Scenario Structure and Generalization
NuPlan scenarios are constructed to span both representative and rare events:
- Basic scenarios: Lane following, normal yielding, and speed adaptation in ordinary traffic flow.
- Complex/edge (“long-tail”) scenarios: Unprotected turns, interaction with double-parked vehicles, pedestrian crossings, complex merges. Edge cases are drawn from scenario mining and taxonomy formalized over 70+ types (Karnchanachari et al., 7 Mar 2024).
Despite this coverage, subsequent work has shown that benchmarks such as interPlan—explicitly engineered to stress “long-tail” generalization, e.g., construction zones or overtaking with oncoming traffic—surface additional gaps not present in nuPlan. In these settings, simple centerline-following planners often score highly on nuPlan but fail in complex, interactive episodes (Hallgarten et al., 11 Apr 2024). A plausible implication is that ongoing expansion and scenario enrichment remain necessary to maintain benchmark relevance as planner capabilities mature.
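To make the notion of a simple centerline-following planner concrete, the sketch below advances waypoints along a lane-centerline polyline at a fixed target speed; it is an illustrative baseline, not the specific planners evaluated in the cited works, and the parameter values are assumptions.

```python
import math

def centerline_planner(centerline_xy, ego_xy, target_speed=10.0, horizon_s=8.0, dt=0.5):
    """Naive baseline: follow the lane centerline at a constant target speed.

    centerline_xy: list of (x, y) waypoints densely sampled along the lane center.
    Returns waypoints roughly target_speed * dt apart, starting from the
    centerline point nearest the current ego position.
    """
    start = min(range(len(centerline_xy)),
                key=lambda i: math.dist(centerline_xy[i], ego_xy))
    step_m = target_speed * dt
    plan, travelled, i = [centerline_xy[start]], 0.0, start
    while travelled < target_speed * horizon_s and i + 1 < len(centerline_xy):
        travelled += math.dist(centerline_xy[i], centerline_xy[i + 1])
        i += 1
        if travelled >= step_m * len(plan):
            plan.append(centerline_xy[i])
    return plan

# Example: a straight 100 m centerline sampled every metre.
centerline = [(float(i), 0.0) for i in range(101)]
print(centerline_planner(centerline, ego_xy=(0.3, 0.2))[:3])  # [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]
```

Such a baseline can score well on routine nuPlan scenarios yet fails whenever deviating from the centerline (e.g., around a construction zone or a double-parked vehicle) is required.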
6. Implications for ML Planning Research and Benchmark Usage
NuPlan has become a foundational testbed for ML-based planning, enabling:
- Standardized, closed-loop evaluation of diverse algorithms, from pure imitation learning, diffusion-based, and reinforcement learning planners to hybrids that incorporate rule-based safety post-processing (Cheng et al., 2023, Zheng et al., 26 Jan 2025, Jaeger et al., 24 Apr 2025).
- Controlled assessment of generalization: Separating open-loop mimicry from long-horizon, feedback-driven planning, as highlighted by the “imitation gap” (differences between open- and closed-loop performance) (Cheng et al., 2023).
- Diagnosis of compounding error and distribution shift, exposing limitations of pure behavior cloning and motivating mitigation strategies such as state perturbation augmentation, attention-based input pruning, or cross-scenario feature interpolation (Zhang et al., 9 Apr 2025); a minimal perturbation sketch follows after this list.
- Comparative evaluation across baseline classes—constant-velocity, IDM/replay, full ML, and hybrid planners—on unified metrics, revealing the contextual performance gaps and safety margins essential for AV deployment (Karnchanachari et al., 7 Mar 2024).
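The sketch below gives a minimal version of the state perturbation augmentation mentioned in the list above, assuming the ego history is an array of (x, y, heading) states; the noise scales are illustrative assumptions.

```python
import numpy as np

def perturb_ego_history(history, xy_std=0.3, heading_std=0.05, rng=None):
    """Inject Gaussian noise into the ego's past (x, y, heading) states.

    Training an imitation-learned planner on perturbed histories exposes it to
    off-distribution states of the kind it will reach in closed loop, which can
    mitigate compounding error. Noise scales here are illustrative assumptions.
    """
    rng = rng if rng is not None else np.random.default_rng()
    noisy = np.asarray(history, dtype=float).copy()
    noisy[:, :2] += rng.normal(0.0, xy_std, size=noisy[:, :2].shape)   # perturb x, y
    noisy[:, 2] += rng.normal(0.0, heading_std, size=noisy.shape[0])   # perturb heading
    return noisy

# Example: perturb a 3-step history of (x, y, heading) states.
history = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.1, 0.02)]
print(perturb_ego_history(history))
```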
This unification of data, simulation, and standard evaluation accelerates algorithm development cycles and enables reproducible, peer-comparable progress reports across the field.
7. Technical Implementation and Community Evolution
- NuPlan supports containerized code submissions and exposes a modular simulation server for easy integration of custom planners.
- Motion models and tracking controllers (e.g., LQR, kinematic bicycle) approximate real vehicle response to planned trajectories, bridging the gap between simulated and deployed control (Karnchanachari et al., 7 Mar 2024).
- The open-source dataset and codebase are accessible via https://nuplan.org, facilitating broad community engagement and ongoing benchmark refinement.
Community-driven metric updates, scenario expansions, and competitive challenges (e.g., at NeurIPS and other venues) ensure that evaluation standards stay aligned with practical AV requirements (Caesar et al., 2021).
NuPlan has established itself as a critical resource for autonomous driving planning research by marrying high-diversity real-world datasets with interactive, closed-loop simulation and rigorous planning-centric metrics. This platform underpins advancements in learning-based planning, supports nuanced diagnostics of behavior and failure, and sets the stage for research into truly robust, safe, and human-like decision-making systems for real-world autonomous vehicles.