
Bench2Drive: E2E Driving Benchmark

Updated 4 December 2025
  • Bench2Drive is a large-scale, task-disentangled closed-loop benchmark that rigorously evaluates end-to-end autonomous driving performance using diverse, short-route scenarios in the CARLA simulator.
  • It integrates multi-modal sensor data and detailed route annotations to compute metrics like Driving Score, Success Rate, and Efficiency with reduced variance.
  • The benchmark has driven comparative studies of various architectures and methods, establishing a reproducible foundation for advancing robust autonomous driving research.

Bench2Drive is a large-scale, task-disentangled closed-loop benchmark for evaluating end-to-end autonomous driving (E2E-AD) systems. Developed to address the methodological and statistical limitations of prior open-loop and long-route protocols, it provides a comprehensive, fine-grained, and reproducible platform for assessing the proficiency, robustness, and efficiency of E2E-AD models across diverse interactive scenarios and operational domains. Bench2Drive has become the standard for reporting closed-loop E2E-AD performance in the CARLA simulator, underpinning comparative studies of architectures, learning paradigms, and real-time inference techniques.

1. Benchmark Motivation and Design Principles

Bench2Drive was introduced in response to the inadequacies of open-loop evaluation—as exemplified by nuScenes L₂ error or collision-rate metrics—and high-variance, long-route closed-loop protocols such as CARLA Town05Long or Leaderboard v2.0 (Jia et al., 6 Jun 2024). Open-loop metrics fail to capture closed-loop self-correction and compounding error, while single-scenario long routes induce high variance and conflate unrelated task competencies.

To remedy these weaknesses, Bench2Drive adopts:

  • Task disentanglement: Each route isolates a distinct interactive scenario (e.g., merging, overtaking, emergency braking, give-way, traffic-sign compliance), promoting granular ability analysis.
  • Short route design: Each of 220 routes (44 scenarios × 5 variants) is ≈150–600 m, strictly containing one core challenge, thereby smoothing driving scores and reducing metric variance.
  • Diversity and realism: Scenarios are distributed across 12 CARLA towns, 23 weather conditions, multiple traffic densities, and both day/night cycles, minimizing overfitting to specific geographies or climates.
  • Official, expert-driven data: Training clips are collected from a high-performance expert (“Think2Drive”) and cover ∼2 million frames with multi-sensor, multi-agent, and map annotation.

2. Dataset Composition and Scenarios

Bench2Drive’s official training set consists of approximately 10,000 short clips (~2M frames) at 10 Hz, split into miniature (10), base (1,000), and full (10,000) variants (Jia et al., 6 Jun 2024). Each data point contains:

  • Multi-modal sensors: Four-to-six surround RGB cameras (900×1600 px), LiDAR (64 channel, 85 m), radar, IMU/GNSS, and HD-map features for each route.
  • Frame-level annotations: 3D bounding boxes, semantic/instance masks, object/traffic signal states, high-level commands, and expert trajectory features.
  • Scenario taxonomy: 44 interactive scenarios distributed over the five primary abilities—Merging (16), Overtaking (9), Emergency Brake (12), Give Way (2), and Traffic Sign compliance (18, overlapping).
  • Evaluation split: 220 closed-loop routes; each scenario is instantiated five ways by varying the town, weather, or dynamic context, ensuring cross-domain task generalization.
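The per-frame annotation described above can be sketched as a minimal data schema. All field names here are illustrative assumptions for exposition, not the official Bench2Drive format; note that ~2M frames over ~10,000 clips at 10 Hz implies roughly 200 frames (~20 s) per clip.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical schema for one 10 Hz frame; field names are illustrative,
# not the official Bench2Drive annotation format.
@dataclass
class Box3D:
    category: str            # e.g. "vehicle", "pedestrian"
    center: List[float]      # (x, y, z) in the ego frame, metres
    size: List[float]        # (length, width, height), metres
    yaw: float               # heading, radians

@dataclass
class Frame:
    timestamp: float                      # seconds since clip start
    images: Dict[str, str]                # camera name -> image file path
    lidar_path: str                       # point-cloud file path
    boxes: List[Box3D] = field(default_factory=list)
    command: str = "LANE_FOLLOW"          # high-level navigation command
    expert_waypoints: List[List[float]] = field(default_factory=list)

# One clip: ~2M frames / ~10,000 clips = ~200 frames = 20 s at 10 Hz.
clip = [Frame(timestamp=i / 10.0, images={}, lidar_path="") for i in range(200)]
print(len(clip))   # 200
```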

3. Evaluation Protocol and Metrics

Bench2Drive mandates full closed-loop evaluation, where the candidate policy must generate low-level control or waypoints at each step, with the environment updating in response and perception re-injected at each loop (Jia et al., 6 Jun 2024, Sun et al., 13 Jun 2025).
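The closed-loop requirement can be illustrated as a minimal evaluation skeleton. The `env` and `policy` interfaces below are hypothetical stand-ins for exposition, not the actual CARLA/Bench2Drive API; the key point is that perception is re-injected after every control step.

```python
# Minimal closed-loop evaluation skeleton; `env`/`policy` are illustrative
# stand-ins, not the real CARLA/Bench2Drive interfaces.
def run_route(env, policy, max_steps=4000):
    obs = env.reset()                        # fresh sensor readings at route start
    info, infractions = {}, []
    for _ in range(max_steps):
        control = policy(obs)                # steer/throttle/brake or waypoints
        obs, done, info = env.step(control)  # world reacts; perception re-injected
        infractions.extend(info.get("infractions", []))
        if done:                             # route finished, failed, or timed out
            break
    return info.get("route_completion", 0.0), infractions

class ToyEnv:
    """Stand-in environment that completes its route after 5 steps."""
    def __init__(self):
        self.t = 0
    def reset(self):
        self.t = 0
        return {"speed": 0.0}
    def step(self, control):
        self.t += 1
        info = {"route_completion": min(self.t / 5, 1.0), "infractions": []}
        return {"speed": 5.0}, self.t >= 5, info

rc, infractions = run_route(ToyEnv(), policy=lambda obs: {"throttle": 0.5})
print(rc, infractions)   # 1.0 []
```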

Primary metrics:

  • Success Rate (SR):

\mathrm{SR} = \frac{N_{\mathrm{success}}}{N_{\mathrm{total}}} \times 100\%

Fraction of routes completed without major infractions (collision, off-road, timeout).

  • Driving Score (DS):

\mathrm{DS} = \frac{1}{N_{\mathrm{total}}} \sum_{i=1}^{N_{\mathrm{total}}} \left( \mathrm{RC}_i \cdot \prod_{j=1}^{K_i} p_{i,j} \right)

A per-route product of route completion \mathrm{RC}_i (fraction of route traversed) and multiplicative penalties p_{i,j} \in (0,1] for infractions (e.g., collisions: 0.5–0.6, red light: 0.7, timeout: 0.7).

  • Efficiency (EFF):

\mathrm{EFF} = \frac{\bar{v}_{\mathrm{ego}}}{\max(\text{speed limit},\, 1\,\mathrm{m/s})} \times 100\%

Measures ego speed relative to limits or surrounding agents.

  • Comfortness: Penalty-based score decreasing with excessive jerk, lateral/longitudinal acceleration, and high yaw rates.
  • Multi-Ability Scores: SR is computed separately for each of the five core task types to produce a “Mean Ability” metric, exposing method strengths/weaknesses by traffic skill (Jia et al., 6 Jun 2024, Sun et al., 13 Jun 2025).
  • Open-loop L₂ error is also reported (mean waypoint deviation over fixed horizons), but is not predictive of closed-loop robustness across most methods (Jia et al., 6 Jun 2024, Guo et al., 21 May 2025).

Metric smoothing: The design of multiplicative penalties on short, scenario-focused routes reduces the compounding effect of single mistakes, leading to statistically more stable DS and SR values for fair model comparison.
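The three headline metrics can be sketched from per-route results as follows. The penalty multipliers mirror the coefficients quoted above (collision 0.5, red light 0.7, timeout 0.7); the exact official implementation may differ in detail.

```python
from math import prod

# Infraction multipliers follow the coefficients quoted in the text.
PENALTY = {"collision": 0.5, "red_light": 0.7, "timeout": 0.7}

def driving_score(routes):
    """routes: list of (route_completion in [0, 1], [infraction names])."""
    per_route = [rc * prod(PENALTY[i] for i in infr) for rc, infr in routes]
    return 100.0 * sum(per_route) / len(routes)

def success_rate(routes):
    """A route counts as a success only if fully completed with no infractions."""
    return 100.0 * sum(rc == 1.0 and not infr for rc, infr in routes) / len(routes)

def efficiency(mean_ego_speed, speed_limit):
    """Mean ego speed relative to the limit, floored at 1 m/s."""
    return 100.0 * mean_ego_speed / max(speed_limit, 1.0)

routes = [(1.0, []), (1.0, ["red_light"]), (0.4, ["collision"])]
print(round(driving_score(routes), 2))   # (1.0 + 0.7 + 0.2) / 3 * 100 = 63.33
print(round(success_rate(routes), 2))    # only the first route succeeds: 33.33
```

Because each route is short and isolates one challenge, a single infraction multiplies down only that route's score rather than compounding across a long course, which is the variance-reduction effect described above.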

4. Comparative Results and Methodological Impact

Bench2Drive has become the primary closed-loop benchmarking suite for SOTA E2E-AD, enabling reproducible and granular comparisons across paradigms such as discriminative, generative, hybrid-diffusion, proposal-based, Mixture-of-Experts, and vision-language-action architectures.

Key results by method class (selected values from (Guo et al., 21 May 2025, Hu et al., 25 Nov 2025, Yin et al., 21 Nov 2025, Tang et al., 11 Mar 2025, Sun et al., 13 Jun 2025, Wan et al., 19 Jul 2025, Guo et al., 17 Oct 2025)):

| Method | Driving Score (DS) | Success Rate (SR) | EFF | Comfort | Mean Ability |
|---|---|---|---|---|---|
| AD-MLP | 18.05 | 0% | 48.45 | 22.63 | 0.87 |
| UniAD-Base | 45.81 | 16.36% | 129.21 | 43.58 | 15.6 |
| DriveTransformer | 63.46 | 35.01% | 100.64 | 20.78 | 38.6 |
| iPad | 65.02 | 35.91% | 161.31 | 28.21 | 42.6 |
| HiP-AD | 86.77 | 69.09% | 203.12 | 19.36 | 72.1 |
| CogAD | 48.30 | 24.00% | 142.00 | 40.37 | – |
| FocalAD | 45.77 | 17.30% | 174.01 | 20.53 | – |
| GEMINUS | 65.39 | 37.73% | – | – | 37.77 |
| RAP-ResNet | 66.42 | 37.27% | 165.47 | 23.63 | – |
| DiffRefiner | 87.1 | 71.4% | – | – | 69.0 |
| SimLingo | 85.1 | 67.3% | 259.2 | 33.7 | – |
| VDRive | 66.25 | 50.51% | 110.23 | 22.90 | 45.65 |
| ReasonPlan | 64.01 | 34.55% | 180.64 | – | – |

Notable findings: planning-centric, closed-loop-oriented designs such as HiP-AD and DiffRefiner lead both DS and SR by wide margins, while the open-loop AD-MLP baseline fails every route (0% SR), reinforcing that open-loop metrics are poor predictors of closed-loop competence.

5. Benchmark Impact: Analysis and Insights

Bench2Drive enables:

  • Ability disentanglement: Successes and failures can be traced to specific skills (e.g., a method scoring ≈20% on overtaking but only ≈17% on merging), which is crucial for diagnosing distributional weaknesses hidden by aggregate metrics.
  • Statistical stability: By isolating scenarios and reducing route length, Bench2Drive avoids the heavy-tailed DS variance of Town05Long/Longest6 protocols (Jia et al., 6 Jun 2024).
  • Diversity: The inclusion of rare, interactive, or adversarial cases (e.g., occluded merges, signal failures) ensures methods cannot game the benchmark by “mode averaging” or conservative policies.
  • Reproducible closed-loop testing: Model-level improvement and cross-paper comparison become statistically meaningful given low metric variance and consistent protocols.
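The ability-disentanglement point can be made concrete: because every route isolates one skill, per-ability success rates and their mean fall out of a simple grouping. The aggregation below is a sketch consistent with the Mean Ability description above, not the official scoring code.

```python
from collections import defaultdict

# The five core abilities named in the text.
ABILITIES = ("merging", "overtaking", "emergency_brake", "give_way", "traffic_sign")

def mean_ability(results):
    """results: iterable of (ability, success_bool), one entry per route."""
    by_skill = defaultdict(list)
    for ability, ok in results:
        by_skill[ability].append(ok)
    # Per-ability SR in percent, over abilities that were actually evaluated.
    per_skill = {a: 100.0 * sum(by_skill[a]) / len(by_skill[a])
                 for a in ABILITIES if by_skill[a]}
    return per_skill, sum(per_skill.values()) / len(per_skill)

per_skill, mean = mean_ability([("merging", True), ("merging", False),
                                ("overtaking", True)])
print(per_skill, mean)   # {'merging': 50.0, 'overtaking': 100.0} 75.0
```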

Bench2Drive’s initiative toward a comprehensive training resource—expert-collected, safety-filtered, and fully annotated—has fostered these advances. Studies have demonstrated that improvements in open-loop L₂ error do not necessarily translate into closed-loop DS/SR; only holistic planning-aware and scenario-aligned strategies show real driving gains (Guo et al., 21 May 2025, Sun et al., 13 Jun 2025).
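The open-loop/closed-loop disconnect can be quantified with a rank correlation over per-method results. The numbers below are synthetic and purely illustrative (not taken from any paper); the tie-free Spearman formula is standard.

```python
# Spearman rank correlation (no-ties case) between open-loop L2 error and
# closed-loop Driving Score; the data points here are synthetic examples.
def spearman(xs, ys):
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Synthetic example: the method with the best (lowest) L2 has a mediocre DS.
l2_error = [0.35, 0.60, 0.90, 1.20]   # lower is "better" open-loop
ds       = [40.0, 65.0, 45.0, 60.0]   # closed-loop Driving Score
print(spearman(l2_error, ds))          # 0.4: weak, despite a clean L2 ordering
```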

6. Role in Research on Efficient and Robust Inference

Bench2Drive is a preferred benchmark for evaluating both method-level advances and inference-time efficiency enhancements—particularly dynamic early-exit and transformer-layer sparsity strategies. For example, DeeAD demonstrates that action-space–grounded early exit in VLA planners can achieve up to 29% latency reduction and 28% transformer-layer sparsity with negligible loss in trajectory accuracy or safety (Hu et al., 25 Nov 2025). RAP shows that feature-aligned rasterization augments training for robustness and long-tail generalization, with measurable improvements in DS and SR over camera-only methods (Feng et al., 5 Oct 2025).
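The action-grounded early-exit idea can be sketched as follows: decode a trajectory after each transformer layer and stop once it stops changing. The layer and decoder internals below are toy placeholders, not DeeAD's actual architecture.

```python
# Toy sketch of action-space early exit; layers/decoder are placeholders,
# not the real DeeAD planner.
def plan_with_early_exit(layers, decode, x, tol=0.05):
    prev = None
    for depth, layer in enumerate(layers, start=1):
        x = layer(x)
        traj = decode(x)                    # waypoints implied by layer `depth`
        if prev is not None and max(abs(a - b) for a, b in zip(traj, prev)) < tol:
            return traj, depth              # plan converged: skip remaining layers
        prev = traj
    return prev, len(layers)

# Dummy 6-layer stack whose decoded plan stops changing after layer 3.
layers = [lambda x, s=min(i, 3): [float(s)] * len(x) for i in range(1, 7)]
decode = lambda x: [sum(x) / len(x)] * 4    # 4 identical waypoints
traj, used = plan_with_early_exit(layers, decode, [0.0] * 8)
print(used)   # 4: exits before the full 6 layers
```

Skipping the tail layers whenever the decoded plan has converged is what yields latency and layer-sparsity savings without moving the final trajectory.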

Methods leveraging explicit scenario/skill-level expertization (GEMINUS, DriveMoE), adversarial policy refinement (EvaDrive), or semantic reasoning chains (ReasonPlan) all validate their closed-loop claims on Bench2Drive, underscoring its status as the definitive E2E-AD benchmark for both accuracy and practical deployability (Wan et al., 19 Jul 2025, Yang et al., 22 May 2025, Jiao et al., 5 Aug 2025, Liu et al., 26 May 2025).

7. Limitations and Future Directions

Bench2Drive’s coverage is restricted to CARLA simulation; real-world generalization (sim2real), adverse weather, or extreme corner cases may be underrepresented (Wang et al., 27 May 2025). The reliance on expert-generated data may limit exposure to long-tail perturbations (e.g., rare vehicle types, extreme pedestrian flows). Emerging directions include integration with neural rendering frameworks for mixed-reality benchmarking (Bench2Drive-R), broader multi-sensor fusion, and richer metric formulations beyond multiplicative infraction products (You et al., 11 Dec 2024).

Potential improvements include incorporating dynamic multi-agent reaction (via generative behavior models), fine-grained stratification of scenario difficulty, and additional real-world sensor noise. Several works suggest augmenting the penalty function (DS) to mitigate overly conservative or mode-collapsed agent behaviors.
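One way to augment the penalty function along these lines is to multiply in an extra term for stalling, so that overly conservative agents cannot keep a high score simply by crawling. The 0.9 multiplier and 30%-of-limit threshold below are made-up values for the sketch, not proposed constants from any paper.

```python
# Illustrative augmented per-route score: standard multiplicative infraction
# penalties plus a (hypothetical) stall penalty for crawling agents.
def augmented_route_score(route_completion, penalties, mean_speed, speed_limit):
    score = route_completion
    for p in penalties:          # usual infraction multipliers in (0, 1]
        score *= p
    if mean_speed < 0.3 * speed_limit:   # assumption: crawling threshold
        score *= 0.9                     # assumption: mild stall penalty
    return score

# Red-light penalty (0.7) plus stalling at 2 m/s under a 10 m/s limit.
print(augmented_route_score(1.0, [0.7], mean_speed=2.0, speed_limit=10.0))  # 0.63
```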


Bench2Drive thus constitutes a rigorous and nuanced testbed for E2E-AD, providing the academic and industrial communities with a fair, comprehensive, and reproducible foundation for both capability benchmarking and algorithm development across a wide spectrum of driving-related tasks.
