Bench2Drive Autonomous Driving Benchmark

Updated 26 November 2025
  • Bench2Drive Benchmark is a standardized, closed-loop framework that evaluates end-to-end autonomous driving systems across varied scenarios, weather conditions, and urban layouts.
  • It utilizes a CARLA-based dataset with 2,000,000 annotated frames to ensure reproducible evaluations and detailed performance metrics for both macro- and meso-level tasks.
  • The suite enables hardware–software co-design by isolating model submodule bottlenecks and providing clear performance indicators such as throughput, latency, and energy per inference.

Bench2Drive is a standardized, closed-loop benchmarking suite for autonomous driving systems—particularly end-to-end autonomous driving (E2E-AD)—aimed at providing granular, multi-ability evaluation under diverse scenarios, weather conditions, and hardware configurations. It establishes a common data foundation, execution protocol, and metric system for comparing models, hardware accelerators, and integrated pipelines across both simulated and physical platforms.

1. Conceptual Framework and Objectives

Bench2Drive centers on fair, realistic, multi-level testing of E2E-AD systems. Its primary goals are: (i) enabling reproducible, use-case-controlled comparisons for vision-driven autonomous driving tasks; (ii) disentangling model skills across 44 interactive scenarios and 23 weather conditions on 12 distinct urban layouts; (iii) exposing model weaknesses not captured by open-loop or long-route benchmarks; and (iv) facilitating hardware–software co-design via standardized performance indicators. The framework targets both system-on-chip (SoC) and configurable IP-core accelerators, as well as vision-language models and multi-modal large models where applicable (Jia et al., 6 Jun 2024).

2. Dataset, Scenario Coverage, and Protocol

Dataset Composition

Bench2Drive provides a CARLA-based dataset with 2,000,000 annotated frames from 10,000 short clips, sampled uniformly over:

  • 44 interactive scenarios: including cut-in, overtaking, detour, merging, yield, emergency braking, traffic sign compliance, parked obstacles, pedestrian crossings, highway merges, accident blocking, and sequential lane changes.
  • 23 weather variants: ranging from sunny, foggy, rainy, and overcast to night and rare conditions.
  • 12 towns: urban, village, campus, and highway layouts.

Each frame records multi-modal sensor signals: 1× 64-beam LiDAR, 6× RGB cameras (900×1600, JPEG quality 20), 5× radar, IMU/GNSS vehicle states, ground-truth 3D bounding boxes, semantic/instance segmentation, HD maps, traffic light/sign states, and expert RL features.
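
For concreteness, a single annotated frame can be sketched as a Python record. This schema is hypothetical; the field names are assumptions for exposition, not the official Bench2Drive format:

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

# Illustrative only: field names are assumptions, not the official
# Bench2Drive schema. The sketch mirrors the sensor suite listed above.
@dataclass
class FrameRecord:
    lidar_points: np.ndarray          # (N, 4) x, y, z, intensity from the 64-beam LiDAR
    rgb_images: list[np.ndarray]      # 6 camera frames, each 900x1600x3 (JPEG quality 20)
    radar_returns: list[np.ndarray]   # 5 radar point sets
    imu_gnss: dict                    # ego pose, velocity, acceleration
    boxes_3d: np.ndarray              # (M, 7) ground-truth 3D bounding boxes
    semantic_seg: np.ndarray          # per-pixel class labels
    hd_map: dict                      # vectorized lanes / crossings
    traffic_state: dict               # traffic light and sign states
    expert_features: Optional[np.ndarray] = None  # RL expert features for distillation
```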

Evaluation Routes

The core protocol comprises 220 closed-loop routes (44 scenarios × 5 routes each, spanning different weathers and towns), each ≈150 m, designed to guarantee skill-specific attribution and to reduce metric variance relative to standard CARLA benchmarks.
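
A minimal sketch of how this route grid can be enumerated; scenario, weather, and town names are placeholders, and the weather/town assignment here is random rather than the benchmark's curated pairing:

```python
import random

SCENARIOS = [f"scenario_{i:02d}" for i in range(44)]  # placeholders for the 44 interactive scenarios
WEATHERS = [f"weather_{i:02d}" for i in range(23)]
TOWNS = [f"town_{i:02d}" for i in range(12)]

random.seed(0)
routes = []
for scenario in SCENARIOS:
    # Five ~150 m routes per scenario, each under a different weather/town
    # draw, so every skill is tested across varied conditions.
    for variant in range(5):
        routes.append({
            "scenario": scenario,
            "weather": random.choice(WEATHERS),
            "town": random.choice(TOWNS),
            "length_m": 150,
        })

assert len(routes) == 44 * 5 == 220
```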

Execution Steps

For each route, the agent consumes raw sensor input directly, generates a trajectory or low-level controls, and interacts with dynamic traffic agents. Termination is triggered by an infraction or successful route completion within a fixed time budget.
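
The execution loop can be sketched as follows; `env`, `agent`, and the method names are assumptions loosely modeled on the CARLA leaderboard agent interface, not Bench2Drive's actual API:

```python
# Hypothetical closed-loop driver; all names here are illustrative.
def run_route(env, agent, max_steps: int = 4000):
    sensors = env.reset()                      # spawn ego + traffic, return first sensor packet
    for step in range(max_steps):              # fixed time budget
        control = agent.run_step(sensors)      # raw sensors in, steering/throttle/brake out
        sensors, events = env.tick(control)    # advance simulator, collect infraction events
        if events.get("infraction") or events.get("route_complete"):
            break                              # terminate on infraction or completion
    return env.compute_metrics()               # SR, DS, RC, comfort, skill scores
```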

3. Twofold Benchmark Granularity and Hardware Evaluation

Bench2Drive formally evaluates both macro-level (entire model/task pipeline) and meso-level (feature extractor submodule) performance:

  • Macro-level tasks: Semantic segmentation (FCN-8s + VGG-16₀.₂₅ backbone, 1920×1080); object detection (SSD + VGG-16₀.₂₅, 1920×1080); action recognition (CNN-LSTM, 120×80 pedestrian patches).
  • Meso-level (submodule): Four architectures: VGG-16₀.₂₅ (vanilla convolutions), SqueezeNet (fire modules), MobileNet_v2 (depthwise/inverted residuals), SparseNet-40 (dense blocks).

The meso-level protocol isolates submodule bottlenecks by benchmarking cutoff points (e.g., the conv5_3 layer in VGG-16₀.₂₅), allowing fine-grained cross-validation across heterogeneous hardware backends (Runge et al., 2020).
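
A minimal sketch of the cutoff idea in PyTorch, using torchvision's stock VGG-16 as a stand-in, since the 0.25-width variant is not available off the shelf:

```python
import time
import torch
from torchvision.models import vgg16

# Meso-level sketch: benchmark only the feature extractor up to conv5_3.
model = vgg16(weights=None).features[:30].eval()  # layers up to conv5_3 + ReLU

x = torch.randn(1, 3, 1080, 1920)  # full-HD input, matching the macro-level tasks
with torch.no_grad():
    model(x)                       # warm-up pass
    t0 = time.perf_counter()
    for _ in range(10):
        model(x)
    latency_ms = (time.perf_counter() - t0) / 10 * 1e3

print(f"conv5_3 cutoff latency: {latency_ms:.1f} ms/inference")
```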

4. Performance Metrics and Reporting

Bench2Drive defines a robust set of indicators:

  • Throughput (images/sec):

$\mathrm{Thr} = \frac{N_{\mathrm{inf}}}{T_{\mathrm{total}}}$

  • Latency (ms/inference):

$T_{\mathrm{latency}} = t_{\mathrm{end}} - t_{\mathrm{start}}$

  • Energy per inference (mJ):

$E = P_{\mathrm{avg}} \cdot T_{\mathrm{latency}}$

  • External memory footprint (MB): peak DRAM, weights+activations.
  • On-chip buffer (MB): largest SRAM allocation.
  • Hardware–Model Mismatch (%):

$\mathrm{Mismatch} = 100\% - \mathrm{Utilization}$

with

$\mathrm{Utilization} = \frac{\mathrm{Measured\;TOPS}}{\mathrm{Peak\;TOPS}} \times 100\%$

  • CARLA Leaderboard metrics (closed-loop): Success Rate (SR), Driving Score (DS), route completion ratio (RC), efficiency, comfort, event-specific Skill Score.

System outputs and metrics undergo internal consistency and accuracy cross-validation for each run (Jia et al., 6 Jun 2024, Jia et al., 7 Mar 2025, Runge et al., 2020).
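
The indicators above are simple enough to compute directly. The following minimal sketch, with hypothetical inputs, mirrors the formulas and reproduces the mismatch figure from the unoptimized VGG row in the table below:

```python
def throughput(n_inferences: int, total_s: float) -> float:
    """Images per second: Thr = N_inf / T_total."""
    return n_inferences / total_s

def energy_mj(avg_power_w: float, latency_ms: float) -> float:
    """Energy per inference: E = P_avg * T_latency (W x ms = mJ)."""
    return avg_power_w * latency_ms

def mismatch_pct(measured_tops: float, peak_tops: float) -> float:
    """Hardware-model mismatch: 100% minus achieved utilization."""
    utilization = measured_tops / peak_tops * 100.0
    return 100.0 - utilization

# Hypothetical accelerator sustaining 5.8 of 10 peak TOPS:
print(mismatch_pct(5.8, 10.0))  # 42.0, matching the unoptimized VGG row below
```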

Sample Results (HWA Benchmarking)

| Model | Mode | Thr (img/s) | Lat (ms) | E (mJ) | Mem (MB) | Mismatch (%) |
|---|---|---|---|---|---|---|
| VGG-16₀.₂₅ | Unoptimized | 38 | 26.3 | 12.5 | 45 | 42 |
| VGG-16₀.₂₅ | Optimized | 72 | 13.9 | 8.1 | 30 | 22 |
| SqueezeNet | Unoptimized | 52 | 19.2 | 9.8 | 25 | 35 |
| SqueezeNet | Optimized | 95 | 10.5 | 6.3 | 18 | 15 |
| MobileNet_v2 | Unoptimized | 45 | 22.2 | 11.7 | 28 | 48 |
| MobileNet_v2 | Optimized | 80 | 12.5 | 7.5 | 20 | 28 |

(Runge et al., 2020)

Closed-Loop Model Performance (multi-ability breakdown)

| Model | Success Rate | Driving Score | Merging | Overtaking | Emergency Brake | Mean Ability |
|---|---|---|---|---|---|---|
| AD-MLP | 0.00% | 18.05% | 0.00 | 0.00 | 0.00 | 0.00 |
| UniAD-Tiny | 13.18% | 40.73% | 4.11 | 12.50 | 14.54 | 11.94 |
| DriveAdapter* | 30.71% | 42.91% | 29.23 | 20.00 | 34.71 | 38.23 |
| DriveTransformer-LG | 35.01% | 63.46% | 17.57 | 35.00 | 48.36 | 38.60 |
| WoTE | 31.36% | 61.71% | — | — | — | — |

(Jia et al., 6 Jun 2024, Jia et al., 7 Mar 2025, Li et al., 2 Apr 2025)

5. Architectural and Evaluation Insights

Bench2Drive provides several architectural and experimental findings:

  • Expert feature distillation markedly improves closed-loop skill performance.
  • Unified transformer architectures (task parallelism, sensor and temporal cross-attention) enable superior training stability and multi-scenario robustness (Jia et al., 7 Mar 2025).
  • Model-based trajectory evaluation via BEV world models delivers better real-time candidate scoring, enhancing SR and DS modestly over rule-based selection (Li et al., 2 Apr 2025).
  • Skill attribution: traffic sign compliance and rule-based give-way tasks are mastered more reliably than interactive maneuvers, which remain brittle even for strong baselines.
  • Open-loop L2 errors do not predict closed-loop driving performance, highlighting distribution shift and feedback sensitivity; a minimal sketch of the open-loop metric follows this list.
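
The open-loop metric referenced above can be sketched as follows; this is a hypothetical helper, and the array shapes are assumptions:

```python
import numpy as np

# Open-loop L2: average displacement between predicted and ground-truth waypoints.
def open_loop_l2(pred: np.ndarray, gt: np.ndarray) -> float:
    """pred, gt: (T, 2) future x/y waypoints; returns mean L2 error in metres."""
    return float(np.linalg.norm(pred - gt, axis=1).mean())

# A planner can score well here yet fail in closed loop, because small
# per-step errors compound once its own actions shift the input distribution.
pred = np.array([[1.0, 0.0], [2.0, 0.1], [3.0, 0.3]])
gt = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
print(open_loop_l2(pred, gt))  # ~0.13 m despite a steadily drifting heading
```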

6. Hardware–Model Co-design Recommendations

Bench2Drive supports detailed accelerator selection and optimization guidelines:

  • Large conv nets (uniform compute) are best on vector accelerators with high SRAM.
  • SqueezeNet/fire modules map to engines with dedicated 1×1/3×3 conv.
  • Mobile/lightweight nets (depthwise separable) require channel-parallel architectures and flexible tiling; the MAC-count sketch after this list illustrates why.
  • Dense connectivity and multi-scale heads demand on-chip scratchpads with dynamic allocation for fast memory reuse; absence leads to bandwidth stalls.
  • LSTMs/recurrent nets need stateful operator support and irregular control flow compatibility (Runge et al., 2020).
  • Benchmark design mandates hybrid granularity, realistic resolutions, and comprehensive PI collection for cross-validation and mismatch analysis.
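
A back-of-the-envelope MAC count illustrates the depthwise-separable point; the layer shapes below are illustrative assumptions, not taken from the benchmark:

```python
# MAC counts for one 3x3 conv layer, showing why depthwise-separable nets
# (MobileNet-style) stress memory bandwidth rather than raw compute.
H = W = 56          # feature map size (illustrative)
C_in = C_out = 128  # channels (illustrative)
K = 3               # kernel size

standard = H * W * C_out * C_in * K * K   # dense 3x3 convolution
depthwise = H * W * C_in * K * K          # one 3x3 filter per channel
pointwise = H * W * C_out * C_in          # 1x1 projection
separable = depthwise + pointwise

print(f"standard conv MACs: {standard:,}")
print(f"separable MACs:     {separable:,}  ({standard / separable:.1f}x fewer)")
# Fewer MACs per activation moved means lower arithmetic intensity, so the
# accelerator needs channel-parallel datapaths and flexible tiling to stay busy.
```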

Bench2Drive forms the reference baseline for evaluating hardware accelerator efficacy (the original Bosch Deep Learning Hardware Benchmark), general edge-computing platforms (CAVBench; Wang et al., 2018), and emerging frameworks integrating hierarchical capability assessment, hardware–software adaptation, and closed-loop MLLM/VLM evaluation (Zhang et al., 4 Aug 2025, Wei et al., 11 Jun 2025, You et al., 11 Dec 2024). Its principles of multi-granularity, closed-loop skill isolation, and fair data standardization have been widely adopted in subsequent benchmarks for autonomous driving.

Bench2Drive thus establishes rigorous, multi-skill, hardware-sensitive benchmarking for autonomous driving, making it central to evaluating real-world readiness and catalyzing advances across model architectures, optimization toolchains, and embedded accelerator design.
