SafeBench: Autonomous Driving Safety Benchmark
- SafeBench is a curated dataset and benchmarking platform that evaluates the safety of autonomous driving agents in simulation with rare, critical hazards.
- It features a modular architecture integrated with advanced scenario generation algorithms like Learning-to-Collide and AdvSim to simulate realistic adversarial events.
- Evaluation metrics span safety, functionality, and driving etiquette, revealing trade-offs among RL agents and guiding improvements in autonomous system robustness.
SafeBench is a series of datasets and benchmarking platforms that rigorously evaluate the safety properties of machine learning agents in complex simulated and real-world environments. While several distinct datasets share the "SafeBench" moniker, the original and most widely recognized is "SafeBench: A Benchmarking Platform for Safety Evaluation of Autonomous Vehicles" (Xu et al., 2022). This benchmark provides reproducible, unified evaluation of end-to-end autonomous driving (AD) agents under adversity, with a focus on safety-critical situations that are rare in natural data but crucial for deployment readiness.
1. Platform Architecture and Scenario Taxonomy
SafeBench is constructed atop the CARLA simulator (v0.9.11), operating within Dockerized environments using ROS Noetic, and is partitioned into four modular nodes: Ego Vehicle Node, Agent Node, Scenario Node, and Evaluation Node. This architecture supports rapid substitution of vehicle sensor suites, agent policies, and perception stacks, providing flexible interoperability for research.
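The node separation can be pictured as a plug-in interface: the Agent Node exposes a policy that consumes observations from the Ego Vehicle Node and emits control commands. A minimal sketch follows, assuming a hypothetical `AgentPolicy` base class; the names are illustrative and not the platform's actual API.

```python
# Minimal sketch of a pluggable Agent Node policy, mirroring SafeBench's
# modular separation. Class and method names are illustrative assumptions,
# not the platform's actual API.
from abc import ABC, abstractmethod
from typing import Any, Dict


class AgentPolicy(ABC):
    """Agent Node: maps sensor observations to control commands."""

    @abstractmethod
    def act(self, observation: Dict[str, Any]) -> Dict[str, float]:
        """Return throttle/steer/brake for the current simulation tick."""


class ConstantThrottleAgent(AgentPolicy):
    """Trivial stand-in policy: drive straight at fixed throttle."""

    def act(self, observation: Dict[str, Any]) -> Dict[str, float]:
        return {"throttle": 0.4, "steer": 0.0, "brake": 0.0}
```

Because the Scenario and Evaluation Nodes never touch policy internals, any module satisfying this kind of interface can be benchmarked without changes to the rest of the stack.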
The scenario taxonomy encodes eight classes of "pre-crash" hazards based on NHTSA typology:
- Straight Obstacle
- Turning Obstacle
- Lane Changing
- Vehicle Passing
- Red-light Running
- Unprotected Left-turn
- Right-turn
- Crossing Negotiation
Each scenario class is instantiated over 10 geometrically diverse route variants (e.g., T-junctions, multilane arterials, differing signage), with environmental conditions held fixed. Hazards are introduced by manipulating non-ego actors (vehicles, cyclists, pedestrians) to create realistic and transferable adversarial risk cases.
2. Scenario Generation Algorithms
SafeBench integrates four complementary scenario generation algorithms, each parameterizing hazards to maximize agent failure likelihood while respecting realism constraints:
- Learning-to-Collide (LC): Black-box policy-gradient attack that uses REINFORCE to optimize adversary spawn and trajectory for maximal collision probability with the ego agent (see the sketch after this list).
- AdvSim (AS): Trajectory perturbation via Bayesian Optimization over kinematic bicycle models, focusing on nearby traffic actors’ motion.
- Carla Scenario Generator (CS): Rule-based grid search over domain-knowledge constraints such as signal timing and crossing speeds.
- Adversarial Trajectory Optimization (AT): Particle Swarm Optimization under explicit traffic-law constraints to generate legally plausible but adversarial vehicle interactions.
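To make the first of these concrete, the sketch below illustrates the REINFORCE loop at the core of a Learning-to-Collide-style attack: a Gaussian policy over adversary parameters (here a 2-D spawn offset) is nudged toward configurations that trigger collisions. The `collision_outcome` function is a stand-in for a full CARLA rollout against the ego agent; everything here is illustrative, not the authors' implementation.

```python
# REINFORCE sketch for a Learning-to-Collide-style adversary: sample
# adversary parameters from a Gaussian, observe a binary collision reward,
# and shift the Gaussian mean toward high-reward samples.
import numpy as np

rng = np.random.default_rng(0)

def collision_outcome(theta: np.ndarray) -> float:
    """Stand-in for a CARLA rollout: 1.0 on collision, else 0.0.
    Collisions are most likely near an (unknown) critical offset."""
    critical = np.array([1.5, -0.5])
    p_collision = np.exp(-np.sum((theta - critical) ** 2))
    return float(rng.random() < p_collision)

mu = np.zeros(2)   # mean of the Gaussian policy over adversary parameters
sigma = 0.5        # fixed exploration noise
lr = 0.1           # step size

for _ in range(2000):
    theta = mu + sigma * rng.standard_normal(2)  # sample adversary config
    reward = collision_outcome(theta)            # 1.0 if the ego crashed
    # REINFORCE update: gradient of log N(theta; mu, sigma^2 I) w.r.t. mu
    mu += lr * reward * (theta - mu) / sigma**2

print("learned adversary parameters:", mu)  # drifts toward [1.5, -0.5]
```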
After initial generation of 3,140 scenario instances, a transfer-based selection step retains only those cases that induce failure in at least two of four baseline agents, resulting in 2,352 challenging, utility-rich scenarios.
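The selection step itself reduces to a filter over a scenario-by-agent failure matrix; below is a minimal sketch with stand-in data (in SafeBench the matrix would be populated from simulator rollouts):

```python
# Transfer-based selection sketch: retain scenarios that induce failure
# in at least two of the four baseline agents. The failure matrix here is
# random stand-in data, not benchmark output.
import numpy as np

rng = np.random.default_rng(1)
n_scenarios, n_agents = 3140, 4

# fails[i, j] is True if scenario i caused baseline agent j to fail
fails = rng.random((n_scenarios, n_agents)) < 0.5

keep = fails.sum(axis=1) >= 2          # failure in >= 2 of 4 baselines
selected = np.flatnonzero(keep)
print(f"retained {selected.size} of {n_scenarios} scenarios")
```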
Scenario Distribution Table
| Scenario Type | Instances |
|---|---|
| Straight Obstacle | 228 |
| Turning Obstacle | 164 |
| Lane Changing | 389 |
| Vehicle Passing | 389 |
| Red-light Running | 288 |
| Unprotected Left-turn | 319 |
| Right-turn | 284 |
| Crossing Negotiation | 291 |
3. Evaluation Metrics and Protocol
SafeBench defines a comprehensive suite of ten metrics, $m_1, \dots, m_{10}$, evaluated per test trajectory:
- $m_1$: number of collisions
- $m_2$: red-light violations
- $m_3$: stop-sign violations
- $m_4$: meters driven off-road
- $m_5$: lateral path deviation
- $m_6$: percentage of waypoints reached
- $m_7$: route completion time (conditional on completing the route)
- $m_8$: mean acceleration magnitude
- $m_9$: mean yaw rate
- $m_{10}$: number of lane invasions
Each metric is normalized by a per-metric constant $c_i$ and integrated into an aggregate Overall Score (OS):

$$\mathrm{OS} = \sum_{i=1}^{10} w_i \, \tilde{m}_i, \qquad \tilde{m}_i = \begin{cases} m_i / c_i & \text{if higher is better,} \\ 1 - m_i / c_i & \text{otherwise,} \end{cases}$$

where $w_i$ is the domain-informed importance weight (e.g., collision rate weighted at 0.495). Scores are reported at the safety (collision/violation/off-road), functionality (route stability/completion/time), and etiquette (acceleration/yaw/lane) levels.
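In code, the aggregation is a direction-corrected weighted sum. The sketch below uses placeholder normalization constants $c_i$ and weights; only the 0.495 collision weight comes from the benchmark itself.

```python
# Overall Score sketch: normalize each raw metric by a per-metric
# constant, flip metrics where lower is better, then take a weighted sum.
# All constants and weights except the 0.495 collision weight are
# placeholders, not the benchmark's published values.
metrics = {
    # name:              (raw value, constant c_i, higher_is_better, weight w_i)
    "collisions":        (1.0,  5.0,   False, 0.495),
    "red_light":         (0.0,  3.0,   False, 0.10),
    "waypoints_reached": (92.0, 100.0, True,  0.10),
    # ... the remaining seven metrics follow the same pattern
}

def overall_score(table: dict) -> float:
    score = 0.0
    for raw, const, higher_better, weight in table.values():
        normalized = raw / const
        score += weight * (normalized if higher_better else 1.0 - normalized)
    return score

print(f"OS = {overall_score(metrics):.3f}")
```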
4. Baseline Performance and Trade-Offs
The benchmark includes four reference deep reinforcement learning policies:
- DDPG
- SAC
- TD3
- PPO
Each agent is evaluated across four perception modalities (e.g., low-dimensional state, bird’s-eye view, camera) under both benign and adversarial settings. Salient findings include:
| Algorithm (4D State) | Benign OS | Safety-critical OS |
|---|---|---|
| DDPG | 0.603 | 0.498 |
| SAC | 0.833 | 0.497 |
| TD3 | 0.830 | 0.518 |
| PPO | 0.819 | 0.622 |
- Transitioning from benign to safety-critical scenarios lowers OS for every agent, with absolute drops ranging from roughly 0.10 (DDPG) to 0.34 (SAC).
- SAC and TD3 achieve high benign performance but are brittle to adversarial inputs; PPO, with lower benign OS, maintains superior robustness under safety-critical conditions.
- Algorithmic trade-offs are observed: e.g., SAC is fastest in route completion but exhibits the highest collision rates, while PPO achieves maximal completion but incurs more traffic-law violations.
A plausible implication is that evaluation on benign test routes is insufficient for safety validation; adversarial and rare event testing are critical.
5. Data Formats, Access, and Computational Considerations
SafeBench scenarios are distributed as ROS/YAML configuration files, with output logs in JSON and metrics in per-run CSVs. The platform is designed for cross-agent comparability; researchers can plug in arbitrary agent/controller modules and reproduce metrics under the exact same scenario and reporting protocol.
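Given that layout, cross-agent comparison can be a short post-processing script. The directory structure and column name below are assumptions for illustration, not the platform's actual schema.

```python
# Hypothetical aggregation of per-run metric CSVs across agents; the
# results/<agent>/metrics.csv layout and the "overall_score" column are
# assumed for illustration only.
import csv
from collections import defaultdict
from pathlib import Path

scores = defaultdict(list)

for csv_path in Path("results").glob("*/metrics.csv"):
    agent = csv_path.parent.name
    with csv_path.open() as f:
        for row in csv.DictReader(f):
            scores[agent].append(float(row["overall_score"]))

for agent, runs in sorted(scores.items()):
    print(f"{agent}: mean OS = {sum(runs) / len(runs):.3f} over {len(runs)} runs")
```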
Key computational requirements:
- Single NVIDIA GPU (e.g. RTX 3090)
- ~20 GB disk space (CARLA content)
- 10–15 hours of training per RL method (parallelizable)
- ~2 GB RAM per Docker container instance
Installation is streamlined via provided build/run scripts for the entire simulation and benchmark stack.
6. Application Scenarios and Limitations
SafeBench provides a unified testbed for benchmarking perception-to-control driving stacks, scenario generation methods, safe planning algorithms, and robust reinforcement learning policies. Its modular scenario generation facilitates ablation on both agent-side (policy, perception) and environment-side (scenario, route, hazard type) components.
Limitations include a fixed set of environmental configurations (Town03 with weather held constant, though manual extensions are possible), an adversarial focus on pre-crash criticalities rather than long-horizon risk accumulation, and dependence on simulated physics. The "best" agent on benign tasks is not necessarily the safest in critical conditions; practitioners must therefore choose metrics and scenario types aligned with their deployment hazards.
7. Extensions, Availability, and Research Impact
SafeBench is open-source and actively maintained at https://safebench.github.io. Researchers are encouraged to contribute new scenario templates, hazard generators, agent architectures, and perception modules. The platform has motivated parallel efforts in embodied task planning (Yin et al., 17 Dec 2024), vision-language home safety (Gao et al., 28 Sep 2025), industrial video-language safety (Abdullah et al., 1 Aug 2025), LLM safety benchmarking (Ying et al., 24 Oct 2024, Zhang et al., 14 Jun 2024, In et al., 20 Feb 2025), and offline safe RL (Liu et al., 2023), highlighting its foundational influence in evaluation-driven safety research.
By establishing rigorous, adversarially parameterized, and broadly accessible benchmarks, SafeBench enables the systematic quantification of safety margins for learning-based driving agents, supporting both academic innovation and practical progress toward certified safe autonomy.