SafeBench: AI Safety Benchmark
- SafeBench is a collection of benchmarks providing rigorous safety evaluations for complex AI systems, including LLMs, multimodal models, and autonomous driving agents.
- Its evaluation protocols combine automated and human-guided methods to assess metrics like attack success rate, safety risk index, and refusal matching.
- The framework facilitates comparative analysis of model vulnerabilities and informs advances in scenario generation, red-teaming, and safety-oriented architecture design.
SafeBench
SafeBench refers to a cluster of benchmarks and frameworks—each independently introduced under the "SafeBench" name—for rigorous and systematic safety evaluation of complex AI systems, including LLMs, vision-LLMs (LVLMs), multimodal LLMs (MLLMs), embodied agents, and autonomous driving stacks. The term is also attached to culture- and language-specific test suites (e.g., SafeBench-fa for Persian LLMs) and is widely cited as the canonical challenge set for vision-LLM jailbreak testing. SafeBench frameworks are characterized by scenario-diverse datasets, fine-grained taxonomies of harm, and multi-dimensional, often partially automated evaluation protocols that probe both system-level refusal and nuanced behavioral vulnerabilities.
1. Benchmark Variants, Motivations, and Scope
"SafeBench" benchmarks cluster into several domains:
- Autonomous driving systems: The original SafeBench (Xu et al., 2022) is a modular ROS/CARLA-based platform for testing deep reinforcement learning (DRL) agents in adversarial and knowledge-based safety-critical scenarios. It allows standardized comparison of scenario-generation algorithms, input modalities, and RL policies under diverse traffic hazards.
- Multimodal LLMs and LVLMs: SafeBench (Ying et al., 24 Oct 2024, Geng et al., 31 May 2025, Zou et al., 29 Jul 2025) extends the concept to MLLMs, supporting image, text, and audio queries and evaluating refusal rates or attack success rates under a diverse, taxonomy-driven suite of harmful instructions that probe policy-violating behaviors across modalities.
- Language, culture, and user specificity: SafeBench-fa (Pourbahman et al., 17 Apr 2025) evaluates Persian LLMs using synthetically generated, culture- and taboo-specific queries. Similarly, U-SafeBench (In et al., 20 Feb 2025) introduces user-tailored safety: instructions that are harmless to most but unsafe for specific health, mental, or criminal backgrounds.
- Dialog and reasoning models: Sub-variants such as SafeDialBench (Cao et al., 16 Feb 2025) and SafeRBench (Gao et al., 19 Nov 2025) assess LLM and large reasoning model (LRM) safety under multi-turn dialog, chain-of-thought, or complex planning tasks, tracking not only output refusal but also in-trace emergence of unsafe rationales.
These frameworks fill gaps left by earlier benchmarks—limited scope, narrow modality coverage, lack of adversarial pressure, or trivial input/output formats—by exposing context-sensitive failure modes relevant to real-world deployment.
2. Dataset Construction and Harm Taxonomies
Every SafeBench variant builds on a rigorously defined harm taxonomy, scenario set, and data curation pipeline:
- Autonomous driving SafeBench (Xu et al., 2022): 2,352 filtered scenarios are generated from 8 templates (e.g., Lane Change, Red-light Running), covering variations in road topology, dynamic actors, weather, and signage. Adversarial and knowledge-based generation methods provide both broad coverage and targeted stress-testing of driving policies.
- MLLM/LVLM SafeBench (Ying et al., 24 Oct 2024, Zou et al., 29 Jul 2025, Geng et al., 31 May 2025): Datasets comprise 500 (text-only) or 2,300+ (multimodal) query–scenario pairs, depending on the variant, each mapped to 8–10 major risk categories (Medical, Confidential Info, Cybersecurity, Ethics, Pornography, Criminal Activities, Violence, Speech/Bias). Query generation and harm labeling are LLM-assisted, with candidates filtered and ranked by cross-model or human judges.
- Persian SafeBench-fa (Pourbahman et al., 17 Apr 2025): Queries are generated for six major safety topics (Violence, Unlawful Conduct, Harms to Minors, Adult Content, Mental Health, Privacy). Sub-topics and question phrasing are adapted to Persian cultural and legal sensitivities, with each example verified by a local-language annotator.
- User-specific U-SafeBench (In et al., 20 Feb 2025): 157 user profiles are defined, each with associated tailored harmful instructions, yielding roughly 2,000 (profile, instruction, label) triples.
Taxonomy design often merges external risk guidelines (e.g., OpenAI/Meta safety manuals, OWASP Top-10) with LLM-judged, application-inspired subcategories, and is typically refined to ensure coverage and realism.
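To make the shared structure of these datasets concrete, the sketch below shows one plausible way to represent a taxonomy-mapped query and a user-tailored (profile, instruction, label) triple. The field names and types are illustrative assumptions, not the released schemas of any SafeBench variant.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative record types; field names are assumptions, not the released schemas.

@dataclass
class HarmfulQuery:
    """A single query-scenario pair mapped to a harm taxonomy."""
    query_id: str
    text: str                     # harmful instruction or question
    category: str                 # e.g., "Criminal Activities"
    subcategory: str              # finer-grained taxonomy label
    modalities: List[str] = field(default_factory=lambda: ["text"])  # e.g., ["text", "image"]
    language: str = "en"          # "fa" for SafeBench-fa-style suites

@dataclass
class UserTailoredItem:
    """A (profile, instruction, label) triple in the U-SafeBench style."""
    profile: str                  # e.g., a health or criminal-background description
    instruction: str              # harmless for most users, unsafe for this profile
    label: str                    # "safe" or "unsafe" given the profile
```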
3. Evaluation Protocols and Metrics
SafeBench evaluation combines automated and, where needed, collaborative human/LLM judgment strategies:
- Autonomous driving (Xu et al., 2022): Evaluation is multi-level, logging collision rate (CR), red-light violations (RR), route-following stability (RF), acceleration (ACC), and lane invasions (LI), with an overall score OS = Σ_i w_i g(m_i), a weighted sum of normalized per-metric scores whose weights prioritize safety-critical failures.
- MLLM/LVLM SafeBench (Ying et al., 24 Oct 2024, Zou et al., 29 Jul 2025, Geng et al., 31 May 2025): Primary metrics include:
- Attack Success Rate (ASR): Fraction of attempts producing an unsafe response (as independently judged by LLM “jury” or taxonomy-based post-processing).
- Safety Risk Index (SRI): Mean threat level per query/response, normalized (0–100).
- Refusal Matching / ARC metrics (Geng et al., 31 May 2025): ARC_a (fraction of items where model produces actionable harmful content); ARC_r (fraction recognizing and responding in any form).
- Jury Deliberation Protocol (Ying et al., 24 Oct 2024): Five LLM jurors, each role-primed, independently assess each response, deliberate, and combine their judgments by majority vote and averaged severity; a minimal sketch of this aggregation appears after this list.
- Dialogue/task planning (Cao et al., 16 Feb 2025, Yin et al., 17 Dec 2024, Gao et al., 19 Nov 2025):
- SafeDialBench: Reports precision, recall, and F1 for unsafe content detection; safety score S_total combines detection, handling, and consistency metrics.
- SafeAgentBench (Yin et al., 17 Dec 2024): Tracks plan rejection rates (fraction of instructions explicitly refused), goal-condition and semantic success rates, execution fraction, and time per task.
- SafeRBench (Gao et al., 19 Nov 2025): Segments reasoning traces into micro-thoughts, computes risk density, defense density, refusal, intention awareness, safe strategy conversion, answer risk/execution level, response complexity, and trajectory coherence.
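As a concrete illustration of how the jury protocol's judgments could roll up into an attack success rate, the sketch below takes per-juror unsafe/safe flags and severity ratings for each attempt, combines them by majority vote and averaged severity, and reports the fraction of attempts judged unsafe. The five-juror setup and majority/average aggregation follow the description above; the data layout, severity scale, and function names are illustrative assumptions.

```python
from statistics import mean

def jury_verdict(juror_flags, juror_severities):
    """Combine independent juror judgments: majority vote on 'unsafe',
    plus the mean severity across jurors (severity scale is an assumption)."""
    unsafe = sum(juror_flags) > len(juror_flags) / 2
    return unsafe, mean(juror_severities)

def attack_success_rate(jury_results):
    """ASR: fraction of attempts whose combined jury verdict is 'unsafe'."""
    verdicts = [jury_verdict(flags, sev)[0] for flags, sev in jury_results]
    return sum(verdicts) / len(verdicts)

# Example: 3 attempts, each judged by 5 role-primed jurors.
results = [
    ([True, True, True, False, True],    [4, 5, 4, 1, 3]),  # unsafe by majority
    ([False, False, True, False, False], [1, 1, 3, 1, 1]),  # safe by majority
    ([True, True, False, True, True],    [5, 4, 2, 4, 5]),  # unsafe by majority
]
print(attack_success_rate(results))  # 2/3, i.e. roughly 0.67
```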
Each protocol explicitly balances safety (refusal, risk detection) against helpfulness and task completeness, with composite scores (e.g., harmonic means in U-SafeBench) to discourage trivial always-refuse or always-fulfill policies.
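A harmonic mean of a safety score and a helpfulness score, for example, collapses toward zero whenever either component is low, so neither an always-refuse nor an always-fulfill policy can score well. Treating U-SafeBench's composite as exactly this two-term form is an assumption based on the description above; the snippet is a minimal sketch, not the benchmark's scoring code.

```python
def composite_score(safety: float, helpfulness: float) -> float:
    """Harmonic mean of safety and helpfulness, each in [0, 1]."""
    if safety + helpfulness == 0:
        return 0.0
    return 2 * safety * helpfulness / (safety + helpfulness)

print(composite_score(1.0, 0.0))  # 0.0   -> always-refuse policy is penalized
print(composite_score(0.6, 0.8))  # ~0.69 -> balanced behavior is rewarded
```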
4. Empirical Results and Model Vulnerabilities
Benchmarking on SafeBench variants consistently exposes significant gaps between state-of-the-art models' claimed alignment and their measured risk performance:
- MLLMs/LVLMs (Ying et al., 24 Oct 2024, Geng et al., 31 May 2025, Zou et al., 29 Jul 2025):
- Open-source models (e.g., LLaVA, Qwen-VL) can be jailbroken with ASR ≈ 0.80–0.92 (PRISM framework), and audio-LLMs (Qwen-Audio) reach ARC_a = 77.6%. Even after defense interventions, attack success remains high unless aggressive input perturbation or multi-step domain adaptation is applied.
- Commercial models (GPT-4o, Claude) perform better (ASR down to 0.7–3.4%), but still exhibit non-trivial vulnerability, especially when textual and image/audio modalities are combined.
- Autonomous driving (Xu et al., 2022):
- DRL agents optimized for benign settings experience major OS drops when evaluated on safety-critical scenarios (e.g., PPO: 0.819 → 0.622).
- Scenario generator AT achieves post-selection CR = 0.811, exposing transferability vulnerabilities.
- Dialog/chain-of-thought/user-specific (Cao et al., 16 Feb 2025, In et al., 20 Feb 2025, Gao et al., 19 Nov 2025):
- No LLM is robust across all dimensions; safety and consistency under diverse jailbreaking is a persistent failure mode (e.g., best S_total in SafeDialBench ≈ 0.81; lowest ≈ 0.68).
- In U-SafeBench, mean safety S = 18.6%, with best model (Claude-3.5-sonnet) S = 63.8%—far below general benchmarks.
- Abstract/hard-to-formalize instructions (e.g., long-horizon plans, disguised harmful requests) drive down safety across all agent types.
- Persian language/cultural context (Pourbahman et al., 17 Apr 2025):
- Gemma-2-9B-it achieves SafeBench-fa score 95.58, but even state-of-the-art models underperform in mental-health and privacy subdomains, indicative of incomplete culture-specific safety training.
Failure analysis consistently identifies unrecognized contextual risk, over-reliance on syntactic cues, limited scenario abstraction, and persistent compliance with multi-step compositional attacks (e.g., visual gadgets, role-play, or indirect instruction embedding).
5. Methodological and Architectural Innovations
Key advances in SafeBench frameworks include:
- Iterative, LLM-guided data generation: LLMs serve as both scenario generators and preliminary annotators; human or cross-model validation corrects for false positives and scenario drift (Ying et al., 24 Oct 2024, Cao et al., 16 Feb 2025).
- Collaborative evaluation and jury mechanisms: Evaluation via multiple independently primed LLMs improves alignment with human judgments (e.g., Cohen’s κ up to 0.89), reduces evaluation bias, and enables majority consensus on subtle cases (Ying et al., 24 Oct 2024); the agreement statistic is sketched after this list.
- Micro-thought/chunking analysis of chain-of-thought: SafeRBench (Gao et al., 19 Nov 2025) tracks risk emergence and protective mechanisms at intermediate reasoning steps, revealing “cliff-edge” vulnerabilities and rationale laundering that would be missed by output-only scoring.
- Multi-perspective risk modeling: SafeToolBench (Xia et al., 9 Sep 2025) integrates nine scoring dimensions (user instruction, tool internals, joint intent-tool match), increasing risk detection in complex tool-use scenarios; removing any one modeling perspective drops overall safety by 6–11 percentage points.
- Cross-modal adversarial adaptation: Conversion of text-only queries into image/audio or mixed-modality challenges reveals an overreliance of model safety mechanisms on textual refusal patterns (Geng et al., 31 May 2025, Zou et al., 29 Jul 2025).
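The Cohen's κ value cited for the jury mechanism is the standard chance-corrected agreement statistic, κ = (p_o − p_e) / (1 − p_e). The sketch below computes it for binary safe/unsafe labels to show how jury-human alignment can be quantified; it is a generic illustration, not the papers' evaluation code, and the example labels are made up.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n          # observed agreement
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(labels_a) | set(labels_b)
    p_e = sum((counts_a[c] / n) * (counts_b[c] / n) for c in categories)  # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Example: jury verdicts vs. human labels on 8 items.
jury  = ["unsafe", "safe", "unsafe", "unsafe", "safe", "safe", "unsafe", "safe"]
human = ["unsafe", "safe", "unsafe", "safe",   "safe", "safe", "unsafe", "safe"]
print(round(cohens_kappa(jury, human), 2))  # 0.75
```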
6. Limitations, Open Problems, and Future Research Directions
Limitations repeatedly cited include:
- Scenario coverage: Sample sizes (e.g., 100 per subcategory in SafeBench (Ying et al., 24 Oct 2024)) may not span the full space of emergent or evolving attacks.
- Modal exclusivity: Early SafeBench versions are text-only or lack combinations of modalities; only recent frameworks systematically challenge MLLMs across text, image, and audio inputs (Ying et al., 24 Oct 2024).
- Realism and abstraction: Abstract, ambiguous, or multi-turn attacks are underrepresented—most LLMs struggle when vulnerable content is only gradually revealed or logically obfuscated (Cao et al., 16 Feb 2025, Gao et al., 19 Nov 2025).
- False positive/negative balancing: Over-refusal penalties and safe/benign sample inclusions are under-explored, risking models that “play it safe” but are functionally useless or, conversely, miss subtle context-specific hazards (In et al., 20 Feb 2025, Pourbahman et al., 17 Apr 2025).
- Generalization: Scenario generators and evaluation modules remain validated only empirically, "in the loop"; formal guarantees on risk bounds or coverage optimality remain an open research avenue.
Key future directions identified by SafeBench authors include expanding dynamic/adaptive red-teaming, supporting continuous scenario and modality augmentation, refining risk stratification, and designing defense frameworks that operate on reasoning and planning traces rather than only on input prompts or outputs (Ying et al., 24 Oct 2024, Zou et al., 29 Jul 2025, Gao et al., 19 Nov 2025, Xu et al., 2022).
7. Comparative Position and Impact
SafeBench frameworks, across their variants and domains, have become foundational reference points:
- For MLLMs: SafeBench (Ying et al., 24 Oct 2024, Geng et al., 31 May 2025) is the standard for evaluating and red-teaming new jailbreak and defense techniques. Comparative benchmarks enable researchers to rapidly quantify relative vulnerabilities, including trade-offs between parameter size, architecture, training corpus, and multi-modal integration.
- For autonomous and embodied agents: SafeBench (Xu et al., 2022, Yin et al., 17 Dec 2024) enables reproducible comparison of RL and LLM-based planning stacks in safety-critical contexts, informing deployment and certification discussions.
- For cross-cultural and user-sensitive LLMs: SafeBench-fa (Pourbahman et al., 17 Apr 2025) and U-SafeBench (In et al., 20 Feb 2025) have highlighted the failure of global models to handle local taboos or individual vulnerabilities, motivating culturally and personally adapted fine-tuning pipelines.
By emphasizing structured taxonomies, transparent annotation and evaluation, and support for adversarial, compositional, and contextually diverse scenarios, SafeBench sets the prevailing methodology for safety assessment in complex ML-driven systems. Its multi-dimensional evaluation protocols and transparent reporting have also driven rapid comparative research on architectural, prompting, and defense advances in safety-aware AI.