Papers
Topics
Authors
Recent
Detailed Answer
Quick Answer
Concise responses based on abstracts only
Detailed Answer
Well-researched responses based on abstracts and relevant paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses
Gemini 2.5 Flash
Gemini 2.5 Flash 89 tok/s
Gemini 2.5 Pro 49 tok/s Pro
GPT-5 Medium 29 tok/s Pro
GPT-5 High 31 tok/s Pro
GPT-4o 98 tok/s Pro
GPT OSS 120B 424 tok/s Pro
Kimi K2 164 tok/s Pro
2000 character limit reached

SALAD-Bench: LLM Safety Benchmark

Updated 10 September 2025
  • SALAD-Bench is a large-scale safety benchmark that systematically evaluates LLM risks by categorizing harms using a hierarchical taxonomy.
  • It incorporates extensive datasets and adversarially enhanced queries to stress-test LLMs, measuring metrics like attack success rate and safety scores.
  • Automated tools like MD-Judge enable reproducible, scalable evaluations of attack and defense strategies, guiding improvements in LLM safety.

SALAD-Bench is a large-scale, multi-dimensional safety benchmark designed for comprehensive evaluation of LLMs, attack strategies, and defense mechanisms. Developed to address the increasing complexity and diversity of model threats and mitigation strategies in contemporary LLMs, SALAD-Bench establishes a hierarchical taxonomy and an integrated evaluation protocol that systematically measures both intrinsic safety and adversarial resilience across the spectrum of harms posed by autonomous language generation systems (Li et al., 7 Feb 2024).

1. Framework Overview and Motivation

SALAD-Bench was conceived in response to the limitations of traditional LLM safety benchmarks, which often cover narrow aspects of harmful behavior, rely on non-systematic sampling criteria, or lack principled mechanisms for adversarial probing. The framework generalizes safety benchmarking by introducing (i) expansive question sets across numerous harm types and manipulation strategies, (ii) a hierarchical taxonomy to facilitate granular risk attribution, and (iii) automated LLM-based evaluators (MD-Judge and MCQ-Judge) for standardized, scalable, and reproducible label assignment.

Fundamentally, SALAD-Bench functions not just as a static benchmark, but as a dual-purpose platform for both adversarial stress-testing (via attack-enhanced queries) and for quantitative defense evaluation, all within a unified, reproducible experimental environment.

2. Hierarchical Taxonomy: Domains, Tasks, Categories

At the core of SALAD-Bench is a tri-level taxonomy:

  • Domains (6 total): These top-level constructs capture broad classes of harm, such as Representation & Toxicity, Misinformation, Information Safety, Malicious Use, Human Autonomy Integrity, and Socioeconomic Harms.
  • Tasks (16 total): Each domain is subdivided into specific evaluative tasks. E.g., Representation/Toxicity includes toxic content, unfair representation, and adult content.
  • Categories (65 total): Fine-grained categories allow for micro-level disaggregation. For toxicity, examples include hate speech, harassment, violence, child abuse, etc.

This structure facilitates targeted analysis of safety vulnerabilities and enables differential performance assessment across the harm topology.

Level Count Example
Domain 6 Misinformation Harms, Malicious Use
Task 16 Toxic Content, Unfair Representation
Category 65 Hate Speech, Violence, Adult Content, Fraud

The taxonomy is explicitly incorporated into the question format and the MD-Judge input, allowing for context-stratified safety scoring and analysis.

3. Dataset Construction and Transformation Functionalities

SALAD-Bench includes a meticulously curated question dataset, constructed from open data sources and self-instructed queries, with systematic enrichment via attack and defense transformations. Question types include:

  • Base questions: Standard prompts addressing direct safety, instruction-following, and knowledge behaviors.
  • Attack-enhanced questions: Augmented via adversarial techniques, including automated attacks (gradient-based, suffix attachment), human-designed jailbreaks, TAP/GPTFuzzer, and red-teaming procedures. These expose latent failure modes in LLMs.
  • Defense-enhanced questions: Curated to challenge attack strategies, including examples with paraphrasing and self-reminder prompts.
  • Multiple-choice (MCQ): Deployed to evaluate consistency in safe/unsafe class assignments as well as instruction adherence.

Transformation workflows involve base response collection, filtration (e.g., via rejection rate and keyword match criteria), and adversarial augmentation with explicit methods. Attack Success Rate (ASR) is defined by the formula

ASR=1(Safety Rate)\mathrm{ASR} = 1 - (\text{Safety Rate})

quantifying the prevalence of unsafe completions in response to adversarial prompts.

4. LLM-Based Evaluation: MD-Judge

SALAD-Bench implements an automated evaluation protocol using MD-Judge, a fine-tuned LLM based (e.g., Mistral-7B) on a corpus of QA pairs encompassing the entire safety taxonomy and manipulation spectrum. Key aspects include:

  • Ingestion of taxonomy metadata as structured context.
  • A templated evaluation format generating multi-dimensional labels for each QA pair (see Figure 1 in the source for schema).
  • Output of safety scores and "unsafe probability" estimates, which are then aggregated for system-level and taxonomy-level evaluation.
  • Automation replaces manual, costly annotation pipelines and improves reproducibility across large-scale experiments.

MD-Judge, along with MCQ-Judge for multiple-choice questions, produces safety labels enabling batch computation of metrics such as F1 score, safety rate, and ASR, under both standard and adversarial conditions.

5. Experimental Results and Insights

Experiments span a wide array of leading LLMs (GPT-4, Claude2, GPT-3.5, Llama-2, Vicuna, etc.) against both base and attack-enhanced subsets:

  • Most LLMs maintain high safety rates in base cases but show pronounced vulnerability (elevated ASR) in adversarially manipulated (attack-enhanced) conditions.
  • Elo ratings calculated on both base and attack sets reveal relative resilience, with models like Claude2 and GPT-4 outperforming others in adversarial contexts.
  • Decomposition by taxonomy levels demonstrates model-specific weak points, e.g., certain LLMs excel in Representation toxicity mitigation but underperform in Information Safety or Malicious Use.
  • Defensive interventions (e.g., paraphrasing and reminder prompts) effect measurable drops in ASR, suggesting practical, scalable defense mechanisms that maintain helpfulness while suppressing abuse vectors.

6. Benchmark Utility and Practical Application

SALAD-Bench enables systematic, fine-grained evaluation of LLM safety, resilience to attack, and defense effectiveness in a joint platform. Researchers may:

  • Quantitatively compare models for safety performance overall and per taxonomy axis.
  • Test attack and defense methodologies within a controlled, extensible framework.
  • Generate actionable insights for model improvement and safety alignment strategies.

The public release at https://github.com/OpenSafetyLab/SALAD-BENCH includes dataset, evaluation scripts, the MD-Judge model, and supporting code for full reproducibility.

7. Future Directions

The benchmark is positioned for ongoing extension and deeper adversarial stress testing. Prospective research avenues include:

  • Finer resolution of new harm domains with evolving LLM applications.
  • Expansion to multilingual and multimodal tasks.
  • Development and systematization of novel attack/defense strategies with automated benchmarking protocols.
  • Enhanced taxonomic granularity for regulatory and compliance use-cases.

SALAD-Bench establishes a template for rigorous LLM safety verification and adaptation under the dynamic landscape of generative language technology.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)