- The paper presents a novel hierarchical safety benchmark for LLMs, integrating 6 domains, 16 tasks, and 65 categories.
- It employs both LLM-based and regex-based evaluation methods to assess model resilience across base, attack, and defense-enhanced datasets.
- Key experimental findings demonstrate superior performance of Claude2 and vulnerabilities in models like Gemini under adversarial conditions.
"SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for LLMs" (2402.05044)
Introduction
LLMs have achieved significant milestones in natural language understanding and generation. However, ensuring their safety is paramount due to the potential for generating harmful content. The "SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for LLMs" paper presents a novel benchmark, SALAD-Bench, designed to evaluate LLMs' safety, as well as attack and defense effectiveness. With a structured hierarchy and diverse question formats, this benchmark aims to address the limitations of prior safety evaluation benchmarks.
Figure 1: Illustration of question enhancement and evaluation procedures in SALAD-Bench, depicting multiple subsets and the evaluation focus on various safety metrics.
Benchmark Composition
SALAD-Bench introduces a hierarchical taxonomy spanning three levels: 6 domains, 16 tasks, and 65 categories, with at least 200 questions per category. This structure enables comprehensive evaluations across different safety dimensions, highlighting both strengths and vulnerabilities of LLMs.
Figure 2: SALAD-Bench's taxonomy with three levels and 65 categories focused on safety issues.
Dataset Characteristics
SALAD-Bench comprises:
- Base Set: Over 21,000 test samples for basic safety evaluation.
- Attack-Enhanced Subset: 5,000 questions to test model resilience against attacks.
- Defense-Enhanced Subset: 200 questions focused on evaluating the effectiveness of defense strategies.
- Multiple-Choice Questions (MCQ): 4,000 questions designed to assess safety decision-making capabilities.
Evaluation Methodology
The SALAD-Bench methodology includes both LLM-based and heuristic evaluation techniques. MD-Judge, an LLM-based evaluator, is utilized for assessing QA pairs, integrating task-level and domain-specific safety taxonomies. In contrast, MCQ-Judge utilizes regex parsing to efficiently evaluate multiple-choice questions.
Figure 3: An example of a training sample with domain-level taxonomy as the safety category.
The evaluation approach ensures an evidence-based and granular analysis of safety vulnerabilities across categories, informed by the structured hierarchy.
Experimental Setup and Results
The experiments involve multiple LLMs, categorized into open-sourced and black-box models. The benchmark explores the safety implications across different domains and tasks.
Figure 4: Safety rates at the domain levels for black-box LLMs using SALAD-Bench’s base set and attack-enhanced subset.
Key findings include:
- Claude2 achieved superior safety performance across both base and attack-enhanced sets, indicating robust safety guardrails.
- Gemini's significantly reduced performance in the attack-enhanced subset highlights underlying vulnerabilities to adversarial inputs.
Attack and Defense Strategies
A variety of attack methods were evaluated, including TAP, GPTFuzzer, and human-designed jailbreak prompts, each demonstrating unique efficacy profiles against open-source models.
Figure 5: Construction of the attack-enhanced dataset.
Conversely, defense methods like GPT-paraphrasing and Self-Reminder prompts exhibited varying success in mitigating unsafe behavior, with paraphrasing showing the most promise in maintaining safety.
Conclusion
SALAD-Bench serves as a comprehensive tool for safety evaluation, illuminating both the strengths and weaknesses of current LLMs in line with ensuring robust AI deployment. The benchmark highlights critical areas for future improvement, specifically the need for adaptive safety measures against evolving adversarial attacks. Continued research and enhancements in safety benchmarking remain essential to advance the reliability and trustworthiness of AI systems in high-stakes applications.