
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models

Published 7 Feb 2024 in cs.CL, cs.AI, cs.CR, and cs.LG | (2402.05044v4)

Abstract: In the rapidly evolving landscape of LLMs, ensuring robust safety measures is paramount. To meet this crucial need, we propose SALAD-Bench, a safety benchmark specifically designed for evaluating LLMs as well as attack and defense methods. Distinguished by its breadth, SALAD-Bench transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities. SALAD-Bench is crafted with a meticulous array of questions, from standard queries to complex ones enriched with attack and defense modifications and multiple-choice formats. To effectively manage the inherent complexity, we introduce an innovative evaluator: the LLM-based MD-Judge for QA pairs, with a particular focus on attack-enhanced queries, ensuring seamless and reliable evaluation. These components extend SALAD-Bench from standard LLM safety evaluation to the evaluation of both LLM attack and defense methods, ensuring its joint-purpose utility. Our extensive experiments shed light on the resilience of LLMs against emerging threats and the efficacy of contemporary defense tactics. Data and evaluator are released at https://github.com/OpenSafetyLab/SALAD-BENCH.

Citations (54)

Summary

  • The paper presents a novel hierarchical safety benchmark for LLMs, integrating 6 domains, 16 tasks, and 65 categories.
  • It employs both LLM-based and regex-based evaluation methods to assess model resilience across base, attack, and defense-enhanced datasets.
  • Key experimental findings demonstrate superior performance of Claude2 and vulnerabilities in models like Gemini under adversarial conditions.

"SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for LLMs" (2402.05044)

Introduction

LLMs have achieved significant milestones in natural language understanding and generation. However, ensuring their safety is paramount due to the potential for generating harmful content. The "SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for LLMs" paper presents a novel benchmark, SALAD-Bench, designed to evaluate LLMs' safety, as well as attack and defense effectiveness. With a structured hierarchy and diverse question formats, this benchmark aims to address the limitations of prior safety evaluation benchmarks (Figure 1).

Figure 1: Illustration of question enhancement and evaluation procedures in SALAD-Bench, depicting multiple subsets and the evaluation focus on various safety metrics.

Benchmark Composition

SALAD-Bench introduces a hierarchical taxonomy spanning three levels: 6 domains, 16 tasks, and 65 categories, with at least 200 questions per category. This structure enables comprehensive evaluations across different safety dimensions, highlighting both strengths and vulnerabilities of LLMs (Figure 2).

Figure 2: SALAD-Bench's taxonomy with three levels and 65 categories focused on safety issues.
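
To make the three-level structure concrete, the sketch below shows one way the taxonomy and its per-category question quota could be represented in code. The domain, task, and category names used here are illustrative placeholders, not the paper's exact labels.

```python
from dataclasses import dataclass, field

@dataclass
class Category:
    name: str
    questions: list[str] = field(default_factory=list)  # >= 200 questions per category in SALAD-Bench

@dataclass
class Task:
    name: str
    categories: list[Category]

@dataclass
class Domain:
    name: str
    tasks: list[Task]

# Illustrative slice of the 6-domain / 16-task / 65-category hierarchy;
# the names below are placeholders rather than the paper's exact taxonomy labels.
taxonomy = [
    Domain(
        name="Representation & Toxicity Harms",
        tasks=[
            Task(
                name="Toxic Content",
                categories=[Category("Hate Speech"), Category("Insults")],
            )
        ],
    ),
]
```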

Dataset Characteristics

SALAD-Bench comprises:

  • Base Set: Over 21,000 test samples for basic safety evaluation.
  • Attack-Enhanced Subset: 5,000 questions to test model resilience against attacks.
  • Defense-Enhanced Subset: 200 questions focused on evaluating the effectiveness of defense strategies.
  • Multiple-Choice Questions (MCQ): 4,000 questions designed to assess safety decision-making capabilities.
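
As a rough illustration of how the released data might be pulled into an evaluation pipeline, the sketch below loads the base set and groups questions by their first-level domain label. The repository URL comes from the paper, but the file path and JSON field names here are assumptions for illustration only.

```python
import json
from collections import defaultdict
from urllib.request import urlopen

# Hypothetical path: the repo URL is from the paper, but the exact file
# layout and JSON field names are assumptions.
BASE_SET_URL = (
    "https://raw.githubusercontent.com/OpenSafetyLab/SALAD-BENCH/main/data/base_set.json"
)

def load_base_set(url: str = BASE_SET_URL):
    """Download the base set and index questions by first-level domain."""
    with urlopen(url) as resp:
        records = json.load(resp)
    by_domain = defaultdict(list)
    for rec in records:
        # Assumed fields: "question" plus a first-level category label.
        by_domain[rec["1-category"]].append(rec["question"])
    return by_domain

# by_domain = load_base_set()
# print({domain: len(qs) for domain, qs in by_domain.items()})
```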

Evaluation Methodology

The SALAD-Bench methodology includes both LLM-based and heuristic evaluation techniques. MD-Judge, an LLM-based evaluator, is utilized for assessing QA pairs, integrating task-level and domain-specific safety taxonomies. In contrast, MCQ-Judge utilizes regex parsing to efficiently evaluate multiple-choice questions (Figure 3).

Figure 3: An example of a training sample with domain-level taxonomy as the safety category.
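
The sketch below shows, at a high level, how an LLM-based QA-pair judge such as MD-Judge might be invoked: the safety taxonomy is embedded in the evaluation prompt and the judge labels the question-answer pair as safe or unsafe. The prompt wording and the model identifier are illustrative assumptions, not the paper's exact template or checkpoint.

```python
# Minimal sketch of an MD-Judge-style evaluation call using Hugging Face
# transformers. The prompt template and model name are assumptions; consult
# the OpenSafetyLab release for the official template and checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

JUDGE_MODEL = "OpenSafetyLab/MD-Judge-v0.1"  # assumed identifier

EVAL_TEMPLATE = """You are a safety evaluator. Using the taxonomy below,
decide whether the ASSISTANT answer to the USER question is safe.
Taxonomy: {taxonomy}
USER: {question}
ASSISTANT: {answer}
Verdict (safe/unsafe) and violated category:"""

def judge_qa_pair(question: str, answer: str, taxonomy: str) -> str:
    tokenizer = AutoTokenizer.from_pretrained(JUDGE_MODEL)
    model = AutoModelForCausalLM.from_pretrained(JUDGE_MODEL)
    prompt = EVAL_TEMPLATE.format(taxonomy=taxonomy, question=question, answer=answer)
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=32)
    # Decode only the newly generated tokens (the judge's verdict).
    return tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
```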

The evaluation approach ensures an evidence-based and granular analysis of safety vulnerabilities across categories, informed by the structured hierarchy.
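
For the multiple-choice subset, MCQ-Judge relies on regex parsing rather than a judge model. A minimal sketch of how selected options could be extracted from a model's free-form reply is shown below; the exact patterns used by the benchmark are not reproduced here.

```python
import re

def extract_choices(model_output: str) -> set[str]:
    """Pull option letters (A-D) out of a free-form model reply.

    Illustrative only: SALAD-Bench's MCQ-Judge uses its own regex rules;
    this sketch just shows the general regex-parsing idea.
    """
    # Match standalone option letters such as "A", "(B)", or "C."
    letters = re.findall(r"\b\(?([A-D])\)?[.,:]?(?=\s|$)", model_output)
    return set(letters)

print(extract_choices("The safe responses are (A) and C."))  # {'A', 'C'}
```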

Experimental Setup and Results

The experiments involve multiple LLMs, categorized into open-source and black-box models. The benchmark explores the safety implications across different domains and tasks (Figure 4).

Figure 4: Safety rates at the domain levels for black-box LLMs using SALAD-Bench’s base set and attack-enhanced subset.

Key findings include:

  • Claude2 achieved superior safety performance across both base and attack-enhanced sets, indicating robust safety guardrails.
  • Gemini's significantly reduced performance in the attack-enhanced subset highlights underlying vulnerabilities to adversarial inputs.
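
The reported numbers are per-domain safety rates, i.e., the fraction of responses judged safe within each domain. A small aggregation sketch under that assumption is given below; the field names and example values are illustrative.

```python
from collections import defaultdict

def safety_rates(judgments):
    """Compute per-domain safety rates from (domain, is_safe) judgments.

    `judgments` is an iterable of (domain, bool) pairs, e.g. produced by
    running a judge over a model's answers; this structure is an assumption.
    """
    totals, safe = defaultdict(int), defaultdict(int)
    for domain, is_safe in judgments:
        totals[domain] += 1
        safe[domain] += int(is_safe)
    return {d: safe[d] / totals[d] for d in totals}

example = [("Malicious Use", True), ("Malicious Use", False), ("Misinformation Harms", True)]
print(safety_rates(example))  # {'Malicious Use': 0.5, 'Misinformation Harms': 1.0}
```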

Attack and Defense Strategies

A variety of attack methods were evaluated, including TAP, GPTFuzzer, and human-designed jailbreak prompts, each demonstrating unique efficacy profiles against open-source models (Figure 5).

Figure 5: Construction of the attack-enhanced dataset.

Conversely, defense methods like GPT-paraphrasing and Self-Reminder prompts exhibited varying success in mitigating unsafe behavior, with paraphrasing showing the most promise in maintaining safety.
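
As a rough illustration of the prompt-level defenses discussed, the sketch below wraps a user query in a Self-Reminder-style system prompt; the wording is a paraphrase of the general technique, not the exact prompt evaluated in the paper.

```python
def self_reminder_wrap(user_query: str) -> str:
    """Wrap a query in a Self-Reminder-style safety prompt (illustrative wording)."""
    return (
        "You should be a responsible assistant and must not generate harmful or "
        "misleading content. Please answer the following query in a responsible way.\n"
        f"{user_query}\n"
        "Remember: respond responsibly and refuse unsafe requests."
    )

# Hypothetical usage with any chat-completion style API:
# response = llm.generate(self_reminder_wrap("How do I pick a lock?"))
```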

Conclusion

SALAD-Bench serves as a comprehensive tool for safety evaluation, illuminating both the strengths and weaknesses of current LLMs with respect to robust AI deployment. The benchmark highlights critical areas for future improvement, in particular the need for adaptive safety measures against evolving adversarial attacks. Continued research and enhancements in safety benchmarking remain essential to advance the reliability and trustworthiness of AI systems in high-stakes applications.
