- The paper introduces JailbreakBench, a comprehensive benchmark offering a standardized framework and dataset for evaluating the robustness of LLMs against jailbreaking attacks.
- It details the JBB-Behaviors dataset with 100 misuse behaviors across ten categories and an evolving repository of adversarial prompts to enhance reproducibility.
- Initial findings reveal significant LLM vulnerabilities and varied defense effectiveness, underscoring the need for stronger safety mechanisms in future models.
Introducing JailbreakBench: A Comprehensive Benchmark for Assessing the Robustness of LLMs against Jailbreaking Attacks
Overview of JailbreakBench
JailbreakBench addresses the critical challenge of evaluating jailbreak attacks, which coax LLMs into generating harmful or otherwise objectionable content through adversarial prompts. The benchmark stands out by offering a standardized evaluation framework that includes a new jailbreaking dataset (JBB-Behaviors), a repository of state-of-the-art adversarial prompts (jailbreak artifacts), and a clearly defined threat model alongside system prompts and scoring functions. The developers of JailbreakBench have also established a leaderboard to track advances in both attacking and defending LLMs, fostering a competitive and collaborative environment for researchers.
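To illustrate how the jailbreak artifacts are meant to be consumed, here is a minimal sketch using the project's Python package. The `read_artifact` helper, the attack/model identifiers, and the field names on each entry are assumptions based on my recollection of the project's documentation, not verified API signatures.

```python
# Minimal sketch: retrieving a saved jailbreak artifact with the jailbreakbench package.
# The helper name, method/model identifiers, and entry fields below are assumptions;
# consult the project's repository for the exact interface.
import jailbreakbench as jbb

artifact = jbb.read_artifact(method="PAIR", model_name="vicuna-13b-v1.5")
entry = artifact.jailbreaks[0]          # one adversarial prompt plus metadata (assumed fields)
print(entry.goal)                       # the misuse behavior being targeted
print(entry.prompt)                     # the adversarial prompt that was submitted
print(entry.jailbroken)                 # whether the judge labeled the response as jailbroken
```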
The JailbreakBench Dataset and Repository
JailbreakBench introduces the JBB-Behaviors dataset, featuring 100 unique misuse behaviors across ten categories derived from OpenAI's usage policies. The dataset supports a comprehensive examination of LLM vulnerabilities across a wide spectrum of harmful content. Accompanying it is an evolving repository of adversarial prompts, which addresses the reproducibility problems that plague current jailbreaking research, where successful prompts are often not shared. This openly accessible repository is crucial for the development of more robust LLM defenses.
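The sketch below shows one way to inspect the behaviors, assuming the dataset is hosted on the Hugging Face Hub under the JailbreakBench organization; the dataset id, configuration name, split, and column names are assumptions and may differ from the actual release.

```python
# Minimal sketch: loading JBB-Behaviors from the Hugging Face Hub.
# Dataset id, config, split, and column names are assumptions; adjust to the actual release.
from collections import Counter
from datasets import load_dataset

behaviors = load_dataset("JailbreakBench/JBB-Behaviors", "behaviors", split="harmful")
print(len(behaviors))                                   # expected: 100 misuse behaviors
print(behaviors[0])                                     # one behavior record (assumed schema)

# Tally behaviors per category (ten categories derived from usage policies).
print(Counter(row["Category"] for row in behaviors))
```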
Standardized Evaluation Framework
A key contribution of JailbreakBench is its standardized evaluation framework, designed for accessibility and extensibility. The framework simplifies benchmarking LLM robustness against jailbreaking attacks and accommodates a range of attack methodologies, including black-box, white-box, and adaptive attacks. Moreover, the leaderboard hosted on the JailbreakBench website offers a transparent platform for comparing the effectiveness of different approaches in real time.
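To make the shape of such an evaluation concrete, the sketch below shows a generic benchmarking loop: each adversarial prompt is sent to a target model, the response is scored by a judge, and the results are aggregated into an attack success rate. The `query_target` and `is_jailbroken` callables are hypothetical placeholders standing in for a model endpoint and a judge, not JailbreakBench APIs.

```python
# Generic sketch of a jailbreak evaluation loop; query_target and is_jailbroken
# are hypothetical placeholders, not part of the JailbreakBench API.
from typing import Callable, List

def attack_success_rate(
    behaviors: List[str],
    prompts: List[str],
    query_target: Callable[[str], str],          # hypothetical: sends one prompt to the target LLM
    is_jailbroken: Callable[[str, str], bool],   # hypothetical judge: (behavior, response) -> bool
) -> float:
    """Fraction of behaviors for which the target produced a jailbroken response."""
    successes = 0
    for behavior, prompt in zip(behaviors, prompts):
        response = query_target(prompt)
        if is_jailbroken(behavior, response):
            successes += 1
    return successes / len(behaviors)
```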
Initial Findings from JailbreakBench
Initial experiments conducted with JailbreakBench revealed noteworthy insights into the current state of LLM robustness and the effectiveness of existing defenses. The results highlight the susceptibility of both open- and closed-source LLMs to jailbreaking attacks, with certain models proving more vulnerable than others. Conversely, defenses such as SmoothLLM and perplexity filtering showed promise in mitigating these attacks, albeit with varying degrees of success.
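As a concrete illustration of one defense mentioned above, the sketch below implements a basic perplexity filter: a prompt is scored under a small reference language model, and prompts whose perplexity exceeds a threshold are rejected before reaching the target LLM. The choice of GPT-2 as the reference model and the threshold value are illustrative assumptions, not the configuration used by the benchmark.

```python
# Minimal perplexity-filter sketch (reference model and threshold are illustrative assumptions).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2, a proxy for how 'unnatural' a prompt looks."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean negative log-likelihood per token
    return torch.exp(loss).item()

def passes_filter(prompt: str, threshold: float = 500.0) -> bool:
    """Accept the prompt only if its perplexity stays below the (assumed) threshold."""
    return perplexity(prompt) <= threshold
```

The intuition is that optimization-based jailbreaks often append token sequences that are highly unlikely under a natural-language model, so a perplexity threshold can flag them while leaving ordinary prompts untouched.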
Theoretical and Practical Implications
The theoretical significance of JailbreakBench lies in its methodical approach to assessing LLM robustness, offering a standardized metric for evaluating the sophistication of jailbreak attacks and the effectiveness of defenses. Practically, JailbreakBench serves as a crucial tool for researchers and developers, guiding the enhancement of LLM safety features and the formulation of more robust defense mechanisms. Future developments in LLM technology and jailbreaking methodologies will undoubtedly benefit from the insights gained through this benchmark.
Speculations on Future AI Developments
Looking ahead, JailbreakBench is positioned to adapt and expand in response to the evolving landscape of LLM research. The incorporation of new attack and defense strategies, along with the introduction of updated models, will ensure the benchmark remains relevant and valuable. Moreover, the ethical considerations and responsible disclosure practices associated with JailbreakBench underscore the commitment of the AI research community to advancing LLM technologies in a manner that prioritizes safety and integrity.
Conclusion
JailbreakBench represents a significant step forward in the systematic evaluation of LLM robustness against jailbreaking attacks. By providing a comprehensive and reproducible benchmark, it paves the way for future advancements in LLM technologies, ensuring they are equipped to handle the myriad challenges posed by adversarial inputs. As the field continues to grow, the contributions of JailbreakBench will undoubtedly serve as a foundation for the development of safer, more reliable LLMs.