A StrongREJECT for Empty Jailbreaks (2402.10260v2)

Published 15 Feb 2024 in cs.LG, cs.CL, and cs.CR

Abstract: Most jailbreak papers claim the jailbreaks they propose are highly effective, often boasting near-100% attack success rates. However, it is perhaps more common than not for jailbreak developers to substantially exaggerate the effectiveness of their jailbreaks. We suggest this problem arises because jailbreak researchers lack a standard, high-quality benchmark for evaluating jailbreak performance, leaving researchers to create their own. To create a benchmark, researchers must choose a dataset of forbidden prompts to which a victim model will respond, along with an evaluation method that scores the harmfulness of the victim model's responses. We show that existing benchmarks suffer from significant shortcomings and introduce the StrongREJECT benchmark to address these issues. StrongREJECT's dataset contains prompts that victim models must answer with specific, harmful information, while its automated evaluator measures the extent to which a response gives useful information to forbidden prompts. In doing so, the StrongREJECT evaluator achieves state-of-the-art agreement with human judgments of jailbreak effectiveness. Notably, we find that existing evaluation methods significantly overstate jailbreak effectiveness compared to human judgments and the StrongREJECT evaluator. We describe a surprising and novel phenomenon that explains this discrepancy: jailbreaks bypassing a victim model's safety fine-tuning tend to reduce its capabilities. Together, our findings underscore the need for researchers to use a high-quality benchmark, such as StrongREJECT, when developing new jailbreak attacks. We release the StrongREJECT code and data at https://strong-reject.readthedocs.io/en/latest/.

Citations (30)

View on Semantic Scholar

Summary

The paper introduces the StrongREJECT benchmark to evaluate jailbreak effectiveness accurately by assessing both refusal quality and response specificity.
It employs a comprehensive question set across six misuse categories, revealing that many current benchmarks overstate jailbreak success while compromising overall model performance.
Empirical results on GPT-4 demonstrate a significant accuracy drop post-jailbreak, stressing an urgent need for robust defenses in AI safety and policy-making.

Assessing the Robustness and Pitfalls of Jailbreak Detection in LLMs

Introduction

Recent advancements in the domain of LLMs have spotlighted a concerning trend—the rise of model "jailbreaks". These techniques aim to bypass the ethical constraints imposed on LLMs, potentially enabling misuse. In light of this, the paper by Alexandra Souly and colleagues presents a critical intervention: the StrongREJECT benchmark. This novel evaluation framework targets the existing inadequacies in assessing jailbreak methods, focusing on the often-overlooked aspect—response quality. By offering a meticulously curated question set and a distinctive autograding algorithm, StrongREJECT aims to furnish a more nuanced understanding of jailbreak efficacy.

Jailbreak Evaluation Methods: A Critique

Existing benchmarks suffer from two primary shortcomings: unsuitable question sets and biased grading methods. These constraints often paint an inflated picture of jailbreak success. The issue is compounded by strategies that, while aiming to jailbreak a model, inadvertently undermine its response quality—even on benign tasks. The paper's empirical work, demonstrated through the GPT-4 model's performance on the Massive Multitask Language Understanding (MMLU) benchmark, underscores this problem. The findings reveal a staggering decrease in accuracy post-jailbreak application, underscoring the critical trade-off at play.

StrongREJECT: A New Paradigm

The StrongREJECT benchmark addresses existing gaps through its comprehensive and high-quality question set and a refined autograding system. Specifically, it introduces:

A question set spanning six widely prohibited misuse categories, carved out from a cross-reference of usage policies across major LLM vendors. This set aims to encompass specific, answerable, and model-rejected queries.
An autograding system that goes beyond mere refusal detection. It evaluates responses based on refusal, specificity, and convincingness, significantly aligning the grading process with human judgment.

Findings and Implications

The paper presents several key findings:

Many existing jailbreak benchmarks are found lacking in providing an accurate measure of a jailbreak's true effectiveness, often overstating their success.
Certain jailbreak techniques, rather than enhancing model performance on prohibited tasks, detrimentally affect model accuracy across the board—including on harmless inquiries.
The StrongREJECT benchmark, through its nuanced grading scheme, emerges as a robust tool for evaluating jailbreaks, demonstrating a closer alignment with human judgment compared to previous methods.

Looking Forward

The implications of this work stretch beyond academic interest, touching upon practical considerations for AI safety and policy-making. It underscores a pressing need for standardized, open-source benchmarks that can adequately assess and counteract jailbreak attempts. These tools are vital for developing LLMs resilient to misuse, thereby fostering a safer AI ecosystem. Moreover, the paper hints at a future research trajectory focusing on understanding the underlying mechanisms of jailbreaks and devising comprehensive defense strategies.

Conclusion

In sum, the StrongREJECT benchmark represents a significant step forward in the ongoing effort to ensure the ethical use of LLMs. By providing a more accurate assessment tool, it helps illuminate the complex dynamics of model compliance and resistance to jailbreaking. This contribution is not only timely but essential, offering critical insights that can guide both future research and policy-making in the rapidly evolving landscape of AI.

Related Papers

Tweets

https://twitter.com/alxndrdavies/status/1792594591503491317

https://twitter.com/emmons_scott/status/1759640538633150471

https://twitter.com/sdtoyer/status/1762316605500035096

https://twitter.com/gN3mes1s/status/1762215016894259233

https://twitter.com/lu_sichu/status/1870275188878705116

https://twitter.com/dpaleka/status/1763282809752621471