- The paper demonstrates that minor prompt modifications can bypass LLM safety measures, achieving ASRs up to 89% on GPT-4o and 78% on Claude 3.5 Sonnet.
- It employs an automated black-box strategy that applies random augmentations to text, image, and audio inputs, efficiently attacking models across all three modalities.
- The empirical power-law scaling of attack success with sample size highlights critical LLM vulnerabilities and informs future AI security defenses.
Analyzing Best-of-N Jailbreaking: An Automated Method for Multi-Modal LLM Attacks
The paper "Best-of-N Jailbreaking" presents an automated black-box method for effectively compromising LLMs using prompt augmentations. The proposed approach demonstrates significant efficacy across multiple modalities, including text, vision, and audio. This summary evaluates the algorithm's performance, analyzes its implications, and considers potential future applications and developments in AI security.
Best-of-N (BoN) Jailbreaking exploits minor changes in input prompts to bypass safety mechanisms in LLMs. By randomly sampling augmentations such as random capitalization, character scrambling, and character noising for text, or analogous modifications for images and audio, BoN achieves Attack Success Rates (ASRs) as high as 89% on closed-source models like GPT-4o and 78% on Claude 3.5 Sonnet with 10,000 sampled augmentations. Empirically, ASR scales with the number of samples in a power-law-like fashion, which makes the attack both predictable and efficient: success at large sample budgets can be forecast from smaller runs.
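As a rough illustration of the scaling claim, the forecasting relationship can be written as below. The exact parameterization (constants a, b fitted per model) is a sketch based on the paper's description of power-law behavior in the negative log ASR, not a verbatim reproduction of its formula.

```latex
% Sketch of the assumed power-law scaling, with fitted constants a, b > 0:
% the negative log attack success rate decays as a power law in the
% number of sampled augmentations N.
-\log \mathrm{ASR}(N) \;\approx\; a \, N^{-b}
\quad\Longrightarrow\quad
\mathrm{ASR}(N) \;\approx\; \exp\!\left(-a \, N^{-b}\right)
```

Fitting a and b on low-sample runs then yields an estimate of the ASR achievable at larger budgets, which is what makes the scaling result practically useful for both attackers and red teams.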
The algorithm's strength lies in its simplicity and generality: it requires no model log probabilities or gradients, allowing it to target black-box models without internal access. Further, BoN's adaptability to various input modalities underscores LLMs' vulnerability to seemingly innocuous transformations. For instance, varying the font, color, or placement of text rendered in an image, or the speed and pitch of a spoken request, yields successful jailbreaks on state-of-the-art vision and audio models.
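To make the procedure concrete, here is a minimal sketch of the core loop for the text modality. The augmentation choices (random capitalization, character scrambling, ASCII noising) follow the paper's description, but the function names, parameters, and judge interface below are illustrative assumptions, not the authors' implementation.

```python
import random
import string


def augment_text(prompt: str, p: float = 0.3) -> str:
    """Apply random character-level augmentations to a prompt: per-character
    capitalization flips, occasional scrambling of word interiors, and rare
    ASCII noising (parameters here are illustrative)."""
    words = []
    for word in prompt.split():
        chars = list(word)
        # Scramble the interior of longer words with probability p.
        if len(chars) > 3 and random.random() < p:
            middle = chars[1:-1]
            random.shuffle(middle)
            chars = [chars[0]] + middle + [chars[-1]]
        out = []
        for c in chars:
            if c.isalpha() and random.random() < p:
                c = c.swapcase()          # random capitalization flip
            if random.random() < p * 0.2:
                c = random.choice(string.printable[:94])  # ASCII noise
            out.append(c)
        words.append("".join(out))
    return " ".join(words)


def best_of_n_jailbreak(prompt, query_model, is_harmful, n_samples=10_000):
    """Sample augmented prompts until the target model returns a response the
    judge flags as harmful, or the sampling budget is exhausted."""
    for i in range(n_samples):
        candidate = augment_text(prompt)
        response = query_model(candidate)   # black-box API call
        if is_harmful(response):            # external judge / classifier
            return candidate, response, i + 1
    return None, None, n_samples
```

Because each sampled query is independent, the loop parallelizes trivially, and the per-sample cost is just one black-box API call plus a cheap judgment step.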
Given these performance results, BoN Jailbreaking holds particular relevance for researchers focused on adversarial AI and model robustness. By identifying the stochasticity of model outputs and their sensitivity to small input perturbations as exploitable properties, the paper foregrounds an important discussion on future-proofing AI systems against automated attacks of this kind.
The implications of BoN Jailbreaking extend beyond theoretical exploration to practical AI safety red-teaming. The method provides an efficient framework for evaluating the robustness of defense mechanisms, and it motivates the development of more sophisticated safeguards, potentially incorporating adversarial training tailored to counteract such black-box attacks.
Moreover, the scalability and efficacy of BoN suggest it could eventually generalize across more domains and models. Because attack success improves predictably with additional samples, further computational investment could deepen both the understanding and the mitigation of such adversarial attacks.
In conclusion, this research offers a comprehensive evaluation of BoN Jailbreaking as an instrumental tool in AI adversarial strategies and defense validation. As models grow more complex and input-space dimensionality increases, BoN highlights the persistent need for robust, adaptive security measures to prevent misuse across multi-modal LLMs. Future research avenues include optimizing automated defense mechanisms and integrating more nuanced adversarial training protocols to protect these models effectively.