Best-of-N Jailbreaking

Published 4 Dec 2024 in cs.CL, cs.AI, and cs.LG | (2412.03556v2)

Abstract: We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations - such as random shuffling or capitalization for textual prompts - until a harmful response is elicited. We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source LLMs, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. Further, it is similarly effective at circumventing state-of-the-art open-source defenses like circuit breakers. BoN also seamlessly extends to other modalities: it jailbreaks vision LLMs (VLMs) such as GPT-4o and audio LLMs (ALMs) like Gemini 1.5 Pro, using modality-specific augmentations. BoN reliably improves when we sample more augmented prompts. Across all modalities, ASR, as a function of the number of samples (N), empirically follows power-law-like behavior for many orders of magnitude. BoN Jailbreaking can also be composed with other black-box algorithms for even more effective attacks - combining BoN with an optimized prefix attack achieves up to a 35% increase in ASR. Overall, our work indicates that, despite their capability, LLMs are sensitive to seemingly innocuous changes to inputs, which attackers can exploit across modalities.

Summary

  • The paper presents a black-box Best-of-N Jailbreaking technique that exploits AI vulnerabilities across text, vision, and audio modalities.
  • It demonstrates that iterative sampling of prompt variations can increase the Attack Success Rate, revealing a power-law relationship in performance scaling.
  • The study outlines compositional strategies with prefix optimization that enhance sample efficiency and reduce computational overhead.

Introduction

This essay summarizes the Best-of-N (BoN) Jailbreaking technique and what it reveals about vulnerabilities in frontier AI systems. An attacker generates many variations of a prompt and submits them until one elicits a harmful response. The method is black-box: it needs no access to model internals, which lets it apply to closed-source models and extend across text, vision, and audio inputs.

Figure 1: Overview of BoN Jailbreaking, the performance across three input modalities and its scaling behavior.

Methodology

Algorithmic Framework

The core of Best-of-N (BoN) Jailbreaking is a black-box loop that repeatedly samples augmented variations of a prompt until the target model produces a harmful output. Each variation applies a combination of augmentations: character scrambling and random capitalization for text, and modality-specific perturbations for images and audio.

Figure 2: Overview of modality-specific augmentations utilized in BoN Jailbreaking.

Attack Success Rate (ASR)

Attack success rate (ASR) is the primary metric for BoN Jailbreaking. It is measured for each modality as a function of the number of sampled prompts, N. Empirically, ASR follows power-law-like behavior in N over many orders of magnitude, so ASR improves reliably as more augmented prompts are sampled.
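As a minimal sketch of how ASR can be tabulated as a function of N, assume an evaluation log that records, for each request, the 1-based index of the first sample judged harmful, or `None` if no sample succeeded within the budget. The function name `asr_at_n` and the log values are hypothetical and not taken from the paper.

```python
from typing import Optional, Sequence


def asr_at_n(first_success: Sequence[Optional[int]], n: int) -> float:
    """Fraction of requests whose first flagged sample index (1-based) is <= n."""
    if not first_success:
        raise ValueError("need at least one request")
    hits = sum(1 for k in first_success if k is not None and k <= n)
    return hits / len(first_success)


# Hypothetical log for 8 requests (illustrative values, not results from the paper).
first_success = [3, None, 120, 7, None, 45, 1, 900]

# ASR can only stay flat or rise as the sampling budget N grows.
curve = {n: asr_at_n(first_success, n) for n in (1, 10, 100, 1000)}
```

Recording only the first success per request is enough to recompute ASR at any smaller budget without rerunning the evaluation.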

Experimental Validation

Multi-Modal Performance

BoN Jailbreaking extends across three modalities: text, vision, and audio. On text models, ASR reached 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. Vision LLMs (VLMs) such as GPT-4o and audio LLMs (ALMs) such as Gemini 1.5 Pro were also jailbroken using modality-specific augmentations. The method is similarly effective against open-source defenses such as circuit breakers.

Scaling Laws

BoN Jailbreaking scales predictably: the negative log ASR, plotted against the number of samples N, shows power-law-like behavior. A fit on a small number of samples can therefore be used to forecast the ASR expected at a larger sampling budget, without running the larger experiment first.

Figure 3: Negative log ASR exhibits power law-like behavior across models and modalities.
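The scaling behavior can be illustrated with a small fitting sketch. It assumes the functional form -log(ASR) = a * N^(-b), which becomes a straight line after taking logs of both sides: log(-log ASR) = log(a) - b * log(N). The function names and the synthetic data are assumptions made for illustration; the fitting procedure used by the authors may differ.

```python
import math
from typing import Sequence, Tuple


def fit_power_law(ns: Sequence[int], asrs: Sequence[float]) -> Tuple[float, float]:
    """Least-squares fit of log(-log ASR) = log(a) - b*log(N); returns (a, b)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(-math.log(p)) for p in asrs]  # requires 0 < ASR < 1
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = math.exp(my - slope * mx)  # intercept of the line is log(a)
    return a, -slope               # slope of the line is -b


def forecast_asr(a: float, b: float, n: int) -> float:
    """ASR predicted by the fitted curve at sampling budget n."""
    return math.exp(-a * n ** (-b))


# Synthetic points generated from a known curve (a=3.0, b=0.3), not measured data.
ns = [10, 100, 1000]
asrs = [math.exp(-3.0 * n ** -0.3) for n in ns]

a_hat, b_hat = fit_power_law(ns, asrs)
predicted = forecast_asr(a_hat, b_hat, 10_000)
```

The fit is undefined when an observed ASR is exactly 0 or 1, so such points would need to be dropped or handled separately before fitting.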

Synergy with Other Techniques

BoN Jailbreaking can be composed with other black-box attacks, such as Many-Shot Jailbreaking (MSJ). Combining BoN with an optimized prefix attack increases ASR by up to 35% and improves sample efficiency, so fewer augmented prompts are needed to reach a given ASR.

Figure 4: BoN with prefix composition dramatically improves sample efficiency.

Implications and Future Directions

The findings show that current AI models are sensitive to repeated, seemingly innocuous perturbations of their inputs in every modality tested. Because model outputs are stochastic and the space of possible input variations is very large, a defense that holds for one phrasing of a request may fail for another, which points to the need for safety mechanisms that are robust to input variation. Future research could examine how stronger attacks, such as more sophisticated augmentation strategies or gradient-free optimization, affect these results, and how defenses can be hardened against them.

Conclusion

Best-of-N Jailbreaking is a simple and versatile method for exposing vulnerabilities across AI modalities. It demonstrates the fragility of capable AI systems under simple random perturbations and improves understanding of the attack surface of multimodal models. The work highlights the need for continued research on both attacks and defenses to support the responsible deployment and use of AI technologies.
