Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
Overview and Motivation
This paper, authored by Jingtong Su, Julia Kempe, and Karen Ullrich, investigates the resilience of LLMs against adversarial attacks designed to elicit harmful behaviors, a practice known as "jailbreaking". The authors focus on the statistical underpinnings of this phenomenon, framing their analysis within a rigorous theoretical context. They argue that the inherent complexity and variety of the training data make perfect alignment of LLM behavior with human ethical guidelines statistically infeasible.
Theoretical Framework
The authors introduce a novel framework for understanding how LLMs, even when aligned through methods like Reinforcement Learning from Human Feedback (RLHF), can be coerced into producing harmful outputs. The paper begins by outlining the problem: LLMs, trained on diverse and vast corpora, are susceptible to generating inappropriate or malicious content, as evidenced by numerous public instances of LLM jailbreaking using cleverly crafted prompts.
Notations and Assumptions:
- Decomposition of Prompts: The paper decomposes prompts into two components: a query (q) and a concept (c). The concept captures the core content or intent (for instance, a harmful topic), while the query is the surrounding instruction that asks the model to elaborate on that concept.
- Conceptual Model: The paper assumes a latent world model governing "ground-truth" distributions of language output given any concept-query pair. This model reflects the natural statistical properties of language.
- Adversarial Strength Bound: An adversary's power is measured by how far it can perturb the query string while leaving the underlying harmful concept intact; a schematic formalization of this setup is sketched just after this list.
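To make the setup concrete, here is a minimal schematic formalization; the symbols (the distance d, budget ε, harmful output set H_c, and threshold τ) are chosen for illustration and are not necessarily the paper's exact notation.

```latex
% Schematic notation (illustrative symbols; not necessarily the paper's exact definitions)
\begin{align*}
  \text{prompt} &= (q, c), \qquad q \text{ a query string},\ c \text{ a latent concept} \\
  p^{*}(\,\cdot \mid q, c) &\quad \text{ground-truth world model over outputs } y \\
  p_{\theta}(\,\cdot \mid q, c) &\quad \text{pretrained or aligned LLM} \\
  \mathcal{A}_{\epsilon}(q) &= \{\, q' : d(q, q') \le \epsilon \,\}
    \quad \text{query perturbations available to the adversary, with } c \text{ held fixed} \\
  \text{jailbreak:} &\quad \exists\, q' \in \mathcal{A}_{\epsilon}(q)\ \text{such that}\
    p_{\theta}\big(y \in \mathcal{H}_{c} \mid q', c\big) \ge \tau
\end{align*}
```

In words: the attacker may rewrite the query within a bounded edit budget but cannot change the harmful concept, and succeeds if the model's probability mass on the harmful output set exceeds a threshold.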
Statistical Insights and Contributions
- PAC-Bayesian Generalization Bound: The authors derive a PAC-Bayesian generalization bound for pretrained LLMs. They conclude that a well-pretrained LLM will inevitably mimic distributional properties of its training data, including any harmful behaviors present in it; a generic bound of this form is sketched just after this list.
- Unpreventability of Jailbreaking: The analysis extends to show that jailbreaking remains unavoidable after alignment, because adversaries can always find query perturbations that nudge the LLM's output into the harmful set. The authors formalize this through a series of probability bounds, showing that the probability of a successful jailbreak remains non-negligible even under strong alignment.
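For reference, PAC-Bayesian analyses of this kind are typically built on a bound of the following generic (McAllester/Maurer-style) form; this is not necessarily the exact statement proved in the paper, but it conveys the mechanism: a low empirical loss on data that contains harmful text, together with a small KL term, forces the learned model to reproduce that part of the distribution.

```latex
% Generic PAC-Bayesian bound (McAllester/Maurer form), for a loss bounded in [0,1]:
% with probability at least 1 - \delta over an i.i.d. sample S of size n,
\forall Q:\quad
  \mathbb{E}_{h \sim Q}\big[L_{\mathcal{D}}(h)\big]
  \;\le\;
  \mathbb{E}_{h \sim Q}\big[\widehat{L}_{S}(h)\big]
  \;+\;
  \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{2\sqrt{n}}{\delta}}{2n}}
```

Here P is a prior over models fixed before seeing the data and Q is the posterior (for example, one concentrated on the pretrained LLM); the bound ties the model's population behavior to its empirical fit of the training corpus, harmful portions included.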
Practical Implications
The authors propose a practical alignment strategy: E-RLHF (Enhanced Reinforcement Learning from Human Feedback). The method modifies the RLHF objective to increase the likelihood of safe responses: instead of anchoring the KL-regularization term to the reference model conditioned on the original (potentially harmful) prompt, E-RLHF conditions the reference model on a safe transformation of that prompt. This enlarges the safe zone of the output distribution, and the theoretical motivation is that expanding the safe zone relative to the harmful zone statistically reduces the likelihood of successful jailbreak attempts.
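As a sketch of that modification (the notation, and the exact form of the safe transformation x_s, are illustrative assumptions rather than the paper's verbatim formulation):

```latex
% Standard RLHF: KL-regularized reward maximization toward a reference model
\max_{\pi_{\theta}}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}\big[ r(x, y) \big]
  \;-\; \beta\, \mathrm{KL}\!\big( \pi_{\theta}(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

% E-RLHF (sketch): for a harmful prompt x, the KL term is anchored to the reference
% model conditioned on a safe transformation x_s of that prompt (e.g., x prefixed
% with a safety instruction), pulling the aligned policy toward safe completions
\max_{\pi_{\theta}}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_{\theta}(\cdot \mid x)}\big[ r(x, y) \big]
  \;-\; \beta\, \mathrm{KL}\!\big( \pi_{\theta}(\cdot \mid x) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x_{s}) \big)
```

Because only the reference conditioning changes, the reward signal and the rest of the training pipeline can stay as they are, which keeps the modification cheap to adopt on top of existing RLHF- or DPO-style setups.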
Empirical Validation
The empirical results support the theoretical claims. The paper demonstrates that E-RLHF significantly reduces jailbreak success rates relative to standard RLHF on adversarial benchmarks such as AdvBench and HarmBench, outperforming it across several metrics without compromising the LLM's performance on standard helpfulness benchmarks.
Future Directions
This research underscores the need for alignment strategies to evolve continuously as adversarial methods advance. The authors highlight several areas for future work:
- Dynamic World Models: Extending the framework to a world model that changes over time, for example as language use and societal norms shift.
- Multi-round Conversations: Extending the theoretical framework to multi-turn interactions, which more closely mimic real-world LLM applications.
- Advanced Safe Prompts: Further tuning and optimizing the safe transformation of prompts to enhance robustness against diverse adversarial strategies.
Conclusion
This paper provides a rigorous statistical perspective on the inherent vulnerabilities of LLMs to adversarial jailbreaking. Through a combination of theoretical analysis and empirical experimentation, it demonstrates the persistent challenge of aligning LLM behavior perfectly with human ethical standards. The proposed E-RLHF offers a promising direction for mitigating these risks by statistically expanding the safety zones within the output distribution of LLMs, thereby reducing the effectiveness of adversarial attacks.