
Refusal in Language Models Is Mediated by a Single Direction (2406.11717v3)

Published 17 Jun 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Conversational LLMs are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction. Our findings underscore the brittleness of current safety fine-tuning methods. More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.

Citations (49)

Summary

  • The paper identifies a single direction in LLM residual activations that governs refusal of harmful content.
  • It introduces a white-box jailbreak technique using weight orthogonalization to bypass refusal mechanisms while preserving performance.
  • The study highlights vulnerabilities in current safety protocols and calls for more robust alignment strategies in LLMs.

Refusal in Language Models Is Mediated by a Single Direction

Overview of the Research

The paper "Refusal in LLMs Is Mediated by a Single Direction" investigates how LLMs are fine-tuned to refuse harmful instructions. The research team reveals that this refusal behavior can be attributed to a one-dimensional subspace within the model's residual stream activations. The paper spans 13 widely used open-source chat models with up to 72 billion parameters.

Key Methodologies and Findings

1. Extraction of Refusal Direction:

The core finding is the identification of a single direction that mediates refusal behavior. This direction is extracted by comparing residual stream activations on harmful versus harmless instructions: a difference-in-means vector between the two sets of activations captures the refusal feature. Ablating this direction from the model's activations effectively disables refusal, while adding it induces refusal even on benign prompts.
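
A minimal sketch of this extraction and ablation step, assuming hypothetical tensors `harmful_acts` and `harmless_acts` of shape `[num_prompts, d_model]` holding residual stream activations collected at a chosen layer and token position; the function names are illustrative, not the authors' code.

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Unit-norm difference-in-means direction between harmful and harmless
    activations, each of shape [num_prompts, d_model]."""
    diff = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return diff / diff.norm()

def ablate_direction(acts: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """Remove the refusal component from activations: a - (a . r) r."""
    return acts - (acts @ r_hat).unsqueeze(-1) * r_hat

def add_direction(acts: torch.Tensor, r_hat: torch.Tensor,
                  scale: float = 1.0) -> torch.Tensor:
    """Add the refusal direction to induce refusal even on benign prompts."""
    return acts + scale * r_hat
```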

2. White-box Jailbreak Technique:

Leveraging the refusal direction, the authors propose a novel white-box jailbreak method: an interpretable rank-one weight edit, termed weight orthogonalization, which modifies the model weights so that they can no longer write the refusal direction into the residual stream. The method requires no fine-tuning and no training on harmful examples, and it maintains high fidelity to the model's original capabilities.
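
A sketch of the rank-one edit, under the assumption that weights are stored as in a typical Hugging Face `nn.Linear` (`[out_features, in_features]`) and that `r_hat` is the unit-norm refusal direction from above; which modules to edit (embeddings, attention output projections, MLP down-projections) depends on the architecture, and the function names are illustrative.

```python
import torch

def orthogonalize_output_weight(W: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """(I - r r^T) W for a weight of shape [d_model, d_in] whose output is
    added to the residual stream; its outputs lose any component along r_hat."""
    return W - torch.outer(r_hat, r_hat @ W)

def orthogonalize_embedding(E: torch.Tensor, r_hat: torch.Tensor) -> torch.Tensor:
    """The same edit for a matrix whose rows live in the residual stream,
    e.g. a token-embedding table of shape [vocab_size, d_model]."""
    return E - torch.outer(E @ r_hat, r_hat)
```

Applying this edit to every matrix that writes into the residual stream removes the model's ability to represent the refusal direction anywhere in the forward pass, which is what makes it equivalent to ablating the direction at inference time.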

3. Coherence Evaluation:

The paper rigorously tests the coherence of the modified models on standard capability benchmarks such as MMLU, ARC, GSM8K, and TruthfulQA. The evaluations show minimal degradation in performance, suggesting that weight orthogonalization preserves the models' overall capabilities.
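
The paper relies on full benchmark suites for this; as a much lighter-weight illustration of the same idea, the sketch below compares average next-token loss of the original and edited checkpoints on a few benign texts. The model identifiers are placeholders, not released checkpoints.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def mean_loss(model, tokenizer, texts) -> float:
    """Average next-token loss over benign texts; a rough coherence proxy,
    not a substitute for the benchmark suites used in the paper."""
    losses = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        with torch.no_grad():
            losses.append(model(**inputs, labels=inputs["input_ids"]).loss.item())
    return sum(losses) / len(losses)

texts = ["The capital of France is Paris.",
         "Water boils at 100 degrees Celsius at sea level."]
tok = AutoTokenizer.from_pretrained("base-chat-model")                  # placeholder id
base = AutoModelForCausalLM.from_pretrained("base-chat-model")          # placeholder id
edited = AutoModelForCausalLM.from_pretrained("orthogonalized-model")   # placeholder id
print(mean_loss(base, tok, texts), mean_loss(edited, tok, texts))
```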

4. Adversarial Suffix Mechanistic Analysis:

Additionally, the research examines how adversarial suffixes suppress the refusal-mediating direction. The analysis shows that such suffixes hijack the attention mechanism in transformers, drawing attention away from the harmful instruction and toward the suffix tokens, thereby disrupting the propagation of the refusal feature.
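
One way to observe this effect, sketched below under the assumption of a Hugging Face causal LM and the unit-norm direction `r_hat` from above, is to measure how strongly the last-token residual stream projects onto the refusal direction with and without an adversarial suffix appended; the helper name and layer choice are illustrative.

```python
import torch

def refusal_projection(model, tokenizer, prompt: str,
                       r_hat: torch.Tensor, layer: int) -> float:
    """Dot product of the last-token residual stream at `layer` with the
    refusal direction. Per the paper's analysis, this is large for a harmful
    prompt and much smaller once an adversarial suffix is appended."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    resid = out.hidden_states[layer][0, -1].float()  # [d_model], final position
    return float(resid @ r_hat.float().to(resid.device))

# Expected pattern (illustrative, not measured values):
# refusal_projection(model, tok, harmful_prompt, r_hat, layer)              -> large
# refusal_projection(model, tok, harmful_prompt + adv_suffix, r_hat, layer) -> small
```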

Implications and Future Directions

Practical Implications:

The findings underscore the brittleness of current safety mechanisms in LLMs. The simplicity of the refusal feature and the ease of its circumvention pose significant risks for real-world applications of AI. The proposed white-box jailbreak method demonstrates that sophisticated safety protocols can be bypassed with minimal computational effort.

Theoretical Implications:

The paper contributes significant insights to the field of mechanistic interpretability. Understanding refusal through the lens of linear feature directions bridges a gap in current AI research, where the internal workings of LLMs are often treated as black-box operations. This contributes to the broader understanding of how fine-tuning shapes model behavior at the representational level.

Speculation on Future Developments:

Moving forward, researchers may explore alignment techniques that are robust to simple linear interventions of this kind. The approach of isolating and manipulating feature directions may also be extended to other safety-critical aspects such as truthfulness and bias correction. The paper is likely to stimulate further work on adversarial defenses, emphasizing the need for more resilient safety fine-tuning methods.

Conclusion

This paper offers profound insights into the refusal mechanisms of LLMs, revealing the simplicity and vulnerability of current safety protocols. By dissecting the refusal behavior into a single identifiable direction, it opens pathways for both enhancing and potentially undermining model safety. The proposed methodologies for jailbreaking highlight the need for more robust and comprehensive alignment strategies as AI systems become increasingly integrated into high-stakes environments.
