Steering Language Model Refusal with Sparse Autoencoders (2411.11296v2)

Published 18 Nov 2024 in cs.LG

Abstract: Responsible deployment of LLMs requires mechanisms for refusing unsafe prompts while preserving model performance. While most approaches modify model weights through additional training, we explore an alternative: steering model activations at inference time via amplifying sparse autoencoder (SAE) features that mediate refusal. This work uncovers a fundamental tension between SAE steering-based safety improvements and general model capabilities. While feature steering successfully improves robustness against both single-turn and challenging multi-turn jailbreak attempts, we discover that this comes at a previously underexplored cost -- systematic degradation of performance across multiple benchmark tasks, even on safe inputs with no apparent connection to refusal behavior. This suggests that features mediating refusal may be more deeply entangled with general LLM capabilities than previously understood. Our findings reveal important open questions about the nature of safety-relevant features in LLMs and the feasibility of isolating them for targeted intervention. While SAE-based steering shows promise as a flexible approach to enhancing LLM safety, our results highlight the critical need to understand and address the mechanisms behind these capability tradeoffs before such techniques can be practically deployed.

Summary

  • The paper demonstrates that SAEs can identify refusal-mediating features, found via a single handcrafted prompt, which can then be clamped at inference time to adjust safety behavior.
  • The method boosts refusal rates on benchmarks such as WildGuard while reducing adversarial attack success, indicating improved safety.
  • The study highlights a trade-off where enhanced safety via feature steering also diminishes performance on safe prompts and factual recall.

Steering LLM Refusal with Sparse Autoencoders: An Overview

The paper "Steering LLM Refusal with Sparse Autoencoders," authored by Kyle O’Brien et al. from Microsoft, addresses the critical challenge of enhancing the safety of LLMs (LMs) by steering their refusal behavior. The focus is on making LMs refuse unsafe prompts while maintaining compliance with safe ones. This is achieved without altering model weights by leveraging Sparse Autoencoders (SAEs) to identify and modify specific features that mediate refusal behavior in the Phi-3 Mini model.

Methodology and Key Findings

The research uses sparse autoencoders to identify features in the Phi-3 Mini model's activations that influence refusal responses. By clamping these features to fixed values at inference time, the researchers can directly adjust the model's tendency to refuse unsafe prompts.
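In code, this intervention amounts to a forward hook that replaces the targeted SAE latent with a fixed value at a chosen layer. The PyTorch sketch below is a minimal illustration, not the authors' implementation; `model`, `sae`, `layer_idx`, `refusal_feature`, and the clamp value are all assumed names or placeholder choices.

```python
# Minimal sketch of SAE feature clamping (illustrative; not the paper's code).
# Assumptions: `sae.encode` maps residual-stream activations [batch, seq, d_model]
# to latents [batch, seq, d_sae], and `sae.decode` maps latents back.
def make_clamp_hook(sae, refusal_feature: int, clamp_value: float):
    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        latents = sae.encode(resid)
        err = resid - sae.decode(latents)            # keep what the SAE fails to reconstruct
        latents[..., refusal_feature] = clamp_value  # pin the refusal latent to a fixed value
        steered = sae.decode(latents) + err
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return hook

# Hypothetical usage: register the hook on one transformer block, generate, then remove it.
handle = model.transformer.layers[layer_idx].register_forward_hook(
    make_clamp_hook(sae, refusal_feature=1234, clamp_value=8.0)
)
# ... run generation with steering active ...
handle.remove()
```

Adding back the SAE's reconstruction error is a common precaution so the intervention changes only the targeted feature rather than degrading the entire residual stream.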

  1. Feature Identification:
    • The paper presents a straightforward method for identifying refusal-mediating features, notably using a single handcrafted prompt to activate the relevant features (a sketch of one such selection procedure follows this list).
    • Features are found using SAEs trained on the activations of the model, providing a sparse representation that highlights the influence of each feature on refusal behavior.
  2. Safety Improvements:
    • Steering using identified features increases the refusal rate of unsafe prompts, demonstrating potential robustness against attempts to bypass safety measures.
    • The results show a considerable improvement in refusal rates on benchmarks such as WildGuard and XSTest, as well as a reduced success rate for multi-turn adversarial attacks from the Crescendo framework.
  3. Trade-offs in Performance:
    • While enhancing safety, feature steering degrades the model's handling of safe prompts and its performance on benchmarks measuring factual recall and reasoning.
    • A rise in over-refusals for safe prompts signifies a trade-off where steering can reduce the model's overall utility.
  4. Feature Ablation Study:
    • The paper conducts a feature ablation to test whether the capability impact is specific to refusal features. Steering an unrelated feature associated with philosophical discussion produced similar performance degradation, highlighting a general limitation of feature steering across applications.
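To make the feature-identification step in item 1 concrete, the sketch below ranks SAE latents by how much more strongly they activate on a handcrafted refusal-eliciting prompt than on a handful of benign prompts. All names here (`get_residual_acts`, the prompt sets) are assumptions for illustration; the paper's actual selection procedure may differ in detail.

```python
import torch

# Illustrative sketch (not the authors' code): score each SAE latent by how much
# more it activates on a handcrafted refusal-eliciting prompt than on benign
# prompts, and take the top-scoring latent as the candidate refusal feature.
# `get_residual_acts` is an assumed helper returning [seq, d_model] activations.
@torch.no_grad()
def find_refusal_feature(model, tokenizer, sae, layer_idx,
                         refusal_prompt: str, benign_prompts: list[str]) -> int:
    def mean_latents(prompts):
        per_prompt = []
        for p in prompts:
            resid = get_residual_acts(model, tokenizer, p, layer_idx)  # [seq, d_model]
            per_prompt.append(sae.encode(resid).mean(dim=0))           # mean over tokens
        return torch.stack(per_prompt).mean(dim=0)                     # [d_sae]

    scores = mean_latents([refusal_prompt]) - mean_latents(benign_prompts)
    return int(scores.argmax())  # latent most selective for the refusal prompt
```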

Implications and Future Directions

Practical Implications:

  • The method offers an efficient means of enhancing safety post-deployment as it circumvents the need for retraining. This can be particularly valuable for real-time applications where immediate safety adjustments are needed.
  • However, the trade-off between safety and performance merits careful consideration. Applications that require high factual accuracy might face significant challenges when applying feature steering.

Theoretical Implications:

  • This research contributes to the growing body of work on mechanistic interpretability and steering, offering insights into managing LM behavior without standard retraining techniques.
  • It highlights the complexity of relying on a single feature to mediate behavior, prompting further inquiry into more nuanced, multi-feature interventions.

Research Opportunities:

  • Future work could explore larger, more granular SAEs to find features that could minimize performance trade-offs while achieving safety improvements.
  • Investigating conditional steering methods, which apply steering only when an unsafe context is detected, could mitigate the adverse performance impacts (a hypothetical sketch follows this list).
  • Expansion to other LM architectures and scales might provide a broader understanding of the generalizability of these findings.
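As an illustration of the conditional-steering direction mentioned above, one could gate the clamp on whether the prompt itself already activates the refusal latent. The sketch below is purely hypothetical and builds on the illustrative helpers from the earlier sketches (`make_clamp_hook`, `get_residual_acts`); nothing like it is evaluated in the paper, and the threshold and clamp value are placeholders.

```python
import torch

# Hypothetical conditional steering: clamp the refusal latent only when the
# prompt already activates it above a threshold, leaving benign prompts unsteered.
@torch.no_grad()
def generate_conditionally_steered(model, tokenizer, sae, layer_idx, prompt,
                                   refusal_feature, threshold=2.0,
                                   clamp_value=8.0, **gen_kwargs):
    resid = get_residual_acts(model, tokenizer, prompt, layer_idx)   # [seq, d_model]
    max_act = sae.encode(resid)[:, refusal_feature].max().item()

    handle = None
    if max_act > threshold:  # prompt looks refusal-relevant: enable steering
        handle = model.transformer.layers[layer_idx].register_forward_hook(
            make_clamp_hook(sae, refusal_feature, clamp_value)
        )
    try:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        return model.generate(**inputs, **gen_kwargs)
    finally:
        if handle is not None:
            handle.remove()  # always restore the unsteered model
```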

Overall, the paper by O’Brien et al. presents a promising yet challenging avenue for enhancing LLM safety. It underscores the potential of feature steering while also delineating the careful approach required to balance improved safety with preservation of the model's inherent capabilities. As the field progresses, integrating such interpretability techniques will be key to deploying safe and reliable AI systems.