- The paper presents KL-then-steer, a method that minimizes performance degradation from steering vectors while improving adversarial robustness.
- It evaluates models on benchmarks like MT-Bench and manual jailbreak tests, achieving a 44% reduction in jailbreak success rates.
- The technique balances safety and performance, offering a lightweight, post-deployment control method that reduces both sycophancy and harmful outputs.
Steering Without Side Effects: Improving Post-Deployment Control of LLMs
The paper "Steering Without Side Effects: Improving Post-Deployment Control of LLMs" by Stickland et al. addresses a significant challenge in the deployment of LMs: the unexpected and potentially harmful behaviors these models can exhibit post-deployment despite extensive pre-release adversarial training. The authors propose a method named KL-then-steer (KTS) to improve the robustness of these models against such behaviors while maintaining their performance on benign tasks.
Overview of Techniques
The paper aims to navigate the trade-off between model robustness and performance after deployment. Frequent retraining is impractical: it carries heavy logistical costs and produces an unstable experience for users. The authors instead turn to activation steering, a technique that adds specific vectors to the model's hidden states to influence its behavior. Its known drawback is that steering vectors degrade model performance on benign inputs.
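To make the mechanics concrete, here is a minimal sketch of activation steering: a vector is added to one decoder layer's output via a PyTorch forward hook in `transformers`. The layer index, multiplier, and random placeholder vector are illustrative assumptions; in practice the vector is derived from contrastive prompt pairs, not sampled at random.

```python
# Minimal activation-steering sketch (illustrative layer, scale, and vector).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # model family studied in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

layer_idx = 13     # hypothetical choice of intermediate layer
multiplier = -1.0  # sign and scale set the direction and strength of steering
steering_vector = torch.randn(model.config.hidden_size).half()
# Placeholder: real steering vectors come from contrastive activations,
# e.g. the mean difference of residual-stream states on paired prompts.

def add_steering(module, inputs, output):
    # Llama decoder layers return a tuple; hidden states are the first element.
    hidden = output[0] + multiplier * steering_vector.to(output[0].device)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
inputs = tokenizer("How do I pick a strong password?", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
handle.remove()  # detach the hook to restore unsteered behavior
print(tokenizer.decode(out[0], skip_special_tokens=True))
```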
The paper's central contribution, KL-then-steer (KTS), minimizes this degradation. The model is first fine-tuned on benign inputs to minimize the Kullback-Leibler (KL) divergence between its steered output distribution and that of the original, unsteered model; steering vectors are then applied at inference time. The empirical results show that this substantially reduces steering's negative side effects.
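In code, the objective reduces to a KL term between the steered trainable model and a frozen, unsteered copy of the original, computed on benign batches. The sketch below is a minimal rendering of that idea; the function name and loop skeleton are illustrative, not the authors' implementation.

```python
# Minimal sketch of the KL-then-steer training objective (names illustrative).
import torch
import torch.nn.functional as F

def kts_loss(steered_logits: torch.Tensor,
             reference_logits: torch.Tensor) -> torch.Tensor:
    """KL(reference || steered), averaged over the batch.

    steered_logits:   trainable model *with* the steering hook active
    reference_logits: frozen original model, *without* steering
    """
    steered_logp = F.log_softmax(steered_logits, dim=-1)
    reference_logp = F.log_softmax(reference_logits, dim=-1)
    # F.kl_div takes log-probs as input; log_target=True means the target
    # is also given as log-probs.
    return F.kl_div(steered_logp, reference_logp,
                    log_target=True, reduction="batchmean")

# Training-loop skeleton: steer the trainable model on benign prompts and
# pull its distribution back toward the frozen reference.
# for batch in benign_loader:
#     handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
#     steered_logits = model(**batch).logits           # trainable, steered
#     handle.remove()
#     with torch.no_grad():
#         reference_logits = frozen_model(**batch).logits  # frozen, unsteered
#     loss = kts_loss(steered_logits, reference_logits)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```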
Experimental Framework
The paper evaluates the original model, the model with steering applied directly, and the KTS model across a suite of benchmarks. The evaluation covers:
- Adversarial Robustness: Assessed using a manual jailbreak benchmark and a prefill attack (sketched after this list).
- Model Capabilities: Evaluated using MT-Bench, which measures conversational fluency and helpfulness.
- Sycophancy: Measured using an augmented version of TruthfulQA, assessing the model’s susceptibility to user-suggested answers.
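As background on the second attack: a prefill attack seeds the assistant's turn with a compliant opening so the model tends to continue complying rather than refuse. A minimal sketch, assuming the model and tokenizer from the earlier example and an illustrative prefix:

```python
# Minimal prefill-attack sketch (prefix and request are placeholders).
request = "..."  # a harmful request from the evaluation set

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": request}],
    tokenize=False,
    add_generation_prompt=True,
)
prompt += "Sure, here is"  # prefill: forces a compliant start to the reply
# The attack succeeds if the model continues complying instead of refusing:
out = model.generate(**tokenizer(prompt, return_tensors="pt"), max_new_tokens=128)
```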
Key Findings
- Adversarial Robustness: The KTS model achieved a 44% reduction in jailbreak success rate compared to the original Llama-2-chat-7B model, while maintaining high MT-Bench scores, indicating minimal loss of helpfulness.
- Comparison with Baselines:
- System Prompts: A safety-focused system prompt did reduce jailbreak success, but it significantly hurt general performance because the model refused many more benign requests.
- LoRA Fine-Tuning: LoRA fine-tuning with Direct Preference Optimization (DPO) was a strong baseline on its own, and applying steering vectors on top of DPO-tuned models improved robustness further.
- Classifier-Assisted Steering: Using classifiers (logistic probes and Llama Guard 2) to decide when to apply steering vectors proved effective: the resulting models are slightly less robust but perform substantially better on benign tasks, improving MT-Bench scores while still reducing adversarial vulnerability (see the sketch after this list).
- Generalization: KTS helps beyond adversarial robustness; it also reduces sycophancy. The steered KTS model chose user-suggested answers 45% less often and achieved higher accuracy than the unsteered models.
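The classifier-gated variant is straightforward to picture in code: score the prompt with a lightweight probe on intermediate activations and apply the steering hook only when the score crosses a threshold. In the sketch below, the probe weights, layer index, multiplier, and threshold are all illustrative assumptions, and the logistic probe stands in for either of the paper's classifiers.

```python
# Minimal sketch of classifier-gated steering (all parameters illustrative).
import contextlib
import torch

@contextlib.contextmanager
def steering(model, vector, layer_idx, multiplier):
    # Add the steering vector to one decoder layer's output for the
    # duration of the `with` block, then remove the hook.
    def hook(module, inputs, output):
        hidden = output[0] + multiplier * vector.to(output[0].device)
        return (hidden,) + output[1:]
    handle = model.model.layers[layer_idx].register_forward_hook(hook)
    try:
        yield
    finally:
        handle.remove()

@torch.no_grad()
def generate_gated(model, tokenizer, prompt, vector, probe_w, probe_b,
                   layer_idx=13, multiplier=-1.0, threshold=0.5):
    inputs = tokenizer(prompt, return_tensors="pt")
    # Probe feature: mean-pooled hidden state of the chosen layer.
    hidden = model(**inputs, output_hidden_states=True).hidden_states[layer_idx]
    feats = hidden.mean(dim=1).squeeze(0).float()
    p_suspicious = torch.sigmoid(probe_w @ feats + probe_b).item()
    if p_suspicious > threshold:  # flagged: generate with steering active
        with steering(model, vector, layer_idx, multiplier):
            return model.generate(**inputs, max_new_tokens=128)
    return model.generate(**inputs, max_new_tokens=128)  # benign: unsteered
```

Because benign traffic stays on the unsteered path, this design recovers most benign-task performance at a small cost in robustness, matching the trade-off reported above.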
Practical and Theoretical Implications
The KTS technique offers a practical solution for model developers who need to update the behaviors of deployed models without extensive retraining. This methodology allows for targeted behavior modifications, reducing the reliance on monolithic fine-tuning approaches that can destabilize model performance across different user workflows.
From a theoretical perspective, KTS underscores the potential of representation engineering for AI safety. The ability to adjust model behavior in a lightweight, post-deployment setting addresses a significant obstacle to deploying LMs in real-world, high-stakes scenarios. Future research could explore reinforcement learning-based alternatives to KTS or hybrid approaches that combine several control methods.
Future Directions
Future research could focus on several areas, including:
- Exploring Different Classifier Designs: Enhancing the accuracy and robustness of classifiers used to determine when to apply steering vectors.
- Reinforcement Learning Methods: Examining reinforcement learning techniques to optimize steering strategies dynamically.
- Combination Strategies: Investigating new ways to integrate KTS with other fine-tuning techniques to expand the suite of available tools for behavior correction.
Conclusion
The paper by Stickland et al. introduces a compelling method for improving the robustness of deployed LMs through the KL-then-steer technique. By effectively balancing the trade-offs between adversarial robustness and performance on benign tasks, KTS facilitates safer deployments of LMs with minimal side effects, advancing the field of AI safety.