- The paper presents KL-then-steer, a method that minimizes performance degradation from steering vectors while improving adversarial robustness.
- It evaluates models on benchmarks like MT-Bench and manual jailbreak tests, achieving a 44% reduction in jailbreak success rates.
- The technique balances safety and performance, offering a lightweight, post-deployment control method that reduces both sycophancy and harmful outputs.
Steering Without Side Effects: Improving Post-Deployment Control of LLMs
The paper "Steering Without Side Effects: Improving Post-Deployment Control of LLMs" by Stickland et al. addresses a significant challenge in the deployment of LMs: the unexpected and potentially harmful behaviors these models can exhibit post-deployment despite extensive pre-release adversarial training. The authors propose a method named KL-then-steer (KTS) to improve the robustness of these models against such behaviors while maintaining their performance on benign tasks.
Overview of Techniques
The paper aims to navigate the trade-off between model robustness and performance after deployment. Frequent retraining is impractical: it carries heavy logistical costs and produces an unstable experience for users. The authors instead turn to activation steering, a technique that adds specific vectors to the model's hidden states to influence its behavior. Its known drawback is that steering vectors degrade model performance on benign inputs.
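To make the mechanics concrete, here is a minimal sketch of activation steering: a vector is added to one decoder layer's output via a PyTorch forward hook in `transformers`. The layer index, multiplier, and random placeholder vector are illustrative assumptions; in practice the vector is derived from contrastive prompt pairs, not sampled at random.

```python
# Minimal activation-steering sketch (illustrative layer, scale, and vector).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # model family studied in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

layer_idx = 13     # hypothetical choice of intermediate layer
multiplier = -1.0  # sign and scale set the direction and strength of steering
steering_vector = torch.randn(model.config.hidden_size).half()
# Placeholder: real steering vectors come from contrastive activations,
# e.g. the mean difference of residual-stream states on paired prompts.

def add_steering(module, inputs, output):
    # Llama decoder layers return a tuple; hidden states are the first element.
    hidden = output[0] + multiplier * steering_vector.to(output[0].device)
    return (hidden,) + output[1:]

handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
inputs = tokenizer("How do I pick a strong password?", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
handle.remove()  # detach the hook to restore unsteered behavior
print(tokenizer.decode(out[0], skip_special_tokens=True))
```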
The paper's central contribution, KL-then-steer (KTS), minimizes this degradation. The model is first fine-tuned on benign inputs to minimize the Kullback-Leibler (KL) divergence between its steered output distribution and that of the original, unsteered model; steering vectors are then applied at inference time. The empirical results show that this substantially reduces steering's negative side effects.
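In code, the objective reduces to a KL term between the steered trainable model and a frozen, unsteered copy of the original, computed on benign batches. The sketch below is a minimal rendering of that idea; the function name and loop skeleton are illustrative, not the authors' implementation.

```python
# Minimal sketch of the KL-then-steer training objective (names illustrative).
import torch
import torch.nn.functional as F

def kts_loss(steered_logits: torch.Tensor,
             reference_logits: torch.Tensor) -> torch.Tensor:
    """KL(reference || steered), averaged over the batch.

    steered_logits:   trainable model *with* the steering hook active
    reference_logits: frozen original model, *without* steering
    """
    steered_logp = F.log_softmax(steered_logits, dim=-1)
    reference_logp = F.log_softmax(reference_logits, dim=-1)
    # F.kl_div takes log-probs as input; log_target=True means the target
    # is also given as log-probs.
    return F.kl_div(steered_logp, reference_logp,
                    log_target=True, reduction="batchmean")

# Training-loop skeleton: steer the trainable model on benign prompts and
# pull its distribution back toward the frozen reference.
# for batch in benign_loader:
#     handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
#     steered_logits = model(**batch).logits           # trainable, steered
#     handle.remove()
#     with torch.no_grad():
#         reference_logits = frozen_model(**batch).logits  # frozen, unsteered
#     loss = kts_loss(steered_logits, reference_logits)
#     loss.backward(); optimizer.step(); optimizer.zero_grad()
```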
Experimental Framework
The paper evaluates the original model, the model with steering applied directly, and the KTS model across a suite of benchmarks. The evaluation covers:
- Adversarial Robustness: Assessed using a manual jailbreak benchmark and a prefill attack (sketched after this list).
- Model Capabilities: Evaluated using MT-Bench, which measures conversational fluency and helpfulness.
- Sycophancy: Measured using an augmented version of TruthfulQA, assessing the model’s susceptibility to user-suggested answers.
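As background on the second attack: a prefill attack seeds the assistant's turn with a compliant opening so the model tends to continue complying rather than refuse. A minimal sketch, assuming the model and tokenizer from the earlier example and an illustrative prefix:

```python
# Minimal prefill-attack sketch (prefix and request are placeholders).
request = "..."  # a harmful request from the evaluation set

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": request}],
    tokenize=False,
    add_generation_prompt=True,
)
prompt += "Sure, here is"  # prefill: forces a compliant start to the reply
# The attack succeeds if the model continues complying instead of refusing:
out = model.generate(**tokenizer(prompt, return_tensors="pt"), max_new_tokens=128)
```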
Key Findings
- Adversarial Robustness: The KTS model achieved a 44% reduction in jailbreak success rate compared to the original Llama-2-chat-7B model, while maintaining high MT-Bench scores, indicating minimal loss of helpfulness.
- Comparison with Baselines:
- System Prompts: A safety-focused system prompt did reduce jailbreak success, but it significantly hurt general performance because the model refused many more benign requests.
- LoRA Fine-Tuning: LoRA fine-tuning with Direct Preference Optimization (DPO) was a strong baseline on its own, and applying steering vectors on top of DPO-tuned models improved robustness further.
- Classifier-Assisted Steering: Using classifiers (logistic probes and Llama Guard 2) to decide when to apply steering vectors proved effective: the resulting models are slightly less robust but perform substantially better on benign tasks, improving MT-Bench scores while still reducing adversarial vulnerability (see the sketch after this list).
- Generalization: KTS helps beyond adversarial robustness; it also reduces sycophancy. The steered KTS model chose user-suggested answers 45% less often and achieved higher accuracy than the unsteered models.
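The classifier-gated variant is straightforward to picture in code: score the prompt with a lightweight probe on intermediate activations and apply the steering hook only when the score crosses a threshold. In the sketch below, the probe weights, layer index, multiplier, and threshold are all illustrative assumptions, and the logistic probe stands in for either of the paper's classifiers.

```python
# Minimal sketch of classifier-gated steering (all parameters illustrative).
import contextlib
import torch

@contextlib.contextmanager
def steering(model, vector, layer_idx, multiplier):
    # Add the steering vector to one decoder layer's output for the
    # duration of the `with` block, then remove the hook.
    def hook(module, inputs, output):
        hidden = output[0] + multiplier * vector.to(output[0].device)
        return (hidden,) + output[1:]
    handle = model.model.layers[layer_idx].register_forward_hook(hook)
    try:
        yield
    finally:
        handle.remove()

@torch.no_grad()
def generate_gated(model, tokenizer, prompt, vector, probe_w, probe_b,
                   layer_idx=13, multiplier=-1.0, threshold=0.5):
    inputs = tokenizer(prompt, return_tensors="pt")
    # Probe feature: mean-pooled hidden state of the chosen layer.
    hidden = model(**inputs, output_hidden_states=True).hidden_states[layer_idx]
    feats = hidden.mean(dim=1).squeeze(0).float()
    p_suspicious = torch.sigmoid(probe_w @ feats + probe_b).item()
    if p_suspicious > threshold:  # flagged: generate with steering active
        with steering(model, vector, layer_idx, multiplier):
            return model.generate(**inputs, max_new_tokens=128)
    return model.generate(**inputs, max_new_tokens=128)  # benign: unsteered
```

Because benign traffic stays on the unsteered path, this design recovers most benign-task performance at a small cost in robustness, matching the trade-off reported above.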
Practical and Theoretical Implications
The KTS technique offers a practical solution for model developers who need to update the behaviors of deployed models without extensive retraining. This methodology allows for targeted behavior modifications, reducing the reliance on monolithic fine-tuning approaches that can destabilize model performance across different user workflows.
From a theoretical perspective, KTS underscores the potential of representation engineering for AI safety. The ability to adjust model behavior in a lightweight, post-deployment setting addresses a significant obstacle to deploying LMs in real-world, high-stakes scenarios. Future research could explore reinforcement learning-based alternatives to KTS or hybrid approaches that combine several control methods.
Future Directions
Future research could focus on several areas, including:
- Exploring Different Classifier Designs: Enhancing the accuracy and robustness of classifiers used to determine when to apply steering vectors.
- Reinforcement Learning Methods: Examining reinforcement learning techniques to optimize steering strategies dynamically.
- Combination Strategies: Investigating new ways to integrate KTS with other fine-tuning techniques to expand the suite of available tools for behavior correction.
Conclusion
The paper by Stickland et al. introduces a compelling method for improving the robustness of deployed LMs through the KL-then-steer technique. By effectively balancing the trade-offs between adversarial robustness and performance on benign tasks, KTS facilitates safer deployments of LMs with minimal side effects, advancing the field of AI safety.