
Extending Activation Steering to Broad Skills and Multiple Behaviours (2403.05767v1)

Published 9 Mar 2024 in cs.LG, cs.AI, cs.CL, and cs.CY

Abstract: Current LLMs have dangerous capabilities, which are likely to become more problematic in the future. Activation steering techniques can be used to reduce risks from these capabilities. In this paper, we investigate the efficacy of activation steering for broad skills and multiple behaviours. First, by comparing the effects of reducing performance on general coding ability and Python-specific ability, we find that steering broader skills is competitive with steering narrower skills. Second, we steer models to become more or less myopic and wealth-seeking, among other behaviours. In our experiments, combining steering vectors for multiple different behaviours into one steering vector is largely unsuccessful. On the other hand, injecting individual steering vectors at different places in a model simultaneously is promising.
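
For readers unfamiliar with the technique, the sketch below illustrates the general idea of contrastive activation steering: a steering vector is computed as the difference of mean activations on a contrasting prompt pair, then added into the residual stream at inference time via a forward hook. This is a minimal illustration, assuming a GPT-2-style Hugging Face model; the layer index, prompts, and steering coefficient are illustrative placeholders, not the paper's actual settings.

```python
# Minimal sketch of contrastive activation steering (assumptions: GPT-2 from
# Hugging Face transformers; toy prompts; layer 6 and coefficient 4.0 are
# arbitrary placeholders).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

def mean_activation(text: str, layer: int) -> torch.Tensor:
    """Mean residual-stream activation at `layer`, averaged over tokens."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # hidden_states[layer] has shape (batch, seq, hidden).
    return out.hidden_states[layer].mean(dim=(0, 1))

# Steering vector: difference of mean activations on a positive/negative
# prompt pair (toy examples standing in for a behaviour like myopia).
layer = 6
v_myopic = (mean_activation("I only care about immediate rewards.", layer)
            - mean_activation("I plan carefully for the long-term future.", layer))

def make_hook(vector: torch.Tensor, coeff: float):
    """Return a forward hook that adds coeff * vector to a block's output."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + coeff * vector
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Inject one vector at one layer. For multiple behaviours, the abstract's
# promising variant would register separate hooks at *different* layers,
# rather than summing all vectors into a single one.
handle = model.transformer.h[layer].register_forward_hook(make_hook(v_myopic, 4.0))
ids = tokenizer("When choosing an investment, I", return_tensors="pt")
print(tokenizer.decode(model.generate(**ids, max_new_tokens=30)[0]))
handle.remove()
```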

