
Tradeoffs Between Alignment and Helpfulness in Language Models with Steering Methods (2401.16332v5)

Published 29 Jan 2024 in cs.CL and cs.AI

Abstract: LLM alignment has become an important component of AI safety: it enables safe interactions between humans and LLMs by enhancing desired behaviors and inhibiting undesired ones. Alignment is typically achieved by tuning the model or by inserting preset alignment prompts. Recently, representation engineering, which alters the model's behavior by changing its internal representations post-training, was shown to be effective in aligning LLMs (Zou et al., 2023a). Representation engineering yields gains on alignment-oriented tasks such as resistance to adversarial attacks and reduction of social biases, but it was also shown to degrade the model's ability to perform basic tasks. In this paper we study the tradeoff between the increase in alignment and the decrease in helpfulness of the model. We propose a theoretical framework that provides bounds on both quantities, and we demonstrate their relevance empirically. First, we show that under the conditions of our framework, alignment can be guaranteed with representation engineering, but helpfulness is necessarily harmed in the process. Second, we show that helpfulness decreases quadratically with the norm of the representation-engineering (steering) vector, while the alignment guarantee increases only linearly with it, indicating a regime in which representation engineering is efficient to use. We validate these findings empirically and chart the boundaries of the usefulness of representation engineering for alignment.
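
To make the mechanism concrete, below is a minimal sketch of the kind of activation steering the abstract describes: a fixed steering vector is added to one layer's hidden states at inference time, and its norm is the knob that trades alignment against helpfulness. The model choice, layer index, contrast prompts, and strength here are illustrative assumptions, not the paper's experimental setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # stand-in model for illustration; not the paper's choice
LAYER = 6        # which transformer block to steer (assumption)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def last_token_hidden(prompt: str) -> torch.Tensor:
    """Hidden state of the prompt's last token at the output of block LAYER."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0, -1]  # +1: index 0 is the embeddings

# Steering vector from a contrast pair (one common recipe; the paper treats
# such vectors abstractly and studies the effect of their norm).
v = last_token_hidden("Respond safely and refuse harmful requests.") \
    - last_token_hidden("Respond with no regard for safety.")
v = v / v.norm()

ALPHA = 8.0  # steering strength; per the abstract, the helpfulness cost grows
             # roughly quadratically in this norm, the alignment gain linearly

def add_steering(module, inputs, output):
    # GPT-2 blocks return a tuple whose first element is the hidden states
    # of shape (batch, seq, dim); v broadcasts over batch and sequence.
    return (output[0] + ALPHA * v.to(output[0].dtype),) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tok("Tell me how to pick a lock.", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=40)[0]))
handle.remove()  # restore the unsteered model
```

Sweeping ALPHA while scoring refusals on adversarial prompts and accuracy on benign ones would trace the tradeoff curve the paper bounds: a linear gain in alignment against a quadratic loss in helpfulness.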

References (49)
  1. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565, 2016.
  2. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861, 2021.
  3. Imane El Atillah. Man ends his life after an AI chatbot 'encouraged' him to sacrifice himself to stop climate change. Euronews, 2023.
  4. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.
  5. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
  6. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712, 2023.
  7. Plug and play language models: A simple approach to controlled text generation. arXiv preprint arXiv:1912.02164, 2019.
  8. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
  9. Amnesic probing: Behavioral explanation with amnesic counterfactuals. Transactions of the Association for Computational Linguistics, 9:160–175, 2021.
  10. In-context learning creates task vectors. arXiv preprint arXiv:2310.15916, 2023.
  11. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
  12. Unsolved problems in ML safety. arXiv preprint arXiv:2109.13916, 2021.
  13. Inspecting and editing knowledge representations in language models. arXiv preprint arXiv:2304.00740, 2023.
  14. Social biases in NLP models as barriers for persons with disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 5491–5501, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.487. URL https://aclanthology.org/2020.acl-main.487.
  15. Improving activation steering in language models with mean-centring. arXiv preprint arXiv:2312.03813, 2023.
  16. Self-detoxifying language models via toxification reversal. arXiv preprint arXiv:2310.09573, 2023.
  17. Inference-time intervention: Eliciting truthful answers from a language model, July 2023. URL http://arxiv.org/abs/2306.03341.
  18. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3214–3252, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.229. URL https://aclanthology.org/2022.acl-long.229.
  19. In-context vectors: Making in context learning more effective and controllable through latent space steering. arXiv preprint arXiv:2311.06668, 2023a.
  20. Aligning large language models with human preferences through representation engineering. arXiv preprint arXiv:2312.15997, 2023b.
  21. StereoSet: Measuring stereotypical bias in pretrained language models, 2020.
  22. Richard Ngo. The alignment problem from a deep learning perspective. arXiv preprint arXiv:2209.00626, 2022.
  23. Capabilities of GPT-4 on medical challenge problems. arXiv preprint arXiv:2303.13375, 2023.
  24. OpenAI. GPT-4 technical report, 2023.
  25. The effects of reward misspecification: Mapping and mitigating misaligned models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=JYtwGwIL7ye.
  26. Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442, 2023a.
  27. The linear representation hypothesis and the geometry of large language models. arXiv preprint arXiv:2311.03658, 2023b.
  28. Language models are unsupervised multitask learners. 2019.
  29. Scaling language models: Methods, analysis & insights from training Gopher. arXiv preprint arXiv:2112.11446, 2021.
  30. Adversarial robustness with semi-infinite constrained learning. Advances in Neural Information Processing Systems, 34:6198–6215, 2021.
  31. Probabilistically robust learning: Balancing average and worst-case performance. In International Conference on Machine Learning, pp. 18667–18686. PMLR, 2022.
  32. Kevin Roose. A conversation with Bing's chatbot left me deeply unsettled. New York Times, 2023.
  33. Introducing ChatGPT. OpenAI blog, 2023.
  34. On the ethics of building AI in a responsible manner. arXiv preprint arXiv:2004.04644, 2020.
  35. Style transfer from non-parallel text by cross-alignment. Advances in Neural Information Processing Systems, 30, 2017.
  36. Varshini Subhash. Can large language models change user preference adversarially? arXiv preprint arXiv:2302.10291, 2023.
  37. Alignment for advanced machine learning systems. Ethics of Artificial Intelligence, pp. 342–382, 2016.
  38. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  39. Activation addition: Steering language models without optimization. arXiv preprint arXiv:2308.10248, 2023.
  40. A study of implicit bias in pretrained language models against people with disabilities. In Proceedings of the 29th International Conference on Computational Linguistics, pp. 1324–1332, Gyeongju, Republic of Korea, October 2022. International Committee on Computational Linguistics. URL https://aclanthology.org/2022.coling-1.113.
  41. Universal adversarial triggers for attacking and analyzing NLP. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2153–2162, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1221. URL https://aclanthology.org/D19-1221.
  42. Taxonomy of risks posed by language models. In 2022 ACM Conference on Fairness, Accountability, and Transparency, FAccT '22, pp. 214–229, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450393522. doi: 10.1145/3531146.3533088. URL https://doi.org/10.1145/3531146.3533088.
  43. Colin G. West. Advances in apparent conceptual physics reasoning in GPT-4. arXiv e-prints, arXiv:2303, 2023.
  44. Fundamental limitations of alignment in large language models. arXiv preprint arXiv:2304.11082, 2023.
  45. Bot-adversarial dialogue for safe conversational agents. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2950–2968, Online, June 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.235. URL https://aclanthology.org/2021.naacl-main.235.
  46. Automatically exposing problems with neural dialog models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 456–470, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.37. URL https://aclanthology.org/2021.emnlp-main.37.
  47. Eliezer Yudkowsky. Creating Friendly AI 1.0: The analysis and design of benevolent goal architectures. The Singularity Institute, San Francisco, USA, 2001.
  48. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405, 2023a.
  49. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023b.