Aligning Large Language Models with Human Preferences through Representation Engineering (2312.15997v3)
Abstract: Aligning LLMs with human preferences is crucial for enhancing their utility in terms of helpfulness, truthfulness, safety, harmlessness, and interestingness. Existing methods for achieving this alignment often involve employing reinforcement learning from human feedback (RLHF) to fine-tune LLMs based on human labels assessing the relative quality of model responses. Nevertheless, RLHF is susceptible to instability during fine-tuning and presents challenges in implementation. Drawing inspiration from the emerging field of representation engineering (RepE), this study aims to identify relevant representations for high-level human preferences embedded in patterns of activity within an LLM, and to achieve precise control of model behavior by transforming its representations. This novel approach, denoted as Representation Alignment from Human Feedback (RAHF), proves to be effective, computationally efficient, and easy to implement. Extensive experiments demonstrate the efficacy of RAHF in not only capturing but also manipulating representations to align with a broad spectrum of human preferences or values, rather than being confined to a singular concept or function (e.g., honesty or bias). RAHF's versatility in accommodating diverse human preferences shows its potential for advancing LLM performance.
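To make the RepE-style idea concrete, the sketch below shows the general pattern of extracting a "preference direction" from contrastive activations and adding it back during generation. This is a minimal illustration, not the paper's RAHF method: the model name, layer index, steering coefficient, prompts, and hook placement are all illustrative assumptions.

```python
# Minimal sketch of RepE-style activation steering (illustrative assumptions only;
# not the RAHF algorithm from the paper).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder; the paper targets larger LLMs
LAYER = 6             # hypothetical transformer block at which to read/steer
COEFF = 4.0           # hypothetical steering strength

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def hidden_at_layer(text: str) -> torch.Tensor:
    """Mean hidden state of `text` at the output of block LAYER."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block LAYER's output is index LAYER + 1.
    return out.hidden_states[LAYER + 1][0].mean(dim=0)

# 1) Identify a preference direction from a contrastive pair (hypothetical prompts).
preferred = "Here is a careful, honest, and helpful answer to your question."
dispreferred = "Here is a dismissive and misleading answer to your question."
direction = hidden_at_layer(preferred) - hidden_at_layer(dispreferred)
direction = direction / direction.norm()

# 2) Steer generation by adding the direction to that block's output activations.
def steer_hook(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + COEFF * direction.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(steer_hook)
prompt = tok("How should I respond to a rude email?", return_tensors="pt")
steered = model.generate(**prompt, max_new_tokens=40, pad_token_id=tok.eos_token_id)
handle.remove()
print(tok.decode(steered[0], skip_special_tokens=True))
```

The same skeleton extends naturally to the broader setting the abstract describes: rather than a single hand-written contrastive pair, directions can be derived from many preference-labeled responses, which is what allows alignment with a spectrum of human preferences instead of one fixed concept.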