Human Alignment of Large Language Models through Online Preference Optimisation
Abstract: Ensuring alignment of LLMs' outputs with human preferences is critical to guarantee a useful, safe, and pleasant user experience. Thus, human alignment has been extensively studied recently, and several methods, such as Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimisation (DPO) and Sequence Likelihood Calibration (SLiC), have emerged. In this paper, our contribution is two-fold. First, we show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD). This equivalence may seem surprising at first sight, since IPO is an offline method whereas Nash-MD is an online method using a preference model. However, it can be proven when we consider the online version of IPO, that is, when both generations are sampled by the online policy and annotated by a trained preference model. Optimising the IPO loss with such a stream of data then becomes equivalent to finding the Nash equilibrium of the preference model through self-play. Second, building on this equivalence, we introduce IPO-MD, a generalisation of IPO that leverages the regularised sampling approach proposed by Nash-MD: it generates data with a mixture policy (between the online and the reference policy), similarly to the general Nash-MD algorithm. We compare online IPO and IPO-MD to online versions of existing losses on preference data, such as DPO and SLiC, on a summarisation task.
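To make the procedure described in the abstract concrete, the sketch below runs online IPO / IPO-MD on a toy problem: the policy is a categorical distribution over a handful of candidate responses, the preference model is a fixed win-probability matrix, and each step samples a pair from a geometric mixture of the online and reference policies, annotates it with the preference model, and takes a gradient step on the IPO squared-error loss with regression target 1/(2τ). This is a minimal illustration under stated assumptions, not the paper's implementation; all names and constants (`tau`, `alpha_mix`, `preference_matrix`, the toy setup itself) are hypothetical, hard win/loss labels are used where soft labels would also be valid, and setting `alpha_mix = 0` recovers online IPO.

```python
# Toy sketch of online IPO / IPO-MD (illustrative assumption, not the paper's code).
# Policy = categorical over K candidate responses; preference model = fixed matrix.
import jax
import jax.numpy as jnp

K = 5             # number of candidate responses (toy setting)
tau = 0.1         # KL-regularisation strength of the IPO loss
alpha_mix = 0.25  # IPO-MD mixture weight towards the reference policy (0 -> online IPO)

key = jax.random.PRNGKey(0)
ref_logits = jnp.zeros(K)  # uniform reference policy pi_ref
# Toy preference model p(i beats j), symmetrised so that P[i, j] + P[j, i] = 1.
preference_matrix = jax.random.uniform(key, (K, K))
preference_matrix = (preference_matrix + (1.0 - preference_matrix.T)) / 2.0

def log_prob(logits, y):
    return jax.nn.log_softmax(logits)[y]

def ipo_loss(policy_logits, y_w, y_l):
    # IPO regression: the difference of log-ratios should equal 1/(2*tau).
    h = (log_prob(policy_logits, y_w) - log_prob(ref_logits, y_w)
         - log_prob(policy_logits, y_l) + log_prob(ref_logits, y_l))
    return (h - 1.0 / (2.0 * tau)) ** 2

@jax.jit
def train_step(policy_logits, key, lr=0.05):
    k1, k2, k3 = jax.random.split(key, 3)
    # IPO-MD data generation: sample both candidates from the geometric mixture
    # of the online and reference policies (exact for a single categorical).
    mix_logits = (1.0 - alpha_mix) * policy_logits + alpha_mix * ref_logits
    y = jax.random.categorical(k1, mix_logits)
    y2 = jax.random.categorical(k2, mix_logits)
    # Annotate the pair with the preference model (hard winner/loser label).
    win = jax.random.bernoulli(k3, preference_matrix[y, y2])
    y_w = jnp.where(win, y, y2)
    y_l = jnp.where(win, y2, y)
    grads = jax.grad(ipo_loss)(policy_logits, y_w, y_l)
    return policy_logits - lr * grads

policy_logits = jnp.zeros(K)
key = jax.random.PRNGKey(1)
for _ in range(2000):
    key, sub = jax.random.split(key)
    policy_logits = train_step(policy_logits, sub)
print(jax.nn.softmax(policy_logits))  # policy obtained by self-play against the preference model
```

JAX is used here only because autodiff keeps the sketch short; the same loop structure applies when the categorical policy is replaced by a language model and the sampled pair by two generated summaries scored by a trained preference model.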
References
- Concrete problems in AI safety. arXiv, 2016.
- PaLM 2 technical report, 2023.
- A general theoretical paradigm to understand learning from human preferences. arXiv, 2023.
- Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv, 2022a.
- Constitutional AI: Harmlessness from AI feedback. arXiv, 2022b.
- Convex Optimization. Cambridge University Press, 2004.
- Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
- Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv, 2023.
- Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, 2017.
- Reward model ensembles help mitigate overoptimization. arXiv, 2023.
- RAFT: Reward rAnked FineTuning for generative foundation model alignment. arXiv, 2023.
- Helping or herding? Reward model ensembles mitigate but do not eliminate reward hacking. arXiv, 2023.
- Off-policy deep reinforcement learning without exploration. In Proceedings of the International Conference on Machine Learning, 2019.
- Scaling laws for reward model overoptimization. In Proceedings of the International Conference on Machine Learning, 2022.
- Improving alignment of dialogue agents via targeted human judgements. arXiv, 2022.
- Policy shaping: Integrating human feedback with reinforcement learning. In Advances in Neural Information Processing Systems, 2013.
- Contrastive preference learning: Learning from human feedback without RL. arXiv, 2023.
- Camels in a changing climate: Enhancing LM adaptation with Tulu 2. arXiv, 2023.
- Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv, 2019.
- TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. In Proceedings of the Annual International Symposium on Computer Architecture, 2023.
- Understanding the effects of RLHF on LLM generalisation and diversity. arXiv, 2023.
- TAMER: Training an agent manually via evaluative reinforcement. In Proceedings of the IEEE International Conference on Development and Learning, 2008.
- RLAIF: Scaling reinforcement learning from human feedback with AI feedback. arXiv, 2023.
- Statistical rejection sampling improves preference optimization. arXiv, 2023.
- Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2016.
- Nash learning from human feedback. arXiv, 2023.
- WebGPT: Browser-assisted question-answering with human feedback. arXiv, 2021.
- OpenAI. Introducing ChatGPT, 2022. URL https://openai.com/blog/chatgpt.
- Training language models to follow instructions with human feedback. arXiv, 2022.
- The effects of reward misspecification: Mapping and mitigating misaligned models. arXiv, 2022.
- Reward gaming in conditional text generation. In Annual Meeting of the Association for Computational Linguistics, 2022.
- Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, 2023.
- WARM: On the benefits of weight averaged reward models. arXiv, 2024.
- Scaling up models and data with t5x and seqio. arXiv, 2022.
- Proximal policy optimization algorithms. arXiv, 2017.
- Don’t Give Me the Details, Just the Summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2018.
- Adafactor: Adaptive learning rates with sublinear memory cost. arXiv, 2018.
- Benchmarks and algorithms for offline preference-based reward learning. arXiv, 2023.
- A long way to go: Investigating length correlations in RLHF. arXiv, 2023.
- Defining and characterizing reward gaming. In Advances in Neural Information Processing Systems, 2022.
- Learning to summarize with human feedback. In Advances in Neural Information Processing Systems, 2020.
- A minimaximalist approach to reinforcement learning from human feedback. arXiv, 2024.
- Generalized preference optimization: A unified approach to offline alignment. arXiv, 2024.
- Llama 2: Open foundation and fine-tuned chat models. arXiv, 2023.
- Zephyr: Direct distillation of LM alignment. arXiv, 2023.
- TL;DR: Mining Reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization. Association for Computational Linguistics, 2017.
- Beyond reverse KL: Generalizing direct preference optimization with diverse divergence constraints. arXiv, 2023.
- Deep TAMER: Interactive agent shaping in high-dimensional state spaces. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
- Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In Proceedings of the International Conference on Machine Learning, 2022.
- Behavior regularized offline reinforcement learning. arXiv, 2019.
- Self-rewarding language models. arXiv, 2024.
- RRHF: Rank responses to align language models with human feedback without tears. arXiv, 2023.
- SLiC-HF: Sequence likelihood calibration with human feedback. arXiv, 2023.
- Consequences of misaligned AI. In Advances in Neural Information Processing Systems, 2020.