
Human Alignment of Large Language Models through Online Preference Optimisation (2403.08635v1)

Published 13 Mar 2024 in cs.LG, cs.AI, and stat.ML

Abstract: Ensuring alignment of LLMs' outputs with human preferences is critical to guarantee a useful, safe, and pleasant user experience. Thus, human alignment has been extensively studied recently and several methods such as Reinforcement Learning from Human Feedback (RLHF), Direct Policy Optimisation (DPO) and Sequence Likelihood Calibration (SLiC) have emerged. In this paper, our contribution is two-fold. First, we show the equivalence between two recent alignment methods, namely Identity Policy Optimisation (IPO) and Nash Mirror Descent (Nash-MD). Second, we introduce a generalisation of IPO, named IPO-MD, that leverages the regularised sampling approach proposed by Nash-MD. This equivalence may seem surprising at first sight, since IPO is an offline method whereas Nash-MD is an online method using a preference model. However, this equivalence can be proven when we consider the online version of IPO, that is when both generations are sampled by the online policy and annotated by a trained preference model. Optimising the IPO loss with such a stream of data becomes then equivalent to finding the Nash equilibrium of the preference model through self-play. Building on this equivalence, we introduce the IPO-MD algorithm that generates data with a mixture policy (between the online and reference policy) similarly as the general Nash-MD algorithm. We compare online-IPO and IPO-MD to different online versions of existing losses on preference data such as DPO and SLiC on a summarisation task.

Human Alignment of LLMs through Online Preference Optimization

Overview of the Paper

This paper addresses human alignment of LLMs through online preference optimization. The authors first establish the equivalence between two prominent alignment methods, Identity Policy Optimization (IPO) and Nash Mirror Descent (Nash-MD), and support this unexpected connection with a theoretical analysis. They then introduce IPO-MD, a generalization of IPO that adopts the regularized sampling approach of Nash-MD and aims to combine the strengths of both methods, pointing to a practical way of refining preference optimization for LLMs.

Theoretical Insights

The equivalence between IPO and Nash-MD is surprising because the two methods are usually viewed through separate lenses: IPO as an offline method and Nash-MD as an online method built on a preference model. The equivalence holds for the online variant of IPO, in which both generations are sampled from the current (online) policy and annotated by a trained preference model. Optimizing the IPO loss on this stream of data is then equivalent to finding the Nash equilibrium of the preference model through self-play. This result reframes the two approaches as complementary rather than disparate, each with its own practical advantages for human alignment.
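
A minimal Python sketch of this online data loop is given below. The `policy`, `ref`, and `pref_model` objects with their `sample`, `logprob`, and `prob_first_wins` methods are hypothetical interfaces, and the choice of `beta` and the annotate-by-sampling step are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def ipo_loss(logp_pi_w, logp_pi_l, logp_ref_w, logp_ref_l, beta=0.1):
    """Squared-error IPO loss for one (winner, loser) pair.

    The margin h is the difference of policy/reference log-ratios and is
    regressed towards 1/(2*beta), with beta the KL-regularisation strength.
    """
    h = (logp_pi_w - logp_ref_w) - (logp_pi_l - logp_ref_l)
    return (h - 1.0 / (2.0 * beta)) ** 2

def online_ipo_example(policy, ref, pref_model, prompt, beta=0.1,
                       rng=np.random.default_rng()):
    """One step of the online-IPO data stream described above (a sketch)."""
    # Both generations are sampled from the *online* policy ...
    y1, y2 = policy.sample(prompt), policy.sample(prompt)
    # ... and the trained preference model annotates the pair.
    p_first_wins = pref_model.prob_first_wins(prompt, y1, y2)
    y_w, y_l = (y1, y2) if rng.random() < p_first_wins else (y2, y1)
    return ipo_loss(policy.logprob(prompt, y_w), policy.logprob(prompt, y_l),
                    ref.logprob(prompt, y_w), ref.logprob(prompt, y_l),
                    beta=beta)
```

Minimizing this loss over a stream of such pairs is, per the paper's argument, equivalent to Nash-MD-style self-play against the preference model.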

IPO-MD Algorithm

Building on this equivalence, the authors propose the IPO-MD algorithm, which blends online IPO with the Nash-MD framework. IPO-MD is a contrastive algorithm that samples its training data from a mixture policy interpolating between the current online policy and the reference policy, aiming to combine the best aspects of both methods. This hybrid approach seeks to balance learning efficiency with robustness to shifting data distributions, and thereby to improve the alignment process.
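
One plausible way to realize such a mixture at the token level, in the spirit of Nash-MD, is a geometric mixture of the two next-token distributions. The sketch below assumes hypothetical `policy_logits_fn` and `ref_logits_fn` callables that return next-token logits (as NumPy arrays) for a partial token sequence; the function names, the value of `beta`, and the stopping rule are illustrative assumptions rather than the paper's exact procedure:

```python
import numpy as np

def sample_mixture(policy_logits_fn, ref_logits_fn, prompt_ids,
                   beta=0.125, max_new_tokens=64, eos_id=1,
                   rng=np.random.default_rng()):
    """Sample a completion from a token-level geometric mixture policy.

    At each step the next-token distribution is proportional to
    pi_theta(.|context)**(1 - beta) * pi_ref(.|context)**beta, which
    amounts to interpolating the logits before the softmax.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        mixed = ((1.0 - beta) * policy_logits_fn(ids)
                 + beta * ref_logits_fn(ids))   # log of the geometric mixture (up to a constant)
        probs = np.exp(mixed - mixed.max())     # numerically stable softmax
        probs /= probs.sum()
        next_id = int(rng.choice(len(probs), p=probs))
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids
```

With `beta = 0` this reduces to sampling from the online policy (online IPO); larger `beta` keeps samples closer to the reference policy, which corresponds to the regularized sampling that Nash-MD proposes.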

Experimental Insights

The paper presents an experimental comparison of online IPO and IPO-MD against online versions of existing preference-optimization losses such as DPO and SLiC, focusing on a summarization task. The results support the robustness of IPO-MD in aligning LLM outputs with human preferences and demonstrate the practical viability of the theoretically motivated algorithm, offering a promising outlook for future work on LLM alignment.
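
For context, the pairwise losses being compared differ mainly in how they penalize the margin h = [log pi(y_w|x) - log pi_ref(y_w|x)] - [log pi(y_l|x) - log pi_ref(y_l|x)]. The parameterizations below (in particular the SLiC hinge margin `delta`) follow one common convention and are only a sketch, not the paper's exact formulations:

```python
import numpy as np

def dpo_loss(h, beta=0.1):
    """Logistic (DPO-style) loss: -log sigmoid(beta * h)."""
    return float(np.log1p(np.exp(-beta * h)))

def ipo_loss(h, beta=0.1):
    """Squared (IPO-style) loss: regress the margin towards 1/(2*beta)."""
    return (h - 1.0 / (2.0 * beta)) ** 2

def slic_loss(h, beta=1.0, delta=1.0):
    """Hinge (SLiC-style) loss: max(0, delta - beta * h); this form is an assumption."""
    return max(0.0, delta - beta * h)
```

In the online setting studied here, h is computed on pairs sampled and annotated as in the earlier sketches.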

Implications and Future Directions

The theoretical analysis and empirical results presented in this paper underscore the potential of combining offline and online optimization strategies for aligning LLMs with human preferences more effectively. The findings suggest that hybrid approaches, exemplified by IPO-MD, could offer more nuanced control over the alignment process, potentially mitigating issues related to reward hacking and overoptimization. This research opens up new avenues for exploring advanced preference optimization methods, suggesting a future where LLMs can be more reliably and efficiently aligned with human values and expectations.

As the field of generative AI and LLMs continues to evolve, the insights from this paper may guide the development of more sophisticated and human-aligned models. Further research could expand the applicability of these findings across diverse tasks and settings, enhancing our understanding of the intricate relationship between LLM behavior and human preferences.

References (54)
  1. Concrete problems in AI safety. arXiv, 2016.
  2. PaLM 2 technical report. arXiv, 2023.
  3. A general theoretical paradigm to understand learning from human preferences. arXiv, 2023.
  4. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv, 2022a.
  5. Constitutional AI: Harmlessness from AI feedback. arXiv, 2022b.
  6. Convex Optimization. Cambridge University Press, 2004.
  7. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
  8. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv, 2023.
  9. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, 2017.
  10. Reward model ensembles help mitigate overoptimization. arXiv, 2023.
  11. RAFT: Reward rAnked FineTuning for generative foundation model alignment. arXiv, 2023.
  12. Helping or herding? Reward model ensembles mitigate but do not eliminate reward hacking. arXiv, 2023.
  13. Off-policy deep reinforcement learning without exploration. In Proceedings of the International Conference on Machine Learning, 2019.
  14. Scaling laws for reward model overoptimization. In Proceedings of the International Conference on Machine Learning, 2022.
  15. Improving alignment of dialogue agents via targeted human judgements. arXiv, 2022.
  16. Policy shaping: Integrating human feedback with reinforcement learning. In Advances in Neural Information Processing Systems, 2013.
  17. Contrastive preference learning: Learning from human feedback without RL. arXiv, 2023.
  18. Camels in a changing climate: Enhancing LM adaptation with Tulu 2. arXiv, 2023.
  19. Way off-policy batch deep reinforcement learning of implicit human preferences in dialog. arXiv, 2019.
  20. TPU v4: An optically reconfigurable supercomputer for machine learning with hardware support for embeddings. In Proceedings of the Annual International Symposium on Computer Architecture, 2023.
  21. Understanding the effects of RLHF on LLM generalisation and diversity. arXiv, 2023.
  22. TAMER: Training an agent manually via evaluative reinforcement. In Proceedings of the IEEE International Conference on Development and Learning, 2008.
  23. RLAIF: Scaling reinforcement learning from human feedback with AI feedback. arXiv, 2023.
  24. Statistical rejection sampling improves preference optimization. arXiv, 2023.
  25. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2016.
  26. Nash learning from human feedback. arXiv, 2023.
  27. WebGPT: Browser-assisted question-answering with human feedback. arXiv, 2021.
  28. OpenAI. Introducing ChatGPT, 2022. URL https://openai.com/blog/chatgpt.
  29. Training language models to follow instructions with human feedback. arXiv, 2022.
  30. The effects of reward misspecification: Mapping and mitigating misaligned models. arXiv, 2022.
  31. Reward gaming in conditional text generation. In Annual Meeting of the Association for Computational Linguistics, 2022.
  32. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, 2023.
  33. WARM: On the benefits of weight averaged reward models. arXiv, 2024.
  34. Scaling up models and data with t5x and seqio. arXiv, 2022.
  35. Proximal policy optimization algorithms. arXiv, 2017.
  36. Don’t Give Me the Details, Just the Summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2018.
  37. Adafactor: Adaptive learning rates with sublinear memory cost. arXiv, 2018.
  38. Benchmarks and algorithms for offline preference-based reward learning. arXiv, 2023.
  39. A long way to go: Investigating length correlations in RLHF. arXiv, 2023.
  40. Defining and characterizing reward gaming. In Advances in Neural Information Processing Systems, 2022.
  41. Learning to summarize with human feedback. In Advances in Neural Information Processing Systems, 2020.
  42. A minimaximalist approach to reinforcement learning from human feedback. arXiv, 2024.
  43. Generalized preference optimization: A unified approach to offline alignment. arXiv, 2024.
  44. Llama 2: Open foundation and fine-tuned chat models. arXiv, 2023.
  45. Zephyr: Direct distillation of LM alignment. arXiv, 2023.
  46. TL;DR: Mining Reddit to learn automatic summarization. In Proceedings of the Workshop on New Frontiers in Summarization. Association for Computational Linguistics, 2017.
  47. Beyond reverse KL: Generalizing direct preference optimization with diverse divergence constraints. arXiv, 2023.
  48. Deep TAMER: Interactive agent shaping in high-dimensional state spaces. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
  49. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In Proceedings of the International Conference on Machine Learning, 2022.
  50. Behavior regularized offline reinforcement learning. arXiv, 2019.
  51. Self-rewarding language models. arXiv, 2024.
  52. RRHF: Rank responses to align language models with human feedback without tears. arXiv, 2023.
  53. SLiC-HF: Sequence likelihood calibration with human feedback. arXiv, 2023.
  54. Consequences of misaligned AI. In Advances in Neural Information Processing Systems, 2020.
Authors (13)
  1. Daniele Calandriello (34 papers)
  2. Daniel Guo (7 papers)
  3. Mark Rowland (57 papers)
  4. Yunhao Tang (63 papers)
  5. Bernardo Avila Pires (21 papers)
  6. Pierre Harvey Richemond (5 papers)
  7. Charline Le Lan (15 papers)
  8. Michal Valko (91 papers)
  9. Tianqi Liu (49 papers)
  10. Rishabh Joshi (23 papers)
  11. Zeyu Zheng (60 papers)
  12. Bilal Piot (40 papers)
  13. Remi Munos (45 papers)
Citations (41)