Nash Learning from Human Feedback (2312.00886v4)

Published 1 Dec 2023 in stat.ML, cs.AI, cs.GT, cs.LG, and cs.MA

Abstract: Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning LLMs with human preferences. Typically, RLHF involves the initial step of learning a reward model from human feedback, often expressed as preferences between pairs of text generations produced by a pre-trained LLM. Subsequently, the LLM's policy is fine-tuned by optimizing it to maximize the reward model through a reinforcement learning algorithm. However, an inherent limitation of current reward models is their inability to fully represent the richness of human preferences and their dependency on the sampling distribution. In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a preference model, which is conditioned on two inputs given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF). In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iteration converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. To demonstrate the effectiveness of our approach, we present experimental results involving the fine-tuning of an LLM for a text summarization task. We believe NLHF offers a compelling avenue for preference learning and policy optimization with the potential of advancing the field of aligning LLMs with human preferences.

An Analysis of Nash Learning from Human Feedback in LLMs

The paper "Nash Learning from Human Feedback" presents a nuanced exploration of aligning LLMs with human preferences through a novel approach leveraging game-theoretic principles. This paper introduces an alternative to the traditional Reinforcement Learning from Human Feedback (RLHF) paradigm by focusing on direct preference modeling and computing Nash equilibria. The authors argue for the use of preference models as a more expressive and effective mechanism than reward models for capturing human preferences in the context of LLM fine-tuning.

Key Contributions

  1. Preference Model vs. Reward Model: The paper emphasizes the limitations of traditional reward models, often based on the Bradley-Terry model or Elo ratings, suggesting that they fail to capture the richness and complexity of human preferences. The authors propose leveraging preference models that handle non-transitive preferences and are distributionally robust, making them less sensitive to the policy used for data collection.
  2. Nash Equilibrium as an Objective: The core proposal is to shift from maximizing a reward model to computing the Nash equilibrium of a preference model. Because this equilibrium is defined by mutual best responses, the authors argue that it aligns more faithfully with diverse, and possibly conflicting, human preferences (a minimal formalization of this objective is sketched after this list).
  3. Algorithmic Innovation with Nash-MD: The paper introduces Nash-MD, a novel variant of mirror descent designed to converge, in its last iterate, to the Nash equilibrium of the regularized preference model. Each step performs a mirror-descent update against a mixture policy that balances between the initial and current policies, which makes the method scalable and removes the need to store past policies (a toy tabular version is sketched below, after the formalization).
  4. Experimental Analysis: The paper presents comprehensive experimentation on text summarization tasks to demonstrate the efficacy of the proposed Nash learning approach. The results indicate that leveraging a preference model and Nash equilibrium provides improved alignment with human preferences compared to RLHF baselines.
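
To make the contrast in items 1 and 2 concrete, the following is a minimal formalization. The notation is standard for this line of work rather than quoted from the paper: x is a prompt, y and y' are candidate responses, sigma is the logistic function, and mu denotes the reference policy used for regularization.

```latex
% Bradley-Terry reward model: pairwise preferences are induced by a pointwise score r.
\Pr(y \succ y' \mid x) = \sigma\big(r(x, y) - r(x, y')\big)

% General preference model: conditioned jointly on both responses.
\mathcal{P}(y \succ y' \mid x) \in [0, 1],
\qquad
\mathcal{P}(y \succ y' \mid x) + \mathcal{P}(y' \succ y \mid x) = 1

% Preference of one policy over another, and the Nash objective of the resulting
% symmetric two-player game (adding a KL term toward the reference policy \mu
% yields the regularized equilibrium targeted by Nash-MD).
\mathcal{P}(\pi \succ \pi') =
\mathbb{E}_{x,\; y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)}
\big[\mathcal{P}(y \succ y' \mid x)\big],
\qquad
\pi^\star = \arg\max_{\pi} \min_{\pi'} \mathcal{P}(\pi \succ \pi')
```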

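As a companion to item 3, here is a toy, tabular sketch of a Nash-MD-style update: a multiplicative mirror-descent step taken against a mixture of the current and reference policies, which supplies the KL regularization toward the reference. Everything here (the synthetic preference table, the geometric form of the mixture, and the constants eta and tau) is an illustrative assumption; the paper's exact update rule and its deep-learning parameterization differ from this simplification.

```python
import numpy as np

rng = np.random.default_rng(0)
n_responses = 5                               # toy finite response set for a single prompt

# Synthetic preference table: P[y, y'] = probability that response y beats y'.
P = rng.uniform(size=(n_responses, n_responses))
P = P / (P + P.T)                             # enforce P[y, y'] + P[y', y] = 1

mu = np.full(n_responses, 1.0 / n_responses)  # reference policy (e.g. the supervised model)
pi = mu.copy()                                # current policy, initialized at the reference
eta, tau = 0.3, 0.1                           # step size and regularization strength (toy values)

for _ in range(500):
    # Opponent policy: a mixture of the current and reference policies
    # (geometric form assumed here), which carries the regularization toward mu.
    mix = pi ** (1.0 - eta * tau) * mu ** (eta * tau)
    mix /= mix.sum()

    # Expected preference of each candidate response against the mixture opponent.
    pref_vs_mix = P @ mix

    # Multiplicative mirror-descent step anchored at the mixture policy.
    pi = mix * np.exp(eta * pref_vs_mix)
    pi /= pi.sum()

print("approximate regularized Nash policy:", np.round(pi, 3))
```
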
Implications and Speculation on AI Developments

The implications of this research are substantial, both theoretically and practically. Theoretically, using Nash equilibria in machine learning offers a promising direction for more robust and interpretable training paradigms, particularly where preferences are diverse and possibly conflicting. Practically, the approach could improve how AI systems, especially conversational and assistive agents, interact with users by better understanding and aligning with human intentions.

This work could catalyze further exploration into incorporating game-theoretic concepts into AI training and model optimization. Future developments may include exploring different game-theoretic solution concepts or extending Nash equilibrium frameworks to multi-agent systems where interactions become even more complex.

Conclusion

In conclusion, "Nash Learning from Human Feedback" represents a significant stride toward better aligning LLMs with human expectations. By advocating a preference-centric approach and employing Nash equilibria, this research provides a compelling alternative to the conventional RLHF framework, and it moves toward AI systems whose decision-making resonates more naturally with human values and social norms. Future investigations will likely address scalability, the integration of richer feedback mechanisms, and the extension of these concepts to broader AI domains.

Authors (17)
  1. Rémi Munos (121 papers)
  2. Michal Valko (91 papers)
  3. Daniele Calandriello (34 papers)
  4. Mohammad Gheshlaghi Azar (31 papers)
  5. Mark Rowland (57 papers)
  6. Yunhao Tang (63 papers)
  7. Matthieu Geist (93 papers)
  8. Andrea Michi (6 papers)
  9. Marco Selvi (6 papers)
  10. Sertan Girgin (24 papers)
  11. Nikola Momchev (12 papers)
  12. Olivier Bachem (52 papers)
  13. Daniel J. Mankowitz (28 papers)
  14. Doina Precup (206 papers)
  15. Bilal Piot (40 papers)
  16. Zhaohan Daniel Guo (15 papers)
  17. Thomas Mesnard (18 papers)
Citations (88)