Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences (2404.03715v1)

Published 4 Apr 2024 in cs.LG, cs.AI, and cs.CL

Abstract: This paper studies post-training LLMs using preference feedback from a powerful oracle to help a model iteratively improve over itself. The typical approach for post-training LLMs involves Reinforcement Learning from Human Feedback (RLHF), which traditionally separates reward learning from subsequent policy optimization. However, such a reward maximization approach is limited by the nature of "point-wise" rewards (such as the Bradley-Terry model), which fail to express complex intransitive or cyclic preference relations. While advances in RLHF show that reward learning and policy optimization can be merged into a single contrastive objective for stability, they still remain tethered to the reward maximization framework. Recently, a new wave of research has sidestepped the reward maximization presumption in favor of directly optimizing over "pair-wise" or general preferences. In this paper, we introduce Direct Nash Optimization (DNO), a provable and scalable algorithm that marries the simplicity and stability of contrastive learning with the theoretical generality of optimizing general preferences. Because DNO is a batched on-policy algorithm using a regression-based objective, its implementation is straightforward and efficient. Moreover, DNO enjoys monotonic improvement across iterations, which helps it improve even over a strong teacher (such as GPT-4). In our experiments, a resulting 7B parameter Orca-2.5 model aligned by DNO achieves a state-of-the-art win rate against GPT-4-Turbo of 33% on AlpacaEval 2.0 (even after controlling for response length), an absolute gain of 26% (7% to 33%) over the initializing model. It outperforms models with far more parameters, including Mistral Large, Self-Rewarding LM (70B parameters), and older versions of GPT-4.

Exploring Direct Nash Optimization for Self-Improving LLMs

Introduction to Direct Nash Optimization

In artificial intelligence research, and particularly in the development of LLMs, aligning models with complex human preferences has emerged as a significant challenge. Traditional approaches to post-training LLMs, such as Reinforcement Learning from Human Feedback (RLHF), have focused on maximizing scalar rewards. However, this methodology encounters limitations when expressing general preferences, especially intransitive or cyclic preference relations. Addressing this challenge, the paper on Direct Nash Optimization (DNO) presents a novel framework that diverges from the conventional reward-focused paradigm, embracing the optimization of general preferences through a scalable, contrastive-learning-based algorithm.

Key Contributions of the Study

The paper introduces DNO, an algorithm that combines the theoretical robustness of optimizing general preferences with the practical efficiency and stability of contrastive learning. The following points summarize the critical contributions and findings of this work:

  1. Algorithmic Foundation: DNO leverages batched on-policy iterations alongside a regression-based objective, yielding a stable and efficient approach to optimizing general preferences that sidesteps explicit reward function computation (a minimal sketch follows this list).
  2. Theoretical Insights: The paper demonstrates, through theoretical analysis, that DNO converges on average to a Nash equilibrium of the preference game, i.e., a policy that no alternative policy is preferred to more than half the time under the oracle, providing a mathematical underpinning for learning from general preference feedback.
  3. Practical Efficacy: Empirical evaluations show that DNO, applied to a 7B parameter LLM (Orca-2.5), outperforms much larger counterparts, achieving a state-of-the-art 33% win rate against GPT-4-Turbo on AlpacaEval 2.0, even after controlling for response length.
  4. Monotonic Improvement: DNO is proven to exhibit monotonic improvement across iterations, ensuring consistent progress in aligning the LLM with the targeted preferences.
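
To make the batched on-policy recipe in point 1 concrete, below is a minimal, self-contained sketch of one DNO-style iteration on a toy categorical policy. The ToyPolicy class, the deterministic prefers comparator, and the simplified logit updates are illustrative assumptions rather than the paper's implementation; in the paper, the policy is an LLM, the oracle is a strong annotator such as GPT-4, and the regression step is a DPO-style contrastive loss over oracle-labeled on-policy samples.

```python
import math
import random

class ToyPolicy:
    """Stand-in for an LLM policy: a categorical distribution over a few responses."""
    def __init__(self, logits):
        self.logits = dict(logits)                  # response -> unnormalized log-probability

    def probs(self):
        z = sum(math.exp(v) for v in self.logits.values())
        return {y: math.exp(v) / z for y, v in self.logits.items()}

    def sample(self, k):
        p = self.probs()
        return random.choices(list(p), weights=list(p.values()), k=k)

    def logprob(self, y):
        return math.log(self.probs()[y])

def dno_iteration(policy, prefers, beta=0.1, lr=0.5, num_samples=8, steps=50):
    """One batched on-policy iteration: sample from the current policy, label pairs
    with the preference oracle, then fit the next policy by a DPO-style regression
    that uses the current policy as the frozen reference."""
    samples = policy.sample(num_samples)
    pairs = [(a, b) for a in samples for b in samples if a != b and prefers(a, b)]
    ref = ToyPolicy(policy.logits)                  # frozen reference = current policy
    new = ToyPolicy(policy.logits)                  # next policy, updated below
    for _ in range(steps):
        for winner, loser in pairs:
            # Contrastive margin: beta * [(log new(w) - log ref(w)) - (log new(l) - log ref(l))]
            margin = beta * ((new.logprob(winner) - ref.logprob(winner))
                             - (new.logprob(loser) - ref.logprob(loser)))
            weight = 1.0 - 1.0 / (1.0 + math.exp(-margin))   # = 1 - sigmoid(margin)
            # Simplified update: push the winner's logit up and the loser's down,
            # scaled as in the DPO gradient (stronger while the margin is still small).
            new.logits[winner] += lr * beta * weight
            new.logits[loser] -= lr * beta * weight
    return new

# Toy usage: a deterministic oracle that always prefers the response "good".
oracle = lambda a, b: a == "good"
policy = ToyPolicy({"good": 0.0, "ok": 0.0, "bad": 0.0})
for _ in range(3):                                  # a few DNO iterations
    policy = dno_iteration(policy, oracle)
print(policy.probs())                               # probability mass shifts toward "good"
```

The structural points the sketch mirrors are that comparisons come from the current policy's own samples (on-policy) and that the next policy is fit by regression against the current one as reference, rather than by maximizing a separately learned scalar reward.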

Theoretical and Practical Implications

The exploration of DNO contributes significantly to both the theoretical understanding and practical applications of post-training LLMs with human feedback. Specifically, the paper sheds light on the following aspects:

  • Expressing Complex Preferences: By moving away from scalar reward functions, DNO addresses the critical limitation of expressing complex, non-transitive preferences, paving the way for more nuanced LLM tuning (a small worked check follows this list).
  • Stability and Efficiency: The batched on-policy approach, combined with a regression-based objective, marks a stride towards achieving both theoretical soundness and practical efficiency in learning from human feedback.
  • Benchmark Performance: The state-of-the-art performance of the resulting 7B parameter model underscores DNO's effectiveness in real-world applications, suggesting its potential as a new standard for post-training LLMs.
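
As a concrete illustration of the first bullet, the toy check below (a hypothetical example, not from the paper) enumerates scalar score assignments for three responses locked in a preference cycle a ≻ b ≻ c ≻ a and confirms that no point-wise reward, Bradley-Terry or otherwise, can rank them consistently; a pairwise preference function, by contrast, represents the cycle directly.

```python
from itertools import permutations

# Cyclic preference among three responses: a beats b, b beats c, and c beats a.
cycle = [("a", "b"), ("b", "c"), ("c", "a")]        # (winner, loser) pairs

def consistent(scores):
    """True if a point-wise score ranks every winner strictly above its loser."""
    return all(scores[w] > scores[l] for w, l in cycle)

# A scalar reward induces some ordering of the three responses; try every strict
# ordering (ties cannot satisfy a strict preference either) against the cycle.
found = any(consistent(dict(zip(order, (3, 2, 1)))) for order in permutations("abc"))
print("Some scalar reward is consistent with the cycle:", found)   # -> False
```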

Future Directions

While DNO marks a significant advancement in the alignment of LLMs with human preferences, it also opens avenues for further exploration. Future work could focus on extending the algorithm to broader applications beyond text generation, exploring the integration of DNO with other LLM architectures, and further refining the algorithm for even greater efficiency and scalability.

Conclusion

The development and analysis of Direct Nash Optimization represent a noteworthy advancement in optimizing LLMs for alignment with human preferences. By demonstrating the effectiveness of this approach both theoretically and empirically, the research sets a precedent for future efforts to fine-tune LLMs in a manner that more accurately reflects the intricacies of human preferences.

Authors
  1. Corby Rosset
  2. Ching-An Cheng
  3. Arindam Mitra
  4. Michael Santacroce
  5. Ahmed Awadallah
  6. Tengyang Xie