COMAL: A Convergent Meta-Algorithm for Aligning LLMs with General Preferences (2410.23223v1)

Published 30 Oct 2024 in cs.LG, cs.AI, cs.CL, and cs.GT

Abstract: Many alignment methods, including reinforcement learning from human feedback (RLHF), rely on the Bradley-Terry reward assumption, which is insufficient to capture the full range of general human preferences. To achieve robust alignment with general preferences, we model the alignment problem as a two-player zero-sum game, where the Nash equilibrium policy guarantees at least a 50% win rate against any competing policy. However, previous algorithms for finding the Nash policy either diverge or converge to the Nash policy of a modified game, even in a simple synthetic setting, thereby failing to maintain the 50% win rate guarantee against all other policies. We propose a meta-algorithm, the Convergent Meta Alignment Algorithm (COMAL), for LLM alignment with general preferences, inspired by convergent algorithms in game theory. Theoretically, we prove that our meta-algorithm converges to an exact Nash policy in the last iterate. Additionally, our meta-algorithm is simple and can be integrated with many existing methods designed for RLHF and preference optimization with minimal changes. Experimental results demonstrate the effectiveness of the proposed framework when combined with existing preference policy optimization methods.
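To make the game-theoretic framing concrete: the alignment problem is the symmetric two-player zero-sum game max_pi min_pi' P(pi beats pi'), where P(pi beats pi') is the probability that a response sampled from pi is preferred to one sampled from pi'. Because the game is symmetric, its value is 1/2, so a Nash policy wins at least 50% of comparisons against any opponent. The sketch below illustrates, in a small tabular setting, the kind of proximal-point meta-loop the abstract alludes to: each outer step approximately solves a KL-regularized version of the game around the current reference policy (the regularized game has a unique equilibrium and admits last-iterate convergence), then re-centers the regularizer on the solution. This is an illustrative reconstruction under those assumptions, not the paper's implementation; the function and parameter names (comal_tabular, eta, tau) are hypothetical.

```python
import numpy as np

def comal_tabular(P, outer_iters=20, inner_iters=300, eta=0.1, tau=1.0):
    """Tabular sketch of a proximal-point meta-loop for Nash alignment.

    P[i, j] = probability that response i is preferred to response j,
    so A = P - 0.5 is the payoff matrix of a symmetric zero-sum game
    whose value is 0 (i.e., a 50% win rate in terms of P).
    NOTE: an illustrative reconstruction, not the paper's algorithm.
    """
    n = P.shape[0]
    A = P - 0.5
    pi_ref = np.full(n, 1.0 / n)   # reference policy, uniform at first
    pi = pi_ref.copy()
    for _ in range(outer_iters):
        # Inner loop: mirror-descent self-play on the regularized game,
        # where each player maximizes  pi^T A pi_opp - tau*KL(pi || pi_ref).
        # By symmetry both players' iterates coincide, so we track one pi.
        for _ in range(inner_iters):
            grad = A @ pi          # payoff of each pure response vs. pi
            logits = ((1.0 - eta * tau) * np.log(pi)
                      + eta * tau * np.log(pi_ref)
                      + eta * grad)
            pi = np.exp(logits - logits.max())
            pi /= pi.sum()         # renormalize onto the simplex
        pi_ref = pi.copy()         # proximal step: re-center the regularizer
    return pi
```

On an intransitive, rock-paper-scissors style preference matrix, where no Bradley-Terry reward model can represent the preferences, the loop recovers the uniform Nash policy and a worst-case win rate of 50%:

```python
P = np.array([[0.5, 0.9, 0.1],
              [0.1, 0.5, 0.9],
              [0.9, 0.1, 0.5]])
pi = comal_tabular(P)
print(pi)              # ~ [1/3, 1/3, 1/3]
print((pi @ P).min())  # worst-case win rate vs. any pure response, ~ 0.5
```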

