
Self-Play Preference Optimization for Language Model Alignment (2405.00675v5)

Published 1 May 2024 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract: Standard reinforcement learning from human feedback (RLHF) approaches relying on parametric models like the Bradley-Terry model fall short in capturing the intransitivity and irrationality in human preferences. Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences, enabling more flexible and accurate LLM alignment. In this paper, we propose a self-play-based method for LLM alignment, which treats the problem as a constant-sum two-player game aimed at identifying the Nash equilibrium policy. Our approach, dubbed Self-Play Preference Optimization (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium. Additionally, we propose a new SPPO objective which is both strongly motivated by theory and is simple and effective in practice. In our experiments, using only 60k prompts (without responses) from the UltraFeedback dataset and without any prompt augmentation, by leveraging a pre-trained preference model PairRM with only 0.4B parameters, SPPO can obtain a model from fine-tuning Mistral-7B-Instruct-v0.2 that achieves the state-of-the-art length-controlled win-rate of 28.53% against GPT-4-Turbo on AlpacaEval 2.0. It also outperforms the (iterative) DPO and IPO on MT-Bench, Arena-Hard, and the Open LLM Leaderboard. Starting from a stronger base model Llama-3-8B-Instruct, we are able to achieve a length-controlled win rate of 38.77%. Notably, the strong performance of SPPO is achieved without additional external supervision (e.g., responses, preferences, etc.) from GPT-4 or other stronger LLMs. Codes are available at https://github.com/uclaml/SPPO.

Understanding Self-Play Preference Optimization for Aligning LLMs

Introduction

Reinforcement Learning from Human Feedback (RLHF) has been central to making LLMs produce responses that people actually prefer. However, existing RLHF techniques rely heavily on parametric models such as the Bradley-Terry model, which cannot adequately capture the complexity and non-transitivity of human preferences. The paper introduces Self-Play Preference Optimization (SPPO), which recasts alignment as approximating the Nash equilibrium of a two-player constant-sum game and refines the LLM through iterative self-play updates, aligning its responses more closely with human preferences.
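Written out in the notation commonly used for this game-theoretic view of alignment (the symbols below follow the standard formulation rather than being quoted from the paper), the object SPPO targets is the Nash equilibrium of the constant-sum game

$$
(\pi^{*}, \pi^{*}) \;=\; \arg\max_{\pi}\,\min_{\pi'}\;
\mathbb{E}_{x \sim \mathcal{X}}\,
\mathbb{E}_{y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)}
\big[\, \mathbb{P}(y \succ y' \mid x) \,\big],
$$

where $\mathbb{P}(y \succ y' \mid x)$ is the probability that a human (or a preference model standing in for one) prefers response $y$ over $y'$ for prompt $x$. Because $\mathbb{P}(y \succ y' \mid x) + \mathbb{P}(y' \succ y \mid x) = 1$, the game is constant-sum and admits a symmetric Nash equilibrium with value 1/2.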

What SPPO Brings to the Table

SPPO departs from traditional RLHF by working directly with preference probabilities rather than fitting a parametric reward model, which gives it more flexibility in capturing human preferences. Here's what makes SPPO stand out:

  • Provably Convergent: SPPO's iterative updates follow a multiplicative-weights scheme, so the sequence of policies provably approaches the Nash equilibrium of the underlying preference game.
  • Practical Excellence: Fine-tuned on UltraFeedback prompts with the PairRM preference model, SPPO achieves a 28.53% length-controlled win rate against GPT-4-Turbo on AlpacaEval 2.0.
  • Deep Focus on Preference Interactions: Unlike symmetric pairwise losses such as DPO and IPO, which optimize only the gap between chosen and rejected responses, the SPPO objective explicitly increases the log-likelihood of the chosen response and decreases that of the rejected one (a minimal sketch of this objective follows the list).
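To make the last point concrete, here is a minimal sketch of an SPPO-style square loss, assuming the per-response log-probabilities under the current and previous policies and the win-probability estimates from a preference model have already been computed. The function and variable names, and the eta value, are illustrative rather than taken from the released code.

```python
import torch

def sppo_square_loss(logp_theta: torch.Tensor,
                     logp_prev: torch.Tensor,
                     win_prob: torch.Tensor,
                     eta: float = 1e3) -> torch.Tensor:
    """Illustrative SPPO-style square loss for a batch of sampled responses.

    logp_theta: summed log-probabilities of each response under the policy
                being trained, shape (B,).
    logp_prev:  summed log-probabilities under the frozen previous-iteration
                policy pi_t, shape (B,).
    win_prob:   estimated probability that each response beats a response
                drawn from pi_t for the same prompt, shape (B,).
    eta:        scaling hyperparameter of the objective (illustrative value).
    """
    log_ratio = logp_theta - logp_prev         # log pi_theta(y|x) - log pi_t(y|x)
    target = eta * (win_prob - 0.5)            # winners get a positive target,
                                               # losers a negative one
    return ((log_ratio - target) ** 2).mean()  # square loss, averaged over batch
```

Because the regression target is eta * (P - 1/2), a response that wins more often than it loses has its likelihood pushed up, and one that loses more often has it pushed down, which is exactly the asymmetry the bullet above contrasts with DPO and IPO.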

Theoretical Foundations and Practical Implications

SPPO builds its methodology around each model iteration playing against its predecessor, refining the policy through self-play in a way that is both practical and theoretically grounded. Two points stand out:

  • Effective Self-Play: By iteratively playing against its own previous iteration, the model is exposed to a diverse range of responses it generated in the past and shifts probability mass toward those that are preferred, enriching its response quality over time (the update behind this is written out after the list).
  • Handling Non-Transitivity: Because SPPO works with preference probabilities directly, it can accommodate non-transitive preferences that scalar-score models such as Bradley-Terry cannot represent.
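The convergence argument rests on the classical multiplicative-weights (exponential-weight) update; in the notation above (again standard rather than quoted from the paper), each iteration ideally sets

$$
\pi_{t+1}(y \mid x) \;\propto\; \pi_t(y \mid x)\,
\exp\!\big(\eta\, \mathbb{P}(y \succ \pi_t \mid x)\big),
\qquad
\mathbb{P}(y \succ \pi_t \mid x) \;=\;
\mathbb{E}_{y' \sim \pi_t(\cdot \mid x)}\big[\mathbb{P}(y \succ y' \mid x)\big].
$$

The SPPO loss trains the next policy to approximate this update from a finite number of sampled responses per prompt, with the win probability $\mathbb{P}(y \succ \pi_t \mid x)$ estimated by comparing each sampled response against the others using the preference model.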

SPPO Experimentation and Observations

SPPO's real-world application involves a series of experiments in which an instruction-tuned base model is improved over several self-play iterations with minimal external supervision. Some notable achievements include:

  • Strong Performance Across Benchmarks: In comparative studies with existing methods, SPPO consistently demonstrates superior ability to align LLM outputs with human preferences across various benchmarks like MT-Bench and the Open LLM Leaderboard.
  • Scalability and Efficiency: Despite relying on a small 0.4B-parameter preference model (PairRM) and only 60k prompts, SPPO matches or even surpasses much larger models in head-to-head comparisons.

Future Directions and Speculation

Looking ahead, SPPO points toward efficient and scalable approaches to LLM alignment. Future research could explore:

  • Broader Application Domains: Applying SPPO in other areas of AI, such as automated dialog systems or personalized learning environments, could provide increased interactivity and satisfaction.
  • Improvements in Sampling and Estimation: Enhancements in how responses are sampled and preferences are estimated could lead to even more robust models.
  • Integration with Other Learning Paradigms: Combining SPPO's approach with other machine learning paradigms might yield interesting synergies, particularly in areas requiring nuanced understanding of human feedback.

In summary, the SPPO framework strengthens the foundation of RLHF for LLMs with theoretical guarantees while also performing strongly in empirical tests. This dual strength paves the way for crafting more responsive and human-aligned LLMs in the future.

Authors (6)
  1. Yue Wu
  2. Zhiqing Sun
  3. Huizhuo Yuan
  4. Kaixuan Ji
  5. Yiming Yang
  6. Quanquan Gu