
Self-Play Preference Optimization for Language Model Alignment (2405.00675v5)

Published 1 May 2024 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract: Standard reinforcement learning from human feedback (RLHF) approaches relying on parametric models like the Bradley-Terry model fall short in capturing the intransitivity and irrationality in human preferences. Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences, enabling more flexible and accurate LLM alignment. In this paper, we propose a self-play-based method for LLM alignment, which treats the problem as a constant-sum two-player game aimed at identifying the Nash equilibrium policy. Our approach, dubbed Self-Play Preference Optimization (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium. Additionally, we propose a new SPPO objective which is both strongly motivated by theory and is simple and effective in practice. In our experiments, using only 60k prompts (without responses) from the UltraFeedback dataset and without any prompt augmentation, by leveraging a pre-trained preference model PairRM with only 0.4B parameters, SPPO can obtain a model from fine-tuning Mistral-7B-Instruct-v0.2 that achieves the state-of-the-art length-controlled win-rate of 28.53% against GPT-4-Turbo on AlpacaEval 2.0. It also outperforms the (iterative) DPO and IPO on MT-Bench, Arena-Hard, and the Open LLM Leaderboard. Starting from a stronger base model Llama-3-8B-Instruct, we are able to achieve a length-controlled win rate of 38.77%. Notably, the strong performance of SPPO is achieved without additional external supervision (e.g., responses, preferences, etc.) from GPT-4 or other stronger LLMs. Codes are available at https://github.com/uclaml/SPPO.

Understanding Self-Play Preference Optimization for Aligning LLMs

Introduction

Reinforcement Learning from Human Feedback (RLHF) has been central to making LLMs produce responses that people actually prefer. However, existing RLHF techniques rely heavily on parametric models such as the Bradley-Terry model, which cannot adequately capture the complexity and non-transitivity of human preferences. The paper introduces Self-Play Preference Optimization (SPPO), which recasts alignment as approximating the Nash equilibrium of a two-player constant-sum game and refines the LLM through iterative self-play updates, aligning its responses more closely with human preferences.
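Written out in the notation commonly used for this game-theoretic view of alignment (the symbols below follow the standard formulation rather than being quoted from the paper), the object SPPO targets is the Nash equilibrium of the constant-sum game

$$
(\pi^{*}, \pi^{*}) \;=\; \arg\max_{\pi}\,\min_{\pi'}\;
\mathbb{E}_{x \sim \mathcal{X}}\,
\mathbb{E}_{y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)}
\big[\, \mathbb{P}(y \succ y' \mid x) \,\big],
$$

where $\mathbb{P}(y \succ y' \mid x)$ is the probability that a human (or a preference model standing in for one) prefers response $y$ over $y'$ for prompt $x$. Because $\mathbb{P}(y \succ y' \mid x) + \mathbb{P}(y' \succ y \mid x) = 1$, the game is constant-sum and admits a symmetric Nash equilibrium with value 1/2.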

What SPPO Brings to the Table

SPPO departs from traditional RLHF by working directly with preference probabilities rather than fitting a parametric reward model, which gives it more flexibility in capturing human preferences. Here's what makes SPPO stand out:

  • Provably Convergent: SPPO's iterative updates follow a multiplicative-weights scheme, so the sequence of policies provably approaches the Nash equilibrium of the underlying preference game.
  • Practical Excellence: Fine-tuned on UltraFeedback prompts with the PairRM preference model, SPPO achieves a 28.53% length-controlled win rate against GPT-4-Turbo on AlpacaEval 2.0.
  • Deep Focus on Preference Interactions: Unlike symmetric pairwise losses such as DPO and IPO, which optimize only the gap between chosen and rejected responses, the SPPO objective explicitly increases the log-likelihood of the chosen response and decreases that of the rejected one (a minimal sketch of this objective follows the list).
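To make the last point concrete, here is a minimal sketch of an SPPO-style square loss, assuming the per-response log-probabilities under the current and previous policies and the win-probability estimates from a preference model have already been computed. The function and variable names, and the eta value, are illustrative rather than taken from the released code.

```python
import torch

def sppo_square_loss(logp_theta: torch.Tensor,
                     logp_prev: torch.Tensor,
                     win_prob: torch.Tensor,
                     eta: float = 1e3) -> torch.Tensor:
    """Illustrative SPPO-style square loss for a batch of sampled responses.

    logp_theta: summed log-probabilities of each response under the policy
                being trained, shape (B,).
    logp_prev:  summed log-probabilities under the frozen previous-iteration
                policy pi_t, shape (B,).
    win_prob:   estimated probability that each response beats a response
                drawn from pi_t for the same prompt, shape (B,).
    eta:        scaling hyperparameter of the objective (illustrative value).
    """
    log_ratio = logp_theta - logp_prev         # log pi_theta(y|x) - log pi_t(y|x)
    target = eta * (win_prob - 0.5)            # winners get a positive target,
                                               # losers a negative one
    return ((log_ratio - target) ** 2).mean()  # square loss, averaged over batch
```

Because the regression target is eta * (P - 1/2), a response that wins more often than it loses has its likelihood pushed up, and one that loses more often has it pushed down, which is exactly the asymmetry the bullet above contrasts with DPO and IPO.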

Theoretical Foundations and Practical Implications

SPPO builds its methodology around each model iteration playing against its predecessor, refining the policy through self-play in a way that is both practical and theoretically grounded. Two points stand out:

  • Effective Self-Play: By iteratively playing against its own previous iteration, the model is exposed to a diverse range of responses it generated in the past and shifts probability mass toward those that are preferred, enriching its response quality over time (the update behind this is written out after the list).
  • Handling Non-Transitivity: Because SPPO works with preference probabilities directly, it can accommodate non-transitive preferences that scalar-score models such as Bradley-Terry cannot represent.
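The convergence argument rests on the classical multiplicative-weights (exponential-weight) update; in the notation above (again standard rather than quoted from the paper), each iteration ideally sets

$$
\pi_{t+1}(y \mid x) \;\propto\; \pi_t(y \mid x)\,
\exp\!\big(\eta\, \mathbb{P}(y \succ \pi_t \mid x)\big),
\qquad
\mathbb{P}(y \succ \pi_t \mid x) \;=\;
\mathbb{E}_{y' \sim \pi_t(\cdot \mid x)}\big[\mathbb{P}(y \succ y' \mid x)\big].
$$

The SPPO loss trains the next policy to approximate this update from a finite number of sampled responses per prompt, with the win probability $\mathbb{P}(y \succ \pi_t \mid x)$ estimated by comparing each sampled response against the others using the preference model.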

SPPO Experimentation and Observations

SPPO's real-world application involves a series of experiments in which an instruction-tuned base model is improved over several self-play iterations with minimal external supervision. Some notable achievements include:

  • Strong Performance Across Benchmarks: In comparative studies with existing methods, SPPO consistently demonstrates superior ability to align LLM outputs with human preferences across various benchmarks like MT-Bench and the Open LLM Leaderboard.
  • Scalability and Efficiency: Despite relying on a small 0.4B-parameter preference model (PairRM) and only 60k prompts, SPPO matches or even surpasses much larger models in head-to-head comparisons.

Future Directions and Speculation

Looking ahead, SPPO points toward efficient and scalable approaches to LLM alignment. Future research could explore:

  • Broader Application Domains: Applying SPPO in other areas of AI, such as automated dialog systems or personalized learning environments, could provide increased interactivity and satisfaction.
  • Improvements in Sampling and Estimation: Enhancements in how responses are sampled and preferences are estimated could lead to even more robust models.
  • Integration with Other Learning Paradigms: Combining SPPO's approach with other machine learning paradigms might yield interesting synergies, particularly in areas requiring nuanced understanding of human feedback.

In summary, the SPPO framework strengthens the foundation of RLHF for LLMs with theoretical guarantees while also performing strongly in empirical tests. This dual strength paves the way for crafting more responsive and human-aligned LLMs in the future.

Authors (6)
  1. Yue Wu
  2. Zhiqing Sun
  3. Huizhuo Yuan
  4. Kaixuan Ji
  5. Yiming Yang
  6. Quanquan Gu