Direct Preference Optimization (DPO) for Language Model Alignment
Direct Preference Optimization (DPO) is an offline methodology for aligning LLMs and other sequence-to-sequence models with human preferences by directly fitting model outputs to human or synthetic preference data. DPO operates by maximizing the likelihood of preferred responses over dispreferred ones using a preference modeling framework, typically removing the need for an explicit reward model or policy-gradient reinforcement learning. As a general strategy, DPO is recognized for improved stability, scalability, and sample efficiency over traditional RLHF approaches.
1. Core Principles of Direct Preference Optimization
DPO formulates preference alignment as a maximum likelihood estimation (MLE) problem, grounded in the Bradley-Terry model for pairwise preferences. For a prompt $x$ and a pair of responses $(y_1, y_2)$, the central objective models the probability that $y_1$ is preferred over $y_2$ given $x$ as:

$$P(y_1 \succ y_2 \mid x) = \frac{1}{1 + \exp\!\left(\beta \log \dfrac{\pi^*(y_2 \mid x)}{\pi_{\text{sft}}(y_2 \mid x)} - \beta \log \dfrac{\pi^*(y_1 \mid x)}{\pi_{\text{sft}}(y_1 \mid x)}\right)}$$

where:
- $\pi^*$ is the optimal (post-alignment) policy to be learned,
- $\pi_{\text{sft}}$ is the supervised fine-tuned (SFT; reference) policy,
- $\beta$ is a temperature-like hyperparameter balancing preference learning and regularization.
DPO fits a model by minimizing the cross-entropy (logistic loss) between observed preference pairs and this model-implied probability, which directly encodes the pairwise preference relationship:

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{sft}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{sft}}(y_l \mid x)}\right)\right]$$

with $\sigma$ as the sigmoid function and $y_w$, $y_l$ denoting the preferred and dispreferred responses in each pair.
Unlike RLHF (which trains a reward model and then optimizes the policy with reinforcement learning, such as PPO), DPO bypasses the explicit reward model entirely and optimizes the policy directly with an offline, supervised objective on preference data.
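As a concrete illustration of the objective above, the following minimal PyTorch-style sketch computes the DPO loss from summed per-response log-probabilities; the function name, argument layout, and the assumption that log-probabilities are already summed over tokens are ours, not from the paper.

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Minimal DPO loss sketch (illustrative, not the paper's code).

    Each argument is a 1-D tensor of summed log-probabilities log pi(y | x)
    for the preferred (w) or dispreferred (l) response, under either the
    trainable policy or the frozen SFT reference. `beta` is the
    temperature-like regularization hyperparameter.
    """
    # Beta-scaled log-ratios act as implicit rewards.
    chosen = beta * (policy_logp_w - ref_logp_w)
    rejected = beta * (policy_logp_l - ref_logp_l)
    # Logistic (sigmoid cross-entropy) loss on the reward margin.
    return -F.logsigmoid(chosen - rejected).mean()
```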
2. Distributional Mismatch and Data Sampling Limitations
A critical challenge in DPO arises due to distribution mismatch between the optimal policy and the data distribution from which preference pairs are obtained:
- Theoretically, optimality requires preference pairs to be sampled from the optimal policy $\pi^*$.
- In practice, human preferences are collected against SFT or previous policy models, not $\pi^*$, causing a mismatch.
- Consequently, the DPO-optimized policy may not align optimally with the intended preference distribution.
This mismatch can limit the achievable quality and faithfulness of model alignment.
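For context, the distribution DPO implicitly targets has the standard KL-regularized closed form (a well-known identity from the DPO derivation, restated here as background rather than drawn from this section):

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{sft}}(y \mid x)\, \exp\!\left(\frac{r(x, y)}{\beta}\right),$$

where $r$ is the underlying reward and $Z(x)$ is a normalizing constant. Sampling preference pairs from $\pi^*$ therefore means sampling from this reward-reweighted version of the SFT policy, which is precisely what the rejection-sampling scheme of the next section approximates.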
3. Statistical Rejection Sampling Optimization (RSO)
To address these limitations, the Statistical Rejection Sampling Optimization (RSO) framework is introduced, providing a principled mechanism to approximate sampling from the optimal policy via rejection sampling based on a trained reward model:
- Reward Model Training: A pairwise reward model $\rho_\psi(x, y_1, y_2)$ is trained on available human preference data to predict the probability that response $y_1$ is preferred over $y_2$ for a given prompt $x$.
- Pointwise Rewards: Given a baseline response $y_b$, each candidate response $y$ is assigned a pointwise score $r_\psi(x, y) = \operatorname{logit}\big(\rho_\psi(x, y, y_b)\big)$, i.e., the log-odds that $y$ beats the baseline under the pairwise model.
- SFT Sampling: For each prompt, multiple candidate responses are sampled from the SFT policy.
- Rejection Sampling: Each candidate $y$ is accepted into the RSO dataset with probability $\exp\big((r_\psi(x, y) - r_{\max})/\beta\big)$, where $\beta$ is the same regularization hyperparameter as in the DPO objective and $r_{\max}$ is the highest reward among the candidates for that prompt (a code sketch follows at the end of this section). This ensures accepted samples reflect the distribution of an optimal policy biased by the reward model.
- Pair Construction and Training: Accepted candidates are paired and labeled with the reward model, then used to train the final policy via DPO (or other preference-based) loss.
RSO bridges the distributional gap by constructing preference pairs that are statistically closer to those originating from the target post-alignment distribution.
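The following Python sketch illustrates the rejection-sampling step referenced in the list above; it assumes pointwise rewards $r_\psi(x, y)$ have already been computed for each candidate, and the function and variable names are illustrative rather than the paper's implementation.

```python
import math
import random

def rso_rejection_sample(candidates, rewards, beta=0.5):
    """Statistical rejection sampling over SFT decodes (illustrative sketch).

    candidates: responses decoded from the SFT policy for a single prompt.
    rewards:    pointwise rewards r_psi(x, y) for the same candidates.
    beta:       the same regularization temperature as in the DPO objective.

    Each candidate y is accepted with probability exp((r - r_max) / beta),
    approximating draws from pi_sft(y | x) * exp(r(x, y) / beta).
    """
    r_max = max(rewards)
    accepted = []
    for y, r in zip(candidates, rewards):
        if random.random() < math.exp((r - r_max) / beta):
            accepted.append((y, r))
    return accepted
```

Smaller values of `beta` make the filter more selective, concentrating accepted responses near the reward-model optimum; larger values keep the accepted set closer to the raw SFT samples.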
4. Unified Preference Modeling Framework
The paper integrates DPO and SLiC (Sequence Likelihood Calibration) within a unified theoretical framework, observing that they differ primarily in their loss formulations:
- DPO: Employs a logistic (sigmoid cross-entropy) loss, corresponding to probabilistic preference modeling.
- SLiC: Uses a margin-based hinge loss, akin to an SVM objective, with or without normalization (both loss families are sketched at the end of this section).
- These families can be generalized and cross-applied, supporting robust preference optimization across a spectrum of loss functions and regularization schemes.
This unified view allows rigorously principled comparisons and hybrid methods, as loss choice directly controls the tradeoff between probabilistic modeling and margin-enforced separation.
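As a concrete illustration of this unified view, the sketch below (our own, with hypothetical function and argument names) applies both loss families to the same $\beta$-scaled log-ratio margin used in the DPO loss; the exact normalization used for SLiC in the paper may differ.

```python
import torch
import torch.nn.functional as F

def unified_preference_loss(margin, loss_type="sigmoid", gamma=1.0):
    """Unified preference-loss sketch over a shared margin (illustrative).

    margin: tensor of beta-scaled differences of policy/reference log-ratios
            between the preferred and dispreferred responses (as in DPO).
    loss_type: "sigmoid" recovers the DPO logistic loss;
               "hinge" gives a SLiC-style loss with target margin gamma.
    """
    if loss_type == "sigmoid":
        # Probabilistic preference modeling (logistic loss).
        return -F.logsigmoid(margin).mean()
    if loss_type == "hinge":
        # Margin-enforced separation, SVM-style.
        return torch.clamp(gamma - margin, min=0).mean()
    raise ValueError(f"unknown loss_type: {loss_type}")
```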
5. Empirical Validation and Performance Comparison
The RSO methodology is empirically evaluated across summarization and dialogue tasks using LLMs of varying sizes (e.g., T5-large and T5-XXL), employing both model-based (e.g., PaLM 2) and human evaluations:
- Summarization: On Reddit TL;DR, RSO achieves higher proxy and human-judge win rates compared to DPO and SLiC (e.g., RSO 93.36% vs DPO 84.35% on the proxy reward for T5-XXL).
- Dialogue alignment: On AnthropicHH, RSO achieves up to 40.98% win rate (AutoSxS; PaLM 2-L), surpassing DPO and SLiC baselines.
- Human Preferences: RSO-generated policies receive the highest preference choice fractions and qualitative scores from human raters.
- Ablation and Transfer: RSO demonstrates robust performance with varying sample/ranking strategies and transfers well to out-of-domain tasks such as CNN/DailyMail summarization.
- Computational cost: The rejection sampling and extra candidate evaluation incur negligible overhead compared to total training, and scale efficiently with parallelization.
| | RLHF | SLiC | DPO | RSO (Ours) |
|---|---|---|---|---|
| Reward Model | Yes | Yes (ranking) | No | Yes (pairwise) |
| Sampling | Online | SFT | None | SFT + rejection |
| Pair Source | On-policy | SFT decodes | Human data | Rejection-sampled |
| Loss | RL/PPO | Contrastive | Logistic (DPO) | Any (unified) |
| Alignment | Optimal | Misspecified | Misspecified | Closer to optimal |
| Stability/Scalability | Poor/Costly | High | High | High |
| Performance | Good | Good | Good | Best |
6. Implications and Future Research
RSO broadens the practical utility of direct preference optimization by:
- Scalability: Extending naturally to large LLMs and parallelized, batched offline pipelines.
- Generalizability: Applicability across a wide variety of sequence modeling and preference-task domains (summarization, dialogue, open-ended generation).
- Online and Multi-objective Settings: Potential generalization to online variants, fairness- or harmlessness-aware preference sampling, and multi-objective fine-tuning.
- Reward Model Improvements: Future improvements may arise from joint or active reward model learning within the RSO pipeline.
- Non-human/Efficient Feedback: Constructing preference data via AI feedback (AIF) processes is compatible with RSO, enabling alignment with limited human annotation.
7. Summary and Comparative Perspective
Statistical Rejection Sampling Optimization advances the DPO framework by rectifying the critical distributional mismatch between preference pair construction and optimal model training. It develops a theoretically grounded yet practical scheme for preference pair generation, empirically validated on summarization and dialogue benchmarks. RSO achieves improvements over previous methods (RLHF, SLiC, DPO) in alignment accuracy and stability, while maintaining scalable, straightforward workflows with modest computational requirements. The unified preference modeling view adds clarity and extensibility for future investigations into efficient and reliable preference alignment.