Direct Preference Optimization (DPO) for Language Model Alignment
Direct Preference Optimization (DPO) is an offline methodology for aligning LLMs and other sequence-to-sequence models with human preferences by directly fitting model outputs to human or synthetic preference data. DPO operates by maximizing the likelihood of preferred responses over dispreferred ones using a preference modeling framework, typically removing the need for an explicit reward model or policy-gradient reinforcement learning. As a general strategy, DPO is recognized for improved stability, scalability, and sample efficiency over traditional RLHF approaches.
1. Core Principles of Direct Preference Optimization
DPO formulates preference alignment as a maximum likelihood estimation (MLE) problem, grounded in the Bradley-Terry model for pairwise preferences. For a prompt $x$ and a pair of responses $(y_1, y_2)$, the central objective models the probability that $y_1$ is preferred over $y_2$ given $x$ as:

$$P(y_1 \succ y_2 \mid x) = \frac{1}{1 + \exp\!\left(\beta \log \dfrac{\pi^*(y_2 \mid x)}{\pi_{\text{sft}}(y_2 \mid x)} - \beta \log \dfrac{\pi^*(y_1 \mid x)}{\pi_{\text{sft}}(y_1 \mid x)}\right)}$$

where:
- $\pi^*$ is the optimal (post-alignment) policy to be learned,
- $\pi_{\text{sft}}$ is the supervised fine-tuned (SFT; reference) policy,
- $\beta$ is a temperature-like hyperparameter balancing preference learning and regularization.
DPO fits a model by minimizing the cross-entropy (logistic loss) between observed preference pairs and this model-implied probability, which directly encodes the pairwise preference relationship:

$$\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{sft}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{sft}}(y_l \mid x)}\right)\right]$$

with $\sigma$ as the sigmoid function and $y_w$, $y_l$ denoting the preferred and dispreferred responses in each pair.
Unlike RLHF (which trains a reward model and then optimizes the policy with reinforcement learning, such as PPO), DPO bypasses the explicit reward model entirely and optimizes the policy directly with an offline, supervised objective on preference data.
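As a concrete illustration of the objective above, the following minimal PyTorch-style sketch computes the DPO loss from summed per-response log-probabilities; the function name, argument layout, and the assumption that log-probabilities are already summed over tokens are ours, not from the paper.

```python
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Minimal DPO loss sketch (illustrative, not the paper's code).

    Each argument is a 1-D tensor of summed log-probabilities log pi(y | x)
    for the preferred (w) or dispreferred (l) response, under either the
    trainable policy or the frozen SFT reference. `beta` is the
    temperature-like regularization hyperparameter.
    """
    # Beta-scaled log-ratios act as implicit rewards.
    chosen = beta * (policy_logp_w - ref_logp_w)
    rejected = beta * (policy_logp_l - ref_logp_l)
    # Logistic (sigmoid cross-entropy) loss on the reward margin.
    return -F.logsigmoid(chosen - rejected).mean()
```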
2. Distributional Mismatch and Data Sampling Limitations
A critical challenge in DPO arises due to distribution mismatch between the optimal policy and the data distribution from which preference pairs are obtained:
- Theoretically, optimality requires preference pairs to be sampled from the optimal policy $\pi^*$.
- In practice, human preferences are collected against SFT or previous policy models, not $\pi^*$, causing a mismatch.
- Consequently, the DPO-optimized policy may not align optimally with the intended preference distribution.
This mismatch can limit the achievable quality and faithfulness of model alignment.
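For context, the distribution DPO implicitly targets has the standard KL-regularized closed form (a well-known identity from the DPO derivation, restated here as background rather than drawn from this section):

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{sft}}(y \mid x)\, \exp\!\left(\frac{r(x, y)}{\beta}\right),$$

where $r$ is the underlying reward and $Z(x)$ is a normalizing constant. Sampling preference pairs from $\pi^*$ therefore means sampling from this reward-reweighted version of the SFT policy, which is precisely what the rejection-sampling scheme of the next section approximates.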
3. Statistical Rejection Sampling Optimization (RSO)
To address these limitations, the Statistical Rejection Sampling Optimization (RSO) framework is introduced, providing a principled mechanism to approximate sampling from the optimal policy via rejection sampling based on a trained reward model:
- Reward Model Training: A pairwise reward model $\rho_\psi(x, y_1, y_2)$ is trained on available human preference data to predict the probability that response $y_1$ is preferred over $y_2$ for a given prompt $x$.
- Pointwise Rewards: Given a baseline response $y_b$, each candidate response $y$ is assigned a pointwise score $r_\psi(x, y) = \operatorname{logit}\big(\rho_\psi(x, y, y_b)\big)$, i.e., the log-odds that $y$ beats the baseline under the pairwise model.
- SFT Sampling: For each prompt, multiple candidate responses are sampled from the SFT policy.
- Rejection Sampling: Each candidate $y$ is accepted into the RSO dataset with probability $\exp\big((r_\psi(x, y) - r_{\max})/\beta\big)$, where $\beta$ is the same regularization hyperparameter as in the DPO objective and $r_{\max}$ is the highest reward among the candidates for that prompt (a code sketch follows at the end of this section). This ensures accepted samples reflect the distribution of an optimal policy biased by the reward model.
- Pair Construction and Training: Accepted candidates are paired and labeled with the reward model, then used to train the final policy via DPO (or other preference-based) loss.
RSO bridges the distributional gap by constructing preference pairs that are statistically closer to those originating from the target post-alignment distribution.
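The following Python sketch illustrates the rejection-sampling step referenced in the list above; it assumes pointwise rewards $r_\psi(x, y)$ have already been computed for each candidate, and the function and variable names are illustrative rather than the paper's implementation.

```python
import math
import random

def rso_rejection_sample(candidates, rewards, beta=0.5):
    """Statistical rejection sampling over SFT decodes (illustrative sketch).

    candidates: responses decoded from the SFT policy for a single prompt.
    rewards:    pointwise rewards r_psi(x, y) for the same candidates.
    beta:       the same regularization temperature as in the DPO objective.

    Each candidate y is accepted with probability exp((r - r_max) / beta),
    approximating draws from pi_sft(y | x) * exp(r(x, y) / beta).
    """
    r_max = max(rewards)
    accepted = []
    for y, r in zip(candidates, rewards):
        if random.random() < math.exp((r - r_max) / beta):
            accepted.append((y, r))
    return accepted
```

Smaller values of `beta` make the filter more selective, concentrating accepted responses near the reward-model optimum; larger values keep the accepted set closer to the raw SFT samples.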
4. Unified Preference Modeling Framework
The paper integrates DPO and SLiC (Sequence Likelihood Calibration) within a unified theoretical framework, observing that they differ primarily in their loss formulations:
- DPO: Employs a logistic (sigmoid cross-entropy) loss, corresponding to probabilistic preference modeling.
- SLiC: Uses a margin-based hinge loss, akin to an SVM objective, with or without normalization (both loss families are sketched at the end of this section).
- These families can be generalized and cross-applied, supporting robust preference optimization across a spectrum of loss functions and regularization schemes.
This unified view allows rigorously principled comparisons and hybrid methods, as loss choice directly controls the tradeoff between probabilistic modeling and margin-enforced separation.
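As a concrete illustration of this unified view, the sketch below (our own, with hypothetical function and argument names) applies both loss families to the same $\beta$-scaled log-ratio margin used in the DPO loss; the exact normalization used for SLiC in the paper may differ.

```python
import torch
import torch.nn.functional as F

def unified_preference_loss(margin, loss_type="sigmoid", gamma=1.0):
    """Unified preference-loss sketch over a shared margin (illustrative).

    margin: tensor of beta-scaled differences of policy/reference log-ratios
            between the preferred and dispreferred responses (as in DPO).
    loss_type: "sigmoid" recovers the DPO logistic loss;
               "hinge" gives a SLiC-style loss with target margin gamma.
    """
    if loss_type == "sigmoid":
        # Probabilistic preference modeling (logistic loss).
        return -F.logsigmoid(margin).mean()
    if loss_type == "hinge":
        # Margin-enforced separation, SVM-style.
        return torch.clamp(gamma - margin, min=0).mean()
    raise ValueError(f"unknown loss_type: {loss_type}")
```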
5. Empirical Validation and Performance Comparison
The RSO methodology is empirically evaluated across summarization and dialogue tasks using LLMs of varying sizes (e.g., T5-large and T5-XXL), employing both model-based (e.g., PaLM 2) and human evaluations:
- Summarization: On Reddit TL;DR, RSO achieves higher proxy and human-judge win rates compared to DPO and SLiC (e.g., RSO 93.36% vs DPO 84.35% on the proxy reward for T5-XXL).
- Dialogue alignment: On AnthropicHH, RSO achieves up to 40.98% win rate (AutoSxS; PaLM 2-L), surpassing DPO and SLiC baselines.
- Human Preferences: RSO-generated policies receive the highest preference choice fractions and qualitative scores from human raters.
- Ablation and Transfer: RSO demonstrates robust performance with varying sample/ranking strategies and transfers well to out-of-domain tasks such as CNN/DailyMail summarization.
- Computational cost: The rejection sampling and extra candidate evaluation incur negligible overhead compared to total training, and scale efficiently with parallelization.
| | RLHF | SLiC | DPO | RSO (Ours) |
|---|---|---|---|---|
| Reward Model | Yes | Yes (ranking) | No | Yes (pairwise) |
| Sampling | Online | SFT | None | SFT + rejection |
| Pair Source | On-policy | SFT decodes | Human data | Rejection-sampled |
| Loss | RL/PPO | Contrastive | Logistic (DPO) | Any (unified) |
| Alignment | Optimal | Misspecified | Misspecified | Closer to optimal |
| Stability/Scalability | Poor/Costly | High | High | High |
| Performance | Good | Good | Good | Best |
6. Implications and Future Research
RSO broadens the practical utility of direct preference optimization by:
- Scalability: Extending naturally to large LLMs and parallelized, batched offline pipelines.
- Generalizability: Applicability across a wide variety of sequence modeling and preference-task domains (summarization, dialogue, open-ended generation).
- Online and Multi-objective Settings: Potential generalization to online variants, fairness- or harmlessness-aware preference sampling, and multi-objective fine-tuning.
- Reward Model Improvements: Future improvements may arise from joint or active reward model learning within the RSO pipeline.
- Non-human/Efficient Feedback: Constructing preference data via AI feedback (AIF) processes is compatible with RSO, enabling alignment with limited human annotation.
7. Summary and Comparative Perspective
Statistical Rejection Sampling Optimization advances the DPO framework by rectifying the critical distributional mismatch between preference pair construction and optimal model training. It develops a theoretically grounded yet practical scheme for preference pair generation, empirically validated on summarization and dialogue benchmarks. RSO achieves improvements over previous methods (RLHF, SLiC, DPO) in alignment accuracy and stability, while maintaining scalable, straightforward workflows with modest computational requirements. The unified preference modeling view adds clarity and extensibility for future investigations into efficient and reliable preference alignment.