Flipping-Aware Direct Preference Optimization
- The paper introduces FA-DPO, which robustly integrates instance-dependent flipping probabilities into direct preference optimization to correct label noise in human-annotated preference data.
- It employs a two-stage generative model using a Bradley-Terry framework and logistic regression on features like response length, perplexity, and reward margin.
- Empirical results on datasets such as Ultrafeedback demonstrate that FA-DPO achieves superior accuracy and win rates across various flipping ratios.
Flipping-Aware Direct Preference Optimization (FA-DPO) is an algorithmic framework designed to make reinforcement learning with human feedback (RLHF) robust to preference flipping in human-annotated datasets. Preference flipping refers to the corruption of pairwise preference labels due to various external factors after the initial (intended) human annotation. FA-DPO models such corruption instance-dependently, explicitly estimates flipping probabilities, and integrates this correction into the Direct Preference Optimization (DPO) objective, leading to significant gains in robustness for model alignment tasks (Xu et al., 30 Nov 2025).
1. Flipped Preference Noise: Modeling and Problem Setting
The foundation of FA-DPO is a two-stage generative model for human preference data. First, genuine human intent is modeled by a Bradley-Terry (BT) model. For a prompt $x$ with candidate responses $y_w$ (winner) and $y_l$ (loser), the true (clean) preference probability is

$$P^*(y_w \succ y_l \mid x) = \sigma\big(r^*(x, y_w) - r^*(x, y_l)\big),$$

where $r^*$ is an unobserved reward function and $\sigma$ is the logistic sigmoid.
Subsequently, an external corruption process may flip the label with instance-dependent probability $\epsilon(z)$, where $z = (x, y_w, y_l)$ denotes the data triplet. The observed (possibly flipped) preference probability is

$$P_{\mathrm{obs}}(y_w \succ y_l \mid x) = \big(1 - \epsilon(z)\big)\, P^*(y_w \succ y_l \mid x) + \epsilon(z)\,\big(1 - P^*(y_w \succ y_l \mid x)\big).$$

Equivalently,

$$P_{\mathrm{obs}}(y_w \succ y_l \mid x) = \epsilon(z) + \big(1 - 2\epsilon(z)\big)\, P^*(y_w \succ y_l \mid x).$$
This model captures both intent and systematic, instance-dependent annotation noise.
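The two-stage model above can be sketched numerically. This is a minimal illustration, not code from the paper; the reward values and the 20% flip rate below are invented for the example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def clean_pref_prob(r_w, r_l):
    # Stage one (Bradley-Terry): probability the annotator truly prefers y_w over y_l.
    return sigmoid(r_w - r_l)

def observed_pref_prob(p_clean, eps):
    # Stage two: the label survives with probability (1 - eps) and flips with probability eps.
    # Algebraically equal to eps + (1 - 2 * eps) * p_clean.
    return (1.0 - eps) * p_clean + eps * (1.0 - p_clean)
```

For example, a clean preference probability of 0.8 seen through a 20% flip rate is observed as 0.68, illustrating how flipping pulls observed labels toward chance.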
2. Instance-Dependent Flipping Probability Estimation
FA-DPO parameterizes the flipping probability as a logistic function of features $\psi(z)$:

$$\epsilon_\phi(z) = \sigma\big(\phi^\top \psi(z)\big),$$

with trainable parameters $\phi$.
The feature vector concatenates three groups of permutation-equivariant statistics:
- Response length features: the lengths of the two responses $y_w$ and $y_l$
- Perplexity features (with respect to the current policy $\pi_\theta$): the perplexities of $y_w$ and $y_l$ under $\pi_\theta$
- Reward margin features (with implicit reward $\hat r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$): the margin $\hat r_\theta(x, y_w) - \hat r_\theta(x, y_l)$

The full feature vector $\psi(z)$ is the concatenation of these three groups.
The assumptions required for well-posedness include boundedness of the features $\psi(z)$ and the parameters $\phi$, and positive definiteness of the feature second-moment matrix $\mathbb{E}\big[\psi(z)\,\psi(z)^\top\big]$ for the logistic regression subproblem.
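A minimal sketch of the flip-probability estimator follows. The concrete feature list here (raw lengths, perplexities, margin, and a bias term) is a hypothetical concretization of the three groups; the exact statistics used in the paper may differ.

```python
import numpy as np

def flip_probability(phi, psi):
    # eps_phi(z) = sigmoid(phi^T psi(z)); stays strictly inside (0, 1).
    return 1.0 / (1.0 + np.exp(-(phi @ psi)))

def features(len_w, len_l, ppl_w, ppl_l, margin):
    # Hypothetical feature vector: response lengths, policy perplexities,
    # implicit-reward margin, plus a bias term (assumed for illustration).
    return np.array([len_w, len_l, ppl_w, ppl_l, margin, 1.0])
```

With untrained parameters ($\phi = 0$) every sample receives $\epsilon_\phi = 0.5$; fitting $\phi$ then pushes suspicious samples toward higher predicted flip rates and clean samples toward lower ones.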
3. Robust Objective: The FA-DPO Loss
FA-DPO generalizes the original DPO loss to account for observed label flipping via a corrupted likelihood:
- Standard DPO uses

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{z \sim \mathcal{D}}\big[\log \sigma\big(h_\theta(z)\big)\big],$$

with

$$h_\theta(z) = \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}.$$
- FA-DPO on noisy data replaces the clean probability $\sigma(h_\theta(z))$ with the corrupted likelihood

$$p_{\theta,\phi}(z) = \big(1 - \epsilon_\phi(z)\big)\,\sigma\big(h_\theta(z)\big) + \epsilon_\phi(z)\,\big(1 - \sigma\big(h_\theta(z)\big)\big),$$

leading to

$$\mathcal{L}_{\mathrm{FA\text{-}DPO}}(\theta, \phi) = -\,\mathbb{E}_{z \sim \mathcal{D}}\big[\log p_{\theta,\phi}(z)\big],$$

where $\epsilon_\phi(z) = \sigma\big(\phi^\top \psi(z)\big)$.
From an MLE perspective, this objective is exactly the negative log-likelihood of the observed (possibly corrupted) labels under the instance-dependent flip model.
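The per-pair loss is straightforward to compute from policy and reference log-probabilities. This is a hedged sketch: the log-probability values and $\beta = 0.1$ below are illustrative, not taken from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def implicit_margin(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    # h_theta(z): beta-scaled log-ratio margin between winner and loser.
    return beta * ((logp_w - logp_ref_w) - (logp_l - logp_ref_l))

def dpo_loss(h):
    # Standard DPO negative log-likelihood for one pair.
    return -math.log(sigmoid(h))

def fa_dpo_loss(h, eps):
    # Corrupted likelihood: mix the clean BT probability with its flip.
    s = sigmoid(h)
    return -math.log((1.0 - eps) * s + eps * (1.0 - s))
```

With `eps = 0` the two losses coincide; for a pair that looks mislabeled (large negative margin), a high predicted flip rate sharply reduces the penalty, which is the mechanism behind FA-DPO's robustness.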
4. Joint Optimization Procedure
Optimization is performed by alternately updating the noise-model parameters $\phi$ and the main policy parameters $\theta$. Each update computes the instance-dependent noisy-label loss on a minibatch and backpropagates through both $h_\theta$ and $\epsilon_\phi$.
Pseudocode for the alternating-update algorithm is as follows:
| Step | Update | Calculation |
|---|---|---|
| 1. Fix $\theta$, update $\phi$ | Minimize $\mathcal{L}_{\mathrm{FA\text{-}DPO}}(\theta, \phi)$ over $\phi$ | Compute $\epsilon_\phi(z)$, $p_{\theta,\phi}(z)$ for the minibatch; take a gradient step on $\phi$ |
| 2. Fix $\phi$, update $\theta$ | Minimize $\mathcal{L}_{\mathrm{FA\text{-}DPO}}(\theta, \phi)$ over $\theta$ | Recompute $p_{\theta,\phi}(z)$ with the new $\epsilon_\phi$, update $\theta$ |
Detailed sample-wise gradients:
- For $s_\theta(z) = \sigma(h_\theta(z))$, $\epsilon_\phi(z) = \sigma(\phi^\top \psi(z))$, and $p_{\theta,\phi}(z) = (1 - \epsilon_\phi(z))\, s_\theta(z) + \epsilon_\phi(z)\,(1 - s_\theta(z))$, the gradients of the per-sample loss $\ell(z) = -\log p_{\theta,\phi}(z)$ are

$$\nabla_\theta \ell(z) = -\,\frac{\big(1 - 2\epsilon_\phi(z)\big)\, s_\theta(z)\big(1 - s_\theta(z)\big)}{p_{\theta,\phi}(z)}\,\nabla_\theta h_\theta(z), \qquad \nabla_\phi \ell(z) = -\,\frac{\big(1 - 2 s_\theta(z)\big)\, \epsilon_\phi(z)\big(1 - \epsilon_\phi(z)\big)}{p_{\theta,\phi}(z)}\,\psi(z).$$
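The sample-wise gradients can be checked in a toy instantiation. Everything below is assumed purely for illustration: a scalar policy surrogate $h_\theta(z) = \theta \cdot g(z)$, a two-dimensional feature vector, and a step size of 0.1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def loss(theta, phi, g, psi):
    # Per-sample FA-DPO loss under the toy model h_theta(z) = theta * g.
    s = sigmoid(theta * g)
    eps = sigmoid(psi @ phi)
    return -np.log((1.0 - eps) * s + eps * (1.0 - s))

def grad_theta(theta, phi, g, psi):
    s = sigmoid(theta * g)
    eps = sigmoid(psi @ phi)
    p = (1.0 - eps) * s + eps * (1.0 - s)
    return -(1.0 - 2.0 * eps) * s * (1.0 - s) / p * g      # chain rule through s

def grad_phi(theta, phi, g, psi):
    s = sigmoid(theta * g)
    eps = sigmoid(psi @ phi)
    p = (1.0 - eps) * s + eps * (1.0 - s)
    return -(1.0 - 2.0 * s) * eps * (1.0 - eps) / p * psi  # chain rule through eps

# One alternating round on a single sample (illustrative values):
theta, phi = 0.3, np.array([0.2, -0.1])
g, psi = 1.7, np.array([0.5, 1.2])
phi = phi - 0.1 * grad_phi(theta, phi, g, psi)        # step 1: fix theta, update phi
theta = theta - 0.1 * grad_theta(theta, phi, g, psi)  # step 2: fix phi, update theta
```

Both analytic gradients agree with central finite differences of the loss, which is a quick sanity check before scaling the update rule to a real policy.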
5. Theoretical Properties
FA-DPO provides strong guarantees under mild assumptions:
- Consistency: If the flip model exactly recovers the truth, i.e., $\epsilon_\phi(z) = \epsilon(z)$ with $\epsilon(z) < \tfrac{1}{2}$ everywhere, then the minimizer of $\mathcal{L}_{\mathrm{FA\text{-}DPO}}$ in $\theta$ coincides with the minimizer of the clean DPO loss.
Thus, FA-DPO finds the same optimum as if training on uncorrupted data.
- Special cases: If $\epsilon_\phi(z) \equiv 0$, the method reduces to standard DPO. With $\epsilon_\phi(z) \equiv \epsilon$ constant, FA-DPO reduces to per-sample corrections similar to cDPO and rDPO; the general instance-dependent model extends these with data-dependent flipping rates.
- Convergence: For fixed (or nearly accurate) $\theta$, the subproblem in $\phi$ is strongly convex and smooth under the boundedness and coverage assumptions. Gradient descent on $\phi$ achieves linear convergence in this regime.
6. Empirical Results and Method Comparisons
Experiments were conducted primarily on the Ultrafeedback dataset (≈61k pairs) and Anthropic HH_Golden (≈42.5k pairs). Flipping was simulated using the instance-dependent model fit to length features, with controlled overall ratios $\bar\epsilon \in \{0\%, 10\%, 20\%, 30\%, 40\%\}$. For each triplet, flips were sampled via $u \sim \mathrm{Uniform}(0, 1)$, triggering a swap of $(y_w, y_l)$ if $u < \epsilon(z)$.
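The sampling protocol above can be sketched in a few lines. The constant 20% rate and sample count below are illustrative; the experiments use instance-dependent rates.

```python
import numpy as np

def simulate_flips(eps, rng):
    # u ~ Uniform(0, 1) per triplet; swap (y_w, y_l) wherever u < eps.
    u = rng.uniform(size=len(eps))
    return u < eps

rng = np.random.default_rng(0)
flipped = simulate_flips(np.full(10_000, 0.2), rng)  # constant rate for illustration
```

The empirical flip ratio `flipped.mean()` concentrates near the target rate, which is how the controlled overall ratios are realized in simulation.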
Evaluation metrics:
- Prediction accuracy (ACC): fraction of pairs on clean test data for which the learned implicit reward ranks the labeled winner above the loser.
- Win rate (WR): Fraction of model generations preferred by an LLM-based judge (DeepSeek-V3 or GPT-4o), compared to an SFT reference model.
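The accuracy metric reduces to a sign check on implicit-reward margins. A minimal sketch, assuming ACC is defined as the fraction of pairs with positive margin (the margin values below are invented):

```python
def pairwise_accuracy(margins):
    # ACC: fraction of clean test pairs where the implicit reward ranks the
    # labeled winner above the loser, i.e., margin h_theta(z) > 0.
    return sum(m > 0 for m in margins) / len(margins)

acc = pairwise_accuracy([1.3, -0.2, 0.8, 0.05])  # 3 of 4 pairs ranked correctly
```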
Baselines in comparison included DPO (vanilla), SIMPO, ROPO, cDPO, and rDPO.
Quantitative results (reported as ACC % / WR %) on Pythia-1B/Ultrafeedback are summarized as follows:
| $\bar\epsilon$ (flip %) | DPO | cDPO | rDPO | FA-DPO |
|---|---|---|---|---|
| 0% | 68.2/67.0 | 67.2/66.7 | 70.1/65.6 | 73.1/66.9 |
| 10% | 61.8/60.1 | 62.6/67.8 | 65.9/56.7 | 67.2/68.5 |
| 20% | 58.6/56.4 | 59.0/66.0 | 61.7/54.9 | 69.8/66.9 |
| 30% | 55.4/64.7 | 56.7/66.6 | 56.9/57.9 | 71.0/69.8 |
| 40% | 51.9/64.3 | 53.9/67.1 | 47.7/57.8 | 70.8/69.8 |
FA-DPO achieves the highest accuracy at every flip ratio and maintains strong win rates throughout, with its advantage widening as the flip ratio grows.
Ablative findings:
- Warming up $\theta$ with several standard DPO steps before joint training leads to more stable convergence.
- Moderate alternation frequencies between $\theta$ and $\phi$ updates (e.g., 20 steps each) yield favorable trade-offs.
Flip model behavior analyses reveal:
- Predicted $\epsilon_\phi(z)$ correlates strongly with the true flip indicators.
- Flipped and non-flipped samples are distinctly separated in predicted $\epsilon_\phi(z)$.
- Samples with longer average response length and smaller margin have higher predicted flip rates.
7. Context and Relations to Other Approaches
FA-DPO extends direct preference optimization approaches to robustly handle label noise modeled as instance-dependent flips. Related variants include constant (cDPO) and additive (rDPO) noise-corrected DPO, but these lack the ability to adapt to input-dependent corruption patterns. SIMPO and ROPO represent additional variants for robust preference optimization. All aforementioned methods were evaluated as baselines in the original FA-DPO study (Xu et al., 30 Nov 2025). A plausible implication is that incorporating instance-dependent corruption models can further generalize to other forms of annotation noise within RLHF and enhance robustness for LLM alignment tasks.