
Flipping-Aware Direct Preference Optimization

Updated 7 December 2025
  • The paper introduces FA-DPO, which robustly integrates instance-dependent flipping probabilities into direct preference optimization to correct human-annotated noise.
  • It employs a two-stage generative model using a Bradley-Terry framework and logistic regression on features like response length, perplexity, and reward margin.
  • Empirical results on datasets such as Ultrafeedback demonstrate that FA-DPO achieves superior accuracy and win rates across various flipping ratios.

Flipping-Aware Direct Preference Optimization (FA-DPO) is an algorithmic framework designed to make reinforcement learning with human feedback (RLHF) robust to preference flipping in human-annotated datasets. Preference flipping refers to the corruption of pairwise preference labels due to various external factors after the initial (intended) human annotation. FA-DPO models such corruption instance-dependently, explicitly estimates flipping probabilities, and integrates this correction into the Direct Preference Optimization (DPO) objective, leading to significant gains in robustness for model alignment tasks (Xu et al., 30 Nov 2025).

1. Flipped Preference Noise: Modeling and Problem Setting

The foundation of FA-DPO is a two-stage generative model of human preference data. First, genuine human intent is modeled by a Bradley-Terry (BT) model. For prompt $x$ with candidate responses $y_w$ (winner) and $y_l$ (loser), the true (clean) preference probability is

$$p^*(y_w \succ y_l \mid x) = \sigma\bigl(r^*(x, y_w) - r^*(x, y_l)\bigr), \qquad \sigma(z) = \frac{1}{1+e^{-z}},$$

where $r^*$ is an unobserved reward function.

Subsequently, an external corruption process may flip the label with instance-dependent probability $\varepsilon_{\bm{x}}$, where $\bm{x} = (x, y_w, y_l)$ denotes the data triplet. The observed (possibly flipped) preference probability is

$$\tilde{p} = (1-\varepsilon_{\bm{x}})\, p^* + \varepsilon_{\bm{x}}\, (1-p^*) = p^* - \varepsilon_{\bm{x}} (2p^* - 1).$$

Equivalently,

$$\tilde{\mathbb{P}}\{y_w \succ y_l \mid x\} = (1-\varepsilon_{\bm{x}})\, \mathbb{P}\{y_w \succ y_l \mid x\} + \varepsilon_{\bm{x}}\, \mathbb{P}\{y_l \succ y_w \mid x\}.$$

This model captures both genuine intent and systematic, instance-dependent annotation noise.
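This two-stage process can be simulated directly. A minimal NumPy sketch (the reward values and flip rate below are hypothetical): stage one draws the intended BT label, stage two applies the corruption.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_observed_preference(r_w, r_l, eps, rng):
    """Stage 1: draw the intended Bradley-Terry label; stage 2: flip it w.p. eps."""
    p_clean = sigmoid(r_w - r_l)        # P(y_w > y_l) under the true reward
    label = rng.random() < p_clean      # intended annotation
    if rng.random() < eps:              # external corruption flips the label
        label = not label
    return label

rng = np.random.default_rng(0)
# Hypothetical rewards: strong true preference for y_w, flip probability 0.3.
# The observed agreement rate should land near (1-eps)*p + eps*(1-p) ~ 0.70.
labels = [sample_observed_preference(5.0, 0.0, 0.3, rng) for _ in range(10000)]
```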

2. Instance-Dependent Flipping Probability Estimation

FA-DPO parameterizes the flipping probability $\varepsilon_{\bm{x}}$ as a logistic function of features $h(\bm{x}) \in \mathbb{R}^d$:

$$\varepsilon_{\bm{x}} = \sigma\bigl(\langle \omega, h(\bm{x}) \rangle + \omega_0\bigr)$$

with trainable parameters $\omega, \omega_0$.

The feature vector $h(\bm{x})$ concatenates three groups of permutation-equivariant statistics:

  • Response length features:

$$h_{\text{len}}(\bm{x}) = \left[ \frac{|y_w| + |y_l|}{2}, \; \bigl|\, |y_w| - |y_l| \,\bigr| \right]^\top$$

  • Perplexity features (with respect to the current policy $\pi_\theta$):

$$h_{\text{ppl}}(\bm{x}) = \left[ \frac{\log \pi_\theta(y_w \mid x) + \log \pi_\theta(y_l \mid x)}{2}, \; \left| \log \frac{\pi_\theta(y_w \mid x)}{\pi_\theta(y_l \mid x)} \right| \right]^\top$$

  • Reward margin features (with implicit reward $\hat{r}_\theta = \beta \log(\pi_\theta / \pi_{\text{ref}})$):

$$h_{\text{margin}}(\bm{x}) = \left[ \frac{\hat{r}_\theta(x, y_w) + \hat{r}_\theta(x, y_l)}{2}, \; \bigl| \hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l) \bigr| \right]^\top$$

The full feature vector is $h(\bm{x}) = [h_{\text{len}}(\bm{x}), h_{\text{ppl}}(\bm{x}), h_{\text{margin}}(\bm{x}), 1]^\top$.

Well-posedness assumptions include boundedness of $h(\bm{x})$ and $\omega$, and positive definiteness of $\mathbb{E}[h(\bm{x})\, h(\bm{x})^\top]$ for the logistic-regression subproblem.
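The feature map can be sketched as follows; the numeric inputs (lengths, log-probabilities, implicit rewards) are placeholders. The point of the construction is permutation equivariance: $h$ is unchanged when the two responses are swapped.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def flip_features(len_w, len_l, logp_w, logp_l, r_w, r_l):
    """Build h(x): length, perplexity, and reward-margin statistics plus a constant."""
    h_len = [(len_w + len_l) / 2, abs(len_w - len_l)]
    h_ppl = [(logp_w + logp_l) / 2, abs(logp_w - logp_l)]
    h_margin = [(r_w + r_l) / 2, abs(r_w - r_l)]
    return np.array(h_len + h_ppl + h_margin + [1.0])

def flip_prob(h, omega, omega0=0.0):
    """eps_x = sigma(<omega, h> + omega0)."""
    return sigmoid(omega @ h + omega0)

# Placeholder statistics for one (y_w, y_l) pair and for the swapped pair.
h = flip_features(120, 40, -35.0, -20.0, 1.2, -0.4)
h_swapped = flip_features(40, 120, -20.0, -35.0, -0.4, 1.2)
```

With $\omega = 0$ the predicted flip probability is exactly $0.5$, which is the uninformative prior over flips before the noise model is trained.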

3. Robust Objective: The FA-DPO Loss

FA-DPO generalizes the original DPO loss to account for observed label flipping via a corrupted likelihood:

  • Standard DPO uses

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = - \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_{\text{clean}}} \left[ \log \sigma\bigl( \hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l) \bigr) \right],$$

with $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$.

  • FA-DPO on noisy data $\mathcal{D}_{\text{noisy}}$ replaces the clean probability with the corrupted likelihood

$$\tilde{p}_\theta(\bm{x}) = (1-\varepsilon_{\bm{x}})\, p_\theta + \varepsilon_{\bm{x}}\, (1-p_\theta),$$

leading to

$$\boxed{\; \mathcal{L}_{\text{FA-DPO}}(\theta, \omega) = -\mathbb{E}_{\bm{x} \sim \mathcal{D}_{\text{noisy}}} \left[ \log\bigl( (1-\varepsilon_{\bm{x}})\, p_\theta + \varepsilon_{\bm{x}}\, (1-p_\theta) \bigr) \right] \;}$$

where $p_\theta = \sigma\bigl( \hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l) \bigr)$.

From an MLE perspective, this objective is precisely the negative log-likelihood of the observed (possibly corrupted) labels under the instance-dependent flip model.
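A minimal NumPy sketch of the objective, with toy margins and flip probabilities; setting $\varepsilon = 0$ recovers the standard DPO loss.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fa_dpo_loss(margin, eps):
    """-log[(1-eps)*p + eps*(1-p)] with p = sigma(margin), margin = r_hat_w - r_hat_l."""
    p = sigmoid(margin)
    p_tilde = (1 - eps) * p + eps * (1 - p)
    return -np.log(p_tilde)

# Toy reward margins and per-pair flip probabilities (illustrative values only).
margins = np.array([2.0, -1.0, 0.5])
eps = np.array([0.1, 0.4, 0.0])
loss = fa_dpo_loss(margins, eps)
dpo_loss = fa_dpo_loss(margins, np.zeros_like(margins))   # eps = 0: standard DPO
```

Note how the second pair, where the policy disagrees with the observed label but the flip probability is high, is penalized far less than under vanilla DPO.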

4. Joint Optimization Procedure

Optimization alternates between updating the noise-model parameters $(\omega, \omega_0)$ and the policy parameters $\theta$. Each update evaluates the instance-dependent noisy-label loss on a minibatch and takes a gradient step on the parameter block currently being updated.

Pseudocode for the alternating-update algorithm is as follows:

| Step | Update | Calculation |
|------|--------|-------------|
| 1 | Fix $\theta$, update $\omega$ | Minimize $-\frac{1}{B} \sum_{i} \log\bigl((1-\varepsilon_i)p_i + \varepsilon_i(1-p_i)\bigr)$: compute $\varepsilon_i$, $p_i$ for the minibatch and take a gradient step on $\omega$ |
| 2 | Fix $\omega$, update $\theta$ | Minimize the same minibatch loss, recomputed with the updated $\omega$, and take a gradient step on $\theta$ |
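A runnable toy instantiation of the alternating schedule, using analytic per-sample gradients on a synthetic one-parameter "policy" and a two-feature flip model. All data is invented, and alternation runs with $N_\omega = N_\theta = 1$ for simplicity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy setup: theta is a single scalar margin parameter (z_i = sign_i * theta);
# omega weights a 2-d feature vector [h1, bias]. Flips correlate with h1.
rng = np.random.default_rng(0)
B = 1024
H = np.column_stack([rng.normal(size=B), np.ones(B)])
flipped = rng.random(B) < sigmoid(H[:, 0] - 1.5)   # ~20% flips, feature-dependent
sign = np.where(flipped, -1.0, 1.0)                # a flipped pair reverses the margin

def forward(theta, omega):
    p = sigmoid(sign * theta)            # p_theta toward the observed winner
    eps = sigmoid(H @ omega)             # predicted instance-dependent flip rate
    pt = (1 - eps) * p + eps * (1 - p)   # corrupted likelihood
    return p, eps, pt

theta, omega, lr = 0.5, np.zeros(2), 0.3   # theta warm-started away from the saddle at 0
for step in range(400):
    p, eps, pt = forward(theta, omega)
    if step % 2 == 0:   # step 1: fix theta, gradient step on omega
        g_w = np.mean(((2 * p - 1) / pt * eps * (1 - eps))[:, None] * H, axis=0)
        omega -= lr * g_w
    else:               # step 2: fix omega, gradient step on theta
        g_t = np.mean(-(1 - 2 * eps) / pt * p * (1 - p) * sign)
        theta -= lr * g_t
```

After training, the flip-model weight on the informative feature turns positive and the mean corrupted-likelihood loss drops below its $\log 2$ starting value.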

Detailed sample-wise gradients:

  • For $z = \hat{r}_\theta(x, y_w) - \hat{r}_\theta(x, y_l)$, $p = \sigma(z)$, and $\tilde{p} = (1-\varepsilon)p + \varepsilon(1-p)$, the per-sample gradients are

$$\frac{\partial}{\partial z}\bigl[-\log \tilde{p}\bigr] = -\frac{1-2\varepsilon}{\tilde{p}}\, p(1-p)$$

$$\nabla_\theta \ell = \frac{1-2\varepsilon}{\tilde{p}}\, p(1-p) \cdot \beta \bigl( \nabla_\theta \log \pi_\theta(y_l \mid x) - \nabla_\theta \log \pi_\theta(y_w \mid x) \bigr)$$

$$\frac{\partial}{\partial \varepsilon}\bigl[-\log \tilde{p}\bigr] = \frac{2p-1}{\tilde{p}}, \qquad \nabla_\omega \varepsilon = \varepsilon (1-\varepsilon)\, h(\bm{x})$$

$$\nabla_\omega \ell = \frac{2p-1}{\tilde{p}}\, \varepsilon(1-\varepsilon)\, h(\bm{x})$$
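These per-sample derivatives can be sanity-checked against central finite differences; the test point below is arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_log_ptilde(z, eps):
    """Per-sample FA-DPO loss: -log[(1-eps)*sigma(z) + eps*(1-sigma(z))]."""
    p = sigmoid(z)
    return -np.log((1 - eps) * p + eps * (1 - p))

z, eps = 1.3, 0.2                       # arbitrary test point
p = sigmoid(z)
pt = (1 - eps) * p + eps * (1 - p)

# Analytic per-sample derivatives.
dz_analytic = -(1 - 2 * eps) / pt * p * (1 - p)
deps_analytic = (2 * p - 1) / pt

# Central finite differences.
d = 1e-6
dz_numeric = (neg_log_ptilde(z + d, eps) - neg_log_ptilde(z - d, eps)) / (2 * d)
deps_numeric = (neg_log_ptilde(z, eps + d) - neg_log_ptilde(z, eps - d)) / (2 * d)
```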

5. Theoretical Properties

FA-DPO provides strong guarantees under mild assumptions:

  • Consistency: If the flip model exactly recovers the true $\varepsilon_{\bm{x}}$ with $\varepsilon_{\bm{x}} < 0.5$ everywhere, then

$$\arg\min_\theta \mathbb{E}_{\mathcal{D}_{\text{noisy}}}\bigl[-\log \tilde{p}_\theta\bigr] = \arg\min_\theta \mathbb{E}_{\mathcal{D}_{\text{clean}}}\bigl[-\log p_\theta\bigr].$$

Thus, FA-DPO finds the same optimum as if training on uncorrupted data.

  • Special cases: If $\varepsilon_{\bm{x}} \equiv 0$, the method reduces to standard DPO. With $\varepsilon_{\bm{x}}$ held constant, FA-DPO reduces to per-sample corrections similar to cDPO and rDPO, but in general it allows data-dependent flipping rates.
  • Convergence: For fixed (or nearly accurate) $p_\theta$, the subproblem in $\omega$ is strongly convex and smooth under the boundedness and coverage assumptions; gradient descent on $\omega$ achieves linear convergence in this regime.
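For intuition on the $\omega$-subproblem, a synthetic run of gradient descent with $p_\theta$ held fixed; the feature matrix and ground-truth flip weights below are invented for illustration, and the linear-convergence claim itself holds only under the stated conditions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic omega-subproblem: labels come from a ground-truth flip model and
# p_theta is frozen at a reasonably accurate value for each sample.
rng = np.random.default_rng(1)
n, d = 2000, 3
H = rng.normal(size=(n, d))                                 # full-coverage features
omega_true = np.array([0.8, -0.5, 0.2])                     # assumed true weights
eps_true = sigmoid(H @ omega_true - 1.5)                    # true flip rates, mostly < 0.5
flipped = rng.random(n) < eps_true
p = np.where(flipped, 0.1, 0.9)                             # fixed p_theta per sample

omega, lr, hist = np.zeros(d), 0.3, []
for _ in range(300):
    eps = sigmoid(H @ omega)
    pt = (1 - eps) * p + eps * (1 - p)
    hist.append(-np.mean(np.log(pt)))                       # corrupted-likelihood loss
    grad = np.mean(((2 * p - 1) / pt * eps * (1 - eps))[:, None] * H, axis=0)
    omega -= lr * grad
```

In this toy run the loss falls steadily from its $\log 2$ starting value and the learned $\omega$ aligns with the generating direction.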

6. Empirical Results and Method Comparisons

Experiments were conducted primarily on the Ultrafeedback dataset (≈61k pairs) and Anthropic HH_Golden (≈42.5k pairs). Flipping was simulated using the instance-dependent model fit to length features, with controlled overall ratios $\eta \in \{0\%, 10\%, 20\%, 30\%, 40\%\}$. For each triplet, $u \sim \mathrm{Uniform}[0,1]$ was sampled and the pair was swapped if $u < \varepsilon_{\bm{x}}$.
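The flipping protocol takes a few lines to reproduce; the per-pair probabilities below are placeholders chosen so the realized overall ratio lands near $\eta = 20\%$.

```python
import numpy as np

rng = np.random.default_rng(42)

def apply_flips(winners, losers, eps, rng):
    """Independently swap each (y_w, y_l) pair when u < eps_x, u ~ Uniform[0,1]."""
    u = rng.random(len(winners))
    flip = u < eps
    new_w = np.where(flip, losers, winners)
    new_l = np.where(flip, winners, losers)
    return new_w, new_l, flip

n = 100_000
eps = rng.uniform(0.0, 0.4, size=n)          # instance-dependent rates, mean 0.2
winners, losers = np.arange(n), np.arange(n) + n   # stand-ins for response ids
new_w, new_l, flip = apply_flips(winners, losers, eps, rng)
```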

Evaluation metrics:

  • Prediction accuracy (ACC): $\Pr\bigl(\hat{r}_\theta(x, y_w) > \hat{r}_\theta(x, y_l)\bigr)$ on clean test data.
  • Win rate (WR): Fraction of model generations preferred by an LLM-based judge (DeepSeek-V3 or GPT-4o), compared to an SFT reference model.

Baselines included vanilla DPO, SimPO, ROPO, cDPO, and rDPO.

Quantitative results on Pythia-1B/Ultrafeedback are summarized as follows (each cell reports ACC / WR, in %):

| $\eta$ (flip %) | DPO | cDPO | rDPO | FA-DPO |
|---|---|---|---|---|
| 0% | 68.2 / 67.0 | 67.2 / 66.7 | 70.1 / 65.6 | 73.1 / 66.9 |
| 10% | 61.8 / 60.1 | 62.6 / 67.8 | 65.9 / 56.7 | 67.2 / 68.5 |
| 20% | 58.6 / 56.4 | 59.0 / 66.0 | 61.7 / 54.9 | 69.8 / 66.9 |
| 30% | 55.4 / 64.7 | 56.7 / 66.6 | 56.9 / 57.9 | 71.0 / 69.8 |
| 40% | 51.9 / 64.3 | 53.9 / 67.1 | 47.7 / 57.8 | 70.8 / 69.8 |

FA-DPO achieves the highest accuracy and robust win rates across all flip ratios.

Ablative findings:

  • Warming up $\omega$ for several DPO steps before joint training leads to more stable convergence.
  • Moderate alternation frequencies $(N_\omega, N_\theta)$ (e.g., 20/20) yield favorable trade-offs.

Flip model behavior analyses reveal:

  • Predicted $\varepsilon$ correlates strongly with true flips ($R^2 \approx 0.75$).
  • Flipped and non-flipped samples are distinctly separated in predicted $\varepsilon$.
  • Samples with longer average response length and smaller margin have higher predicted flip rates.

7. Context and Relations to Other Approaches

FA-DPO extends direct preference optimization to robustly handle label noise modeled as instance-dependent flips. Related variants include constant (cDPO) and additive (rDPO) noise-corrected DPO, but these lack the ability to adapt to input-dependent corruption patterns. SimPO and ROPO represent additional robust preference-optimization variants. All of these methods were evaluated as baselines in the original FA-DPO study (Xu et al., 30 Nov 2025). A plausible implication is that incorporating instance-dependent corruption models can generalize to other forms of annotation noise within RLHF and enhance robustness for LLM alignment tasks.
