PRO: Proximalized Preference Optimization
- PRO is a unified algorithm that fine-tunes large language models with diverse feedback types, ensuring well-posed optimization through a theoretically grounded regularizer.
- It leverages a mass-covering forward-KL regularizer and a hyper-response approximation to address likelihood underdetermination and mitigate reward-hacking common in DPO frameworks.
- Empirical evaluations on benchmark datasets demonstrate that PRO achieves stable alignment with minimal pathological behaviors across pairwise, binary, and scalar feedback.
Proximalized Preference Optimization (PRO) is a unified algorithm for fine-tuning LLMs using preference feedback of diverse types—pairwise, binary, or scalar—while explicitly addressing pathologies inherent in contrastive alignment objectives such as Direct Preference Optimization (DPO). PRO reintroduces a theoretically grounded regularization term omitted in earlier DPO frameworks, thereby guaranteeing well-posedness of the optimization and empirically eliminating reward-hacking behaviors. This method leverages a mass-covering forward-KL regularizer and a scalable hyper-response approximation to achieve efficient, stable alignment at computational parity with standard direct alignment techniques (Guo et al., 29 May 2025).
1. Theoretical Background and Decoupled Reformulation
Direct Preference Optimization (DPO) aligns LLMs by maximizing the likelihood difference between preferred ($y_w$) and dispreferred ($y_l$) responses, relative to a reference model $\pi_{\mathrm{ref}}$ and controlled by a scaling parameter $\beta$. Specifically, DPO defines a parametric reward

$$r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$$

and optimizes

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\big( r_\theta(x, y_w) - r_\theta(x, y_l) \big) \right],$$

where $\sigma$ denotes the logistic sigmoid. However, DPO only constrains the relative difference in log-likelihoods, rendering its solution underdetermined with respect to absolute scales, a phenomenon termed "likelihood underdetermination". This underdetermination enables reward hacking, manifest as degenerate output distributions (e.g., excessively long or short responses) that exploit the objective's symmetries.
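As a concrete reference point, the DPO reward and loss can be computed directly from sequence-level log-probabilities. A minimal pure-Python sketch (the function name and toy log-prob values are illustrative, not from the paper):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Single-pair DPO loss: -log sigmoid(margin), where the margin is
    beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))]."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy sequence-level log-probabilities for one (prompt, y_w, y_l) triple.
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.5)

# Only the difference of log-likelihood ratios matters: shifting both
# policy log-probs by the same constant leaves the loss unchanged,
# which is exactly the underdetermination discussed in the text.
shifted = dpo_loss(-10.0 + 5.0, -12.0 + 5.0, -11.0, -11.5)
```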
A key theoretical insight is that the population (infinite-data) DPO loss can be decomposed as

$$\mathcal{L}_{\mathrm{DPO}} = -\beta \sum_{y} \hat\mu(y)\, \hat s(y) \log \pi_\theta(y \mid x) + \frac{1}{2} \sum_{y_1, y_2} \mu(y_1)\, \mu(y_2)\, \mathrm{KL}\!\left[\mathrm{Bern}(\tfrac{1}{2}) \,\middle\|\, \mathrm{Bern}\big(\sigma(r_\theta(x, y_1) - r_\theta(x, y_2))\big)\right],$$

where $\hat s$ is the empirical preference score and the second (regularizer) term is the KL divergence between a uniform Bernoulli and the model's predicted pairwise preference, summed over all response pairs. Standard DPO in practice omits this regularizer, permitting likelihood drift.
2. The PRO Objective and Hyper-Response Approximation
PRO reinstates the full regularizer with a tunable tradeoff $\alpha$ and generalizes the objective to arbitrary feedback modalities:

$$\mathcal{L}_{\mathrm{PRO}}(\theta) = -\beta \sum_{y} \hat\mu(y)\, \hat s(y) \log \pi_\theta(y \mid x) + \frac{\alpha}{2} \sum_{y_1, y_2} \mu(y_1)\, \mu(y_2)\, \mathrm{KL}\!\left[\mathrm{Bern}(\tfrac{1}{2}) \,\middle\|\, \mathrm{Bern}\big(\sigma(r_\theta(x, y_1) - r_\theta(x, y_2))\big)\right].$$

Here, $\hat\mu$ and $\hat s$ denote the empirical response distribution and score, instantiated differently for each feedback type: for pairwise, $\hat\mu(y_w) = \hat\mu(y_l) = \tfrac{1}{2}$, $\hat s(y_w) = +\tfrac{1}{2}$, $\hat s(y_l) = -\tfrac{1}{2}$; for binary feedback $(y, b)$, $\hat\mu(y) = 1$, $\hat s(y) = b - \bar b$; for scalar feedback $\{(y_i, s_i)\}_{i=1}^{N}$, $\hat\mu(y_i) = 1/N$, $\hat s(y_i) = s_i - \bar s$.
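The three instantiations can be made concrete with a small helper. The dictionary-based interface and key names below are illustrative assumptions, but the $\hat\mu$ and $\hat s$ values follow the conventions stated above:

```python
def empirical_targets(feedback):
    """Build (mu_hat, s_hat) dicts for one prompt's labeled responses,
    following the pairwise / binary / scalar conventions in the text."""
    kind = feedback["kind"]
    if kind == "pairwise":
        # Two responses (y_w, y_l): equal mass, antisymmetric scores.
        mu = {"y_w": 0.5, "y_l": 0.5}
        s = {"y_w": +0.5, "y_l": -0.5}
    elif kind == "binary":
        # Single (y, b) with b in {0, 1}; score is centered by the batch mean.
        mu = {"y": 1.0}
        s = {"y": feedback["b"] - feedback["mean_b"]}
    elif kind == "scalar":
        # N scored responses: uniform mass, mean-centered scores.
        scores = feedback["scores"]
        n = len(scores)
        mean_s = sum(scores) / n
        mu = {i: 1.0 / n for i in range(n)}
        s = {i: scores[i] - mean_s for i in range(n)}
    else:
        raise ValueError(f"unknown feedback kind: {kind}")
    return mu, s
```

Note that in every case the scores are centered, so the pointwise "push up / push down" pressures on log-likelihoods balance across the labeled set.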
To avoid intractable summations over all possible responses, PRO introduces a "hyper-response" $\mathcal{G}$ to collectively represent unlabeled response mass. The regularizer is then computed over the augmented response set $\mathcal{Y}_{\mathcal{G}} = \{y_1, \ldots, y_N, \mathcal{G}\}$, enabling efficient $O\!\big((N+1)^2\big)$ evaluation per prompt, where $N$ is the number of labeled responses.
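One plausible instantiation, assuming the hyper-response simply absorbs all unlabeled probability mass (the paper's exact construction may differ), gives $\mathcal{G}$ a well-defined reward of its own:

```python
import math

def hyper_response_reward(theta_probs, ref_probs, beta=0.1):
    """Reward of the hyper-response G under the assumption that G absorbs
    all unlabeled mass: pi(G|x) = 1 - sum of labeled response probabilities."""
    pg_theta = 1.0 - sum(theta_probs)
    pg_ref = 1.0 - sum(ref_probs)
    return beta * math.log(pg_theta / pg_ref)

# With N labeled responses, the pairwise regularizer runs over the
# augmented set {y_1, ..., y_N, G}: (N + 1)**2 ordered pairs per prompt,
# instead of a sum over the full (intractable) response space.
reward_g = hyper_response_reward([0.3, 0.2], [0.25, 0.25])
```

Because $\pi_\theta(\mathcal{G} \mid x)$ depends on the total labeled mass, this reward reacts to absolute likelihood shifts that the pairwise DPO margin cannot see.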
3. Unified Algorithmic Implementation
PRO provides a single pseudocode applicable to pairwise, binary, and scalar feedback by appropriately configuring empirical distributions and scores.
```
initialize θ ← θ_ref
for epoch in 1..T:
    for minibatch B ⊆ D:
        for example in B:
            if pairwise:                      # example = (x, y_w, y_l)
                μ̂(y_w) = μ̂(y_l) = ½;   ŝ(y_w) = +½,  ŝ(y_l) = −½
            elif binary:                      # example = (x, y, b)
                μ̂(y) = 1;   ŝ(y) = b − mean_b
            elif scalar:                      # example = (x, {y_i, s_i})
                μ̂(y_i) = 1/N;   ŝ(y_i) = s_i − mean_s
            define hyper-response 𝒢 = "all other responses"
            compute μ(𝒢), π_θ(𝒢|x), π_ref(𝒢|x)
        # optimizer term
        L_opt = −β ∑_{y ∈ labeled} μ̂(y) · ŝ(y) · log π_θ(y|x)
        # regularizer over the augmented set 𝒴_𝒢
        L_reg = (α/2) ∑_{y₁,y₂ ∈ 𝒴_𝒢} μ(y₁) μ(y₂) · KL[Bern(½) ‖ Bern(σ(r_θ(y₁) − r_θ(y₂)))]
        ℓ_total = L_opt + L_reg
        θ ← θ − η ∇_θ ℓ_total
return π_θ
```
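Assuming a toy setting where whole responses form a small categorical "vocabulary" (softmax over three response-level logits, with the third slot playing the hyper-response 𝒢), the pairwise case of this loop can be sketched in runnable Python. The uniform pair weights μ are an illustrative simplification, not the paper's exact choice:

```python
import math

def softmax(logits):
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def kl_half_vs_bern(p):
    """KL( Bern(1/2) || Bern(p) )."""
    return 0.5 * math.log(0.5 / p) + 0.5 * math.log(0.5 / (1.0 - p))

def pro_loss_pairwise(theta_logits, ref_logits, beta=0.1, alpha=1.0):
    """PRO loss for one pairwise example over {y_w, y_l, G}; G is the
    hyper-response absorbing all remaining probability mass."""
    pi, ref = softmax(theta_logits), softmax(ref_logits)
    r = {k: beta * math.log(pi[k] / ref[k]) for k in pi}  # r(y) = β log π/π_ref
    # Optimizer term: -β * Σ μ̂(y) ŝ(y) log π(y|x), labeled responses only.
    mu_hat = {"y_w": 0.5, "y_l": 0.5}
    s_hat = {"y_w": +0.5, "y_l": -0.5}
    l_opt = -beta * sum(mu_hat[y] * s_hat[y] * math.log(pi[y]) for y in mu_hat)
    # Regularizer over the augmented set; uniform μ for illustration.
    mu = {k: 1.0 / 3.0 for k in pi}
    l_reg = 0.5 * alpha * sum(
        mu[a] * mu[b] * kl_half_vs_bern(sigmoid(r[a] - r[b]))
        for a in mu for b in mu)
    return l_opt + l_reg

uniform = {"y_w": 0.0, "y_l": 0.0, "G": 0.0}
loss_at_ref = pro_loss_pairwise(uniform, uniform)  # both terms vanish here
```

At θ = θ_ref with balanced scores, both terms are zero, so the reference policy is a natural anchor; moving mass toward y_w lowers the optimizer term while the regularizer charges a proximal penalty.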
The boundedness of the weights placed on each log-likelihood term (via the normalized distribution $\hat\mu$ and centered scores $\hat s$) and the proximal nature of the regularizer contrast with the unbounded rewards and reverse-KL regularization of RLHF/PPO variants.
4. Resolution of Likelihood Underdetermination and Reward Exploitation
Likelihood underdetermination is the ambiguity whereby only relative, not absolute, log-likelihoods between responses affect the alignment loss, permitting transformations that preserve differences but disturb normalization or expected response behaviors. Omitting the regularizer, as in standard DPO, renders this issue inevitable.
Including the complete regularizer resolves the underdetermination. A KKT analysis of the PRO optimum shows that, under mild assumptions, the strict monotonicity of the sigmoid uniquely determines the log-likelihood offset for each prompt $x$, with $\pi_\theta(\cdot \mid x)$ normalized over all responses. This guarantees that absolute likelihoods are set unambiguously, removing the floating scale and, consequently, reward hacking such as length exploitation.
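A small numeric check of this argument, under the assumption that the hyper-response carries the residual probability mass (toy numbers, not from the paper): scaling both labeled likelihoods by the same factor preserves every pairwise DPO margin between labeled responses, yet the regularizer still moves, because mass flows out of 𝒢.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def kl_half(p):
    """KL( Bern(1/2) || Bern(p) )."""
    return 0.5 * math.log(0.5 / p) + 0.5 * math.log(0.5 / (1.0 - p))

def regularizer(labeled, ref_labeled, beta=0.1):
    """Forward-KL pairwise regularizer over labeled responses plus the
    hyper-response (residual mass), with uniform pair weights."""
    pi = labeled + [1.0 - sum(labeled)]
    ref = ref_labeled + [1.0 - sum(ref_labeled)]
    r = [beta * math.log(a / b) for a, b in zip(pi, ref)]
    n = len(r)
    return sum(kl_half(sigmoid(r[i] - r[j]))
               for i in range(n) for j in range(n)) / (n * n)

ref = [0.2, 0.1]
r_base = regularizer([0.2, 0.1], ref)    # identical to reference: zero
r_scaled = regularizer([0.4, 0.2], ref)  # same labeled margins, mass drained from G
```

The scaled policy would be indistinguishable to the pure contrastive loss, but the regularizer penalizes it, so the optimum pins absolute likelihoods rather than just their differences.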
5. Empirical Results and Comparative Evaluation
PRO was evaluated on benchmark pairwise, binary, and scalar feedback datasets. Key results include:
- On Anthropic-HH (pairwise, Pythia-6.9B), DPO induces +150% length increase and a -10% win-rate drop versus the preferred baseline, while PRO variants exhibit negligible length change and maintain or improve win-rate; KTO shows +80% length and -6.6% win-rate, NCA matches PRO performance.
- On UltraFeedback (scalar, Mistral-7B-sft), across downstream QA tasks (AlpacaEval 2, MT-Bench, ARC, IFEval, TruthfulQA, GPQA): PRO-B and PRO-P attain the highest average ranks (1.7 and 2.0, respectively), outperforming DPO, KTO, NCA, and SFT.
- For imbalanced binary feedback (1% desired labels), PRO-B maintains a >50% win rate when its tradeoff parameter is tuned appropriately, while DPO and KTO collapse.
- With more suboptimal examples in scalar feedback (N=4), PRO-S further improves QA benchmark scores.
These results indicate that PRO eliminates reward-hacking and improves robustness across feedback types, with computational costs matched to DPO due to the hyper-response regularizer.
Summary on Anthropic-HH (pairwise, Pythia-6.9B):

| Method | ΔLength (%) | ΔWin-rate (%) |
|---|---|---|
| DPO | +150 | −10 |
| PRO-P | +0 | +0 |
| PRO-B | +1 | +1 |
| KTO | +80 | −6.6 |
| NCA | +0 | +0 |
6. Significance, Limitations, and Future Directions
PRO provides a principled alignment mechanism combining pointwise optimization with a mass-covering, forward-KL regularizer, ensuring stable absolute likelihoods and suppressing pathological behaviors such as reward hacking. The hyper-response approximation renders PRO practically as efficient as DPO, and its design maintains bounded update magnitudes.
Ongoing research includes:
- Extending PRO to on-policy RLHF by learning the score model dynamically.
- Investigating alternative divergences, such as $f$-divergences, in the regularizer term.
- Analyzing the impact of PRO’s mass-covering regularizer on generation diversity and reasoning task exploration.
A plausible implication is that PRO's stability and unified structure may facilitate broader adoption for preference-based alignment in safety-critical and diverse-feedback scenarios (Guo et al., 29 May 2025).