PRO: Proximalized Preference Optimization
- PRO is a unified algorithm that fine-tunes large language models with diverse feedback types, ensuring well-posed optimization through a theoretically grounded regularizer.
- It leverages a mass-covering forward-KL regularizer and a hyper-response approximation to address likelihood underdetermination and mitigate reward-hacking common in DPO frameworks.
- Empirical evaluations on benchmark datasets demonstrate that PRO achieves stable alignment with minimal pathological behaviors across pairwise, binary, and scalar feedback.
Proximalized Preference Optimization (PRO) is a unified algorithm for fine-tuning LLMs using preference feedback of diverse types—pairwise, binary, or scalar—while explicitly addressing pathologies inherent in contrastive alignment objectives such as Direct Preference Optimization (DPO). PRO reintroduces a theoretically grounded regularization term omitted in earlier DPO frameworks, thereby guaranteeing well-posedness of the optimization and empirically eliminating reward-hacking behaviors. This method leverages a mass-covering forward-KL regularizer and a scalable hyper-response approximation to achieve efficient, stable alignment at computational parity with standard direct alignment techniques (Guo et al., 29 May 2025).
1. Theoretical Background and Decoupled Reformulation
Direct Preference Optimization (DPO) aligns LLMs by maximizing the likelihood difference between preferred ($y_w$) and dispreferred ($y_l$) responses, relative to a reference model $\pi_{\mathrm{ref}}$ and controlled by a scaling parameter $\beta$. Specifically, DPO defines a parametric reward

$$r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}$$

and optimizes

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\big( r_\theta(x, y_w) - r_\theta(x, y_l) \big) \right],$$

where $\sigma$ denotes the logistic sigmoid. However, DPO only constrains the relative difference in log-likelihoods, rendering its solution underdetermined with respect to absolute scales, a phenomenon termed "likelihood underdetermination". This underdetermination enables reward hacking, manifest as degenerate output distributions (e.g., excessively long or short responses) that exploit the objective's symmetries.
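As a concrete reference point, the DPO reward and loss can be computed directly from sequence-level log-probabilities. A minimal pure-Python sketch (the function name and toy log-prob values are illustrative, not from the paper):

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Single-pair DPO loss: -log sigmoid(margin), where the margin is
    beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))]."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy sequence-level log-probabilities for one (prompt, y_w, y_l) triple.
loss = dpo_loss(logp_w=-10.0, logp_l=-12.0, ref_logp_w=-11.0, ref_logp_l=-11.5)

# Only the difference of log-likelihood ratios matters: shifting both
# policy log-probs by the same constant leaves the loss unchanged,
# which is exactly the underdetermination discussed in the text.
shifted = dpo_loss(-10.0 + 5.0, -12.0 + 5.0, -11.0, -11.5)
```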
A key theoretical insight is that the population (infinite-data) DPO loss can be decomposed as

$$\mathcal{L}_{\mathrm{DPO}} = -\beta \sum_{y} \hat\mu(y)\, \hat s(y) \log \pi_\theta(y \mid x) + \frac{1}{2} \sum_{y_1, y_2} \mu(y_1)\, \mu(y_2)\, \mathrm{KL}\!\left[\mathrm{Bern}(\tfrac{1}{2}) \,\middle\|\, \mathrm{Bern}\big(\sigma(r_\theta(x, y_1) - r_\theta(x, y_2))\big)\right],$$

where $\hat s$ is the empirical preference score and the second (regularizer) term is the KL divergence between a uniform Bernoulli and the model's predicted pairwise preference, summed over all response pairs. Standard DPO in practice omits this regularizer, permitting likelihood drift.
2. The PRO Objective and Hyper-Response Approximation
PRO reinstates the full regularizer with a tunable tradeoff $\alpha$ and generalizes the objective to arbitrary feedback modalities:

$$\mathcal{L}_{\mathrm{PRO}}(\theta) = -\beta \sum_{y} \hat\mu(y)\, \hat s(y) \log \pi_\theta(y \mid x) + \frac{\alpha}{2} \sum_{y_1, y_2} \mu(y_1)\, \mu(y_2)\, \mathrm{KL}\!\left[\mathrm{Bern}(\tfrac{1}{2}) \,\middle\|\, \mathrm{Bern}\big(\sigma(r_\theta(x, y_1) - r_\theta(x, y_2))\big)\right].$$

Here, $\hat\mu$ and $\hat s$ denote the empirical response distribution and score, instantiated differently for each feedback type: for pairwise, $\hat\mu(y_w) = \hat\mu(y_l) = \tfrac{1}{2}$, $\hat s(y_w) = +\tfrac{1}{2}$, $\hat s(y_l) = -\tfrac{1}{2}$; for binary feedback $(y, b)$, $\hat\mu(y) = 1$, $\hat s(y) = b - \bar b$; for scalar feedback $\{(y_i, s_i)\}_{i=1}^{N}$, $\hat\mu(y_i) = 1/N$, $\hat s(y_i) = s_i - \bar s$.
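The three instantiations can be made concrete with a small helper. The dictionary-based interface and key names below are illustrative assumptions, but the $\hat\mu$ and $\hat s$ values follow the conventions stated above:

```python
def empirical_targets(feedback):
    """Build (mu_hat, s_hat) dicts for one prompt's labeled responses,
    following the pairwise / binary / scalar conventions in the text."""
    kind = feedback["kind"]
    if kind == "pairwise":
        # Two responses (y_w, y_l): equal mass, antisymmetric scores.
        mu = {"y_w": 0.5, "y_l": 0.5}
        s = {"y_w": +0.5, "y_l": -0.5}
    elif kind == "binary":
        # Single (y, b) with b in {0, 1}; score is centered by the batch mean.
        mu = {"y": 1.0}
        s = {"y": feedback["b"] - feedback["mean_b"]}
    elif kind == "scalar":
        # N scored responses: uniform mass, mean-centered scores.
        scores = feedback["scores"]
        n = len(scores)
        mean_s = sum(scores) / n
        mu = {i: 1.0 / n for i in range(n)}
        s = {i: scores[i] - mean_s for i in range(n)}
    else:
        raise ValueError(f"unknown feedback kind: {kind}")
    return mu, s
```

Note that in every case the scores are centered, so the pointwise "push up / push down" pressures on log-likelihoods balance across the labeled set.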
To avoid intractable summations over all possible responses, PRO introduces a "hyper-response" $\mathcal{G}$ to collectively represent unlabeled response mass. The regularizer is then computed over the augmented response set $\mathcal{Y}_{\mathcal{G}} = \{y_1, \ldots, y_N, \mathcal{G}\}$, enabling efficient $O\!\big((N+1)^2\big)$ evaluation per prompt, where $N$ is the number of labeled responses.
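One plausible instantiation, assuming the hyper-response simply absorbs all unlabeled probability mass (the paper's exact construction may differ), gives $\mathcal{G}$ a well-defined reward of its own:

```python
import math

def hyper_response_reward(theta_probs, ref_probs, beta=0.1):
    """Reward of the hyper-response G under the assumption that G absorbs
    all unlabeled mass: pi(G|x) = 1 - sum of labeled response probabilities."""
    pg_theta = 1.0 - sum(theta_probs)
    pg_ref = 1.0 - sum(ref_probs)
    return beta * math.log(pg_theta / pg_ref)

# With N labeled responses, the pairwise regularizer runs over the
# augmented set {y_1, ..., y_N, G}: (N + 1)**2 ordered pairs per prompt,
# instead of a sum over the full (intractable) response space.
reward_g = hyper_response_reward([0.3, 0.2], [0.25, 0.25])
```

Because $\pi_\theta(\mathcal{G} \mid x)$ depends on the total labeled mass, this reward reacts to absolute likelihood shifts that the pairwise DPO margin cannot see.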
3. Unified Algorithmic Implementation
PRO provides a single pseudocode applicable to pairwise, binary, and scalar feedback by appropriately configuring empirical distributions and scores.
```
initialize θ ← θ_ref
for epoch in 1..T:
    for minibatch B ⊆ D:
        for example in B:
            if pairwise:                      # example = (x, y_w, y_l)
                μ̂(y_w) = μ̂(y_l) = ½;   ŝ(y_w) = +½,  ŝ(y_l) = −½
            elif binary:                      # example = (x, y, b)
                μ̂(y) = 1;   ŝ(y) = b − mean_b
            elif scalar:                      # example = (x, {y_i, s_i})
                μ̂(y_i) = 1/N;   ŝ(y_i) = s_i − mean_s
            define hyper-response 𝒢 = "all other responses"
            compute μ(𝒢), π_θ(𝒢|x), π_ref(𝒢|x)
        # optimizer term
        L_opt = −β ∑_{y ∈ labeled} μ̂(y) · ŝ(y) · log π_θ(y|x)
        # regularizer over the augmented set 𝒴_𝒢
        L_reg = (α/2) ∑_{y₁,y₂ ∈ 𝒴_𝒢} μ(y₁) μ(y₂) · KL[Bern(½) ‖ Bern(σ(r_θ(y₁) − r_θ(y₂)))]
        ℓ_total = L_opt + L_reg
        θ ← θ − η ∇_θ ℓ_total
return π_θ
```
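Assuming a toy setting where whole responses form a small categorical "vocabulary" (softmax over three response-level logits, with the third slot playing the hyper-response 𝒢), the pairwise case of this loop can be sketched in runnable Python. The uniform pair weights μ are an illustrative simplification, not the paper's exact choice:

```python
import math

def softmax(logits):
    m = max(logits.values())
    exps = {k: math.exp(v - m) for k, v in logits.items()}
    z = sum(exps.values())
    return {k: e / z for k, e in exps.items()}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def kl_half_vs_bern(p):
    """KL( Bern(1/2) || Bern(p) )."""
    return 0.5 * math.log(0.5 / p) + 0.5 * math.log(0.5 / (1.0 - p))

def pro_loss_pairwise(theta_logits, ref_logits, beta=0.1, alpha=1.0):
    """PRO loss for one pairwise example over {y_w, y_l, G}; G is the
    hyper-response absorbing all remaining probability mass."""
    pi, ref = softmax(theta_logits), softmax(ref_logits)
    r = {k: beta * math.log(pi[k] / ref[k]) for k in pi}  # r(y) = β log π/π_ref
    # Optimizer term: -β * Σ μ̂(y) ŝ(y) log π(y|x), labeled responses only.
    mu_hat = {"y_w": 0.5, "y_l": 0.5}
    s_hat = {"y_w": +0.5, "y_l": -0.5}
    l_opt = -beta * sum(mu_hat[y] * s_hat[y] * math.log(pi[y]) for y in mu_hat)
    # Regularizer over the augmented set; uniform μ for illustration.
    mu = {k: 1.0 / 3.0 for k in pi}
    l_reg = 0.5 * alpha * sum(
        mu[a] * mu[b] * kl_half_vs_bern(sigmoid(r[a] - r[b]))
        for a in mu for b in mu)
    return l_opt + l_reg

uniform = {"y_w": 0.0, "y_l": 0.0, "G": 0.0}
loss_at_ref = pro_loss_pairwise(uniform, uniform)  # both terms vanish here
```

At θ = θ_ref with balanced scores, both terms are zero, so the reference policy is a natural anchor; moving mass toward y_w lowers the optimizer term while the regularizer charges a proximal penalty.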
The boundedness of the weights placed on each log-likelihood term (via the normalized distribution $\hat\mu$ and centered scores $\hat s$) and the proximal nature of the regularizer contrast with the unbounded rewards and reverse-KL regularization of RLHF/PPO variants.
4. Resolution of Likelihood Underdetermination and Reward Exploitation
Likelihood underdetermination is the ambiguity whereby only relative, not absolute, log-likelihoods between responses affect the alignment loss, permitting transformations that preserve differences but disturb normalization or expected response behaviors. Omitting the regularizer, as in standard DPO, renders this issue inevitable.
Including the complete regularizer resolves the underdetermination. A KKT analysis of the PRO optimum shows that, under mild assumptions, the strict monotonicity of the sigmoid uniquely determines the log-likelihood offset for each prompt $x$, with $\pi_\theta(\cdot \mid x)$ normalized over all responses. This guarantees that absolute likelihoods are set unambiguously, removing the floating scale and, consequently, reward hacking such as length exploitation.
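A small numeric check of this argument, under the assumption that the hyper-response carries the residual probability mass (toy numbers, not from the paper): scaling both labeled likelihoods by the same factor preserves every pairwise DPO margin between labeled responses, yet the regularizer still moves, because mass flows out of 𝒢.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def kl_half(p):
    """KL( Bern(1/2) || Bern(p) )."""
    return 0.5 * math.log(0.5 / p) + 0.5 * math.log(0.5 / (1.0 - p))

def regularizer(labeled, ref_labeled, beta=0.1):
    """Forward-KL pairwise regularizer over labeled responses plus the
    hyper-response (residual mass), with uniform pair weights."""
    pi = labeled + [1.0 - sum(labeled)]
    ref = ref_labeled + [1.0 - sum(ref_labeled)]
    r = [beta * math.log(a / b) for a, b in zip(pi, ref)]
    n = len(r)
    return sum(kl_half(sigmoid(r[i] - r[j]))
               for i in range(n) for j in range(n)) / (n * n)

ref = [0.2, 0.1]
r_base = regularizer([0.2, 0.1], ref)    # identical to reference: zero
r_scaled = regularizer([0.4, 0.2], ref)  # same labeled margins, mass drained from G
```

The scaled policy would be indistinguishable to the pure contrastive loss, but the regularizer penalizes it, so the optimum pins absolute likelihoods rather than just their differences.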
5. Empirical Results and Comparative Evaluation
PRO was evaluated on benchmark pairwise, binary, and scalar feedback datasets. Key results include:
- On Anthropic-HH (pairwise, Pythia-6.9B), DPO induces +150% length increase and a -10% win-rate drop versus the preferred baseline, while PRO variants exhibit negligible length change and maintain or improve win-rate; KTO shows +80% length and -6.6% win-rate, NCA matches PRO performance.
- On UltraFeedback (scalar, Mistral-7B-sft), across downstream QA tasks (AlpacaEval 2, MT-Bench, ARC, IFEval, TruthfulQA, GPQA): PRO-B and PRO-P attain the highest average ranks (1.7 and 2.0, respectively), outperforming DPO, KTO, NCA, and SFT.
- For imbalanced binary feedback (1% desired labels), PRO-B maintains a >50% win rate when its tradeoff parameter is tuned appropriately, while DPO and KTO collapse.
- With more suboptimal examples in scalar feedback (N=4), PRO-S further improves QA benchmark scores.
These results indicate that PRO eliminates reward-hacking and improves robustness across feedback types, with computational costs matched to DPO due to the hyper-response regularizer.
Summary on Anthropic-HH (pairwise, Pythia-6.9B):

| Method | ΔLength (%) | ΔWin-rate (%) |
|---|---|---|
| DPO | +150 | −10 |
| PRO-P | +0 | +0 |
| PRO-B | +1 | +1 |
| KTO | +80 | −6.6 |
| NCA | +0 | +0 |
6. Significance, Limitations, and Future Directions
PRO provides a principled alignment mechanism combining pointwise optimization with a mass-covering, forward-KL regularizer, ensuring stable absolute likelihoods and suppressing pathological behaviors such as reward hacking. The hyper-response approximation renders PRO practically as efficient as DPO, and its design maintains bounded update magnitudes.
Ongoing research includes:
- Extending PRO to on-policy RLHF by learning the score model dynamically.
- Investigating alternative divergences, such as $f$-divergences, in the regularizer term.
- Analyzing the impact of PRO’s mass-covering regularizer on generation diversity and reasoning task exploration.
A plausible implication is that PRO's stability and unified structure may facilitate broader adoption for preference-based alignment in safety-critical and diverse-feedback scenarios (Guo et al., 29 May 2025).