Relative Preference Optimization
- Relative Preference Optimization is a family of algorithms that generalize pairwise learning using weighted comparisons to enhance model alignment with human and domain-specific preferences.
- RPO integrates cross-prompt and hierarchical data with embedding-based weighting to improve data efficiency and stabilize convergence, benefiting applications from medical reasoning to image generation.
- Practical implementations of RPO show measurable gains in win rate and accuracy over traditional RLHF and DPO methods, supporting advanced tasks like language summarization and combinatorial optimization.
Relative Preference Optimization (RPO) is an advanced family of preference-based learning algorithms designed to align machine learning models—especially LLMs, diffusion models, and combinatorial optimizers—with nuanced human or domain-specific preferences. RPO generalizes classic pairwise preference learning by exploiting the structure of cross-prompt, cross-domain, or hierarchical comparison data. It enables more robust preference modeling, data-efficient training, and stable convergence properties across a diverse range of applications, from medical reasoning to robust retrieval, multi-modal generation, and combinatorial optimization.
1. Mathematical Foundations of RPO
Relative Preference Optimization formalizes model alignment through weighted pairwise (or groupwise) preference losses. Let $\pi_\theta$ denote the trainable policy, $\pi_{\mathrm{ref}}$ a fixed reference policy, and $\beta$ a scaling hyperparameter. The canonical RPO loss for a batch of $M$ "win" samples $(x_w^i, y_w^i)$ and $N$ "lose" samples $(x_l^j, y_l^j)$ is:

$$\mathcal{L}_{\mathrm{RPO}} = -\frac{1}{MN}\sum_{i=1}^{M}\sum_{j=1}^{N} w_{ij}\,\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w^i \mid x_w^i)}{\pi_{\mathrm{ref}}(y_w^i \mid x_w^i)} - \beta \log \frac{\pi_\theta(y_l^j \mid x_l^j)}{\pi_{\mathrm{ref}}(y_l^j \mid x_l^j)}\right),$$

where $w_{ij}$ is a row-normalized weight reflecting contrast strength (often computed via embedding similarity) and $\sigma$ is the logistic sigmoid. This cross-pairing generalizes the "diagonal" pairing of Direct Preference Optimization (DPO) so as to incorporate both intra-prompt and inter-prompt contrasts (Yin et al., 12 Feb 2024).
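As a concrete illustration, the following is a minimal PyTorch sketch of this weighted cross-pair objective (a sketch under the assumptions above, not a reference implementation); it assumes per-sequence log-probabilities under the policy and reference models have already been computed, and that the weight matrix is row-normalized.

```python
import torch
import torch.nn.functional as F

def rpo_loss(policy_win_logps, ref_win_logps,
             policy_lose_logps, ref_lose_logps,
             weights, beta=0.1):
    """Weighted cross-pair preference loss (illustrative sketch).

    policy_win_logps, ref_win_logps:   (M,) summed log-probs of "win" responses
    policy_lose_logps, ref_lose_logps: (N,) summed log-probs of "lose" responses
    weights: (M, N) row-normalized contrast weights w_ij
    """
    # Implicit rewards: beta * log(pi_theta / pi_ref) for each sample
    win_margin = beta * (policy_win_logps - ref_win_logps)     # (M,)
    lose_margin = beta * (policy_lose_logps - ref_lose_logps)  # (N,)

    # Cross-pair margins: every win sample i contrasted with every lose sample j
    pairwise_margin = win_margin.unsqueeze(1) - lose_margin.unsqueeze(0)  # (M, N)

    # Weighted log-sigmoid loss, averaged over all M*N contrast pairs
    return -(weights * F.logsigmoid(pairwise_margin)).mean()

# Toy usage with random log-probabilities
M, N = 4, 4
weights = torch.softmax(torch.randn(M, N), dim=-1)  # row-normalized weights
policy_win = torch.randn(M, requires_grad=True)
policy_lose = torch.randn(N, requires_grad=True)
loss = rpo_loss(policy_win, torch.randn(M), policy_lose, torch.randn(N), weights)
loss.backward()
```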
The RPO framework supports further generalization:
- Multi-candidate settings ($k$ responses per prompt), via distance metrics like squared loss or categorical KL (Sun et al., 31 Jan 2025).
- Integration of explicit or implicit reward models, balancing ground-truth preference labels versus likelihood-induced margins.
- Hierarchical and group-based formulations, e.g., incorporating ground-truth, model-generated-but-correct, and model-generated-but-incorrect responses (Kawakami et al., 25 Apr 2025).
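For the hierarchical formulation, one simple (hypothetical) way to realize the ordering ground-truth ≻ correct-generated ≻ incorrect-generated is to expand each prompt's graded responses into win/lose pairs that respect the hierarchy, which can then be fed to the pairwise loss above:

```python
from itertools import product

def hierarchical_pairs(ground_truth, correct_gen, incorrect_gen):
    """Expand a three-level response hierarchy into (win, lose) pairs.

    ground_truth, correct_gen, incorrect_gen: lists of responses for one prompt,
    ordered ground-truth > correct-but-generated > incorrect-generated.
    """
    tiers = [ground_truth, correct_gen, incorrect_gen]
    pairs = []
    # Every response in a higher tier is preferred over every response below it
    for hi in range(len(tiers)):
        for lo in range(hi + 1, len(tiers)):
            pairs.extend(product(tiers[hi], tiers[lo]))
    return pairs

# Example: one prompt with one gold answer and two sampled completions
pairs = hierarchical_pairs(
    ground_truth=["The gold reference answer."],
    correct_gen=["A correct model answer."],
    incorrect_gen=["An incorrect model answer."],
)
# -> [(gold, correct), (gold, incorrect), (correct, incorrect)]
```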
2. RPO Methodological Variants and Weighting Mechanisms
RPO unifies a diverse ecosystem of preference-based objectives:
| Variant | Data Pairing | Loss Structure |
|---|---|---|
| DPO | Paired, identical prompts | Log-sigmoid on paired margins |
| RPO | Cross-prompt, mixed | Weighted log-sigmoid, cross-pairing |
| SimPO, IPO | Paired/Categorical | Squared or KL divergence |
| REINFORCE-LOO | Multi-response | Leave-one-out/squared distance |
Contrastive weighting ($w_{ij}$) may be:
- Uniform (all pairs treated equally)
- Diagonal-emphasized (higher weight on same-prompt pairs)
- Embedding-distance–reweighted (using pretrained sentence or multi-modal encoders, softmax-normalized with temperature $\tau$; see (Yin et al., 12 Feb 2024; Gu et al., 10 Jun 2024)).
Embedding reweighting places greater emphasis on contrast pairs whose prompts or modalities are semantically similar, harnessing richer implicit preference signals for generalization across tasks and domains.
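A hedged sketch of such embedding-distance reweighting is given below, assuming prompt embeddings from any pretrained encoder: cosine similarities between win-sample and lose-sample prompts are softmax-normalized per row with temperature $\tau$, so semantically closer prompts receive larger contrast weight.

```python
import torch
import torch.nn.functional as F

def contrastive_weights(win_prompt_emb, lose_prompt_emb, tau=0.5):
    """Row-normalized contrast weights from prompt embeddings (illustrative).

    win_prompt_emb:  (M, d) embeddings of prompts attached to "win" samples
    lose_prompt_emb: (N, d) embeddings of prompts attached to "lose" samples
    tau: softmax temperature; smaller tau concentrates weight on near-duplicate prompts
    """
    win = F.normalize(win_prompt_emb, dim=-1)
    lose = F.normalize(lose_prompt_emb, dim=-1)
    sim = win @ lose.T                       # (M, N) cosine similarities
    return torch.softmax(sim / tau, dim=-1)  # each row sums to 1

# Toy usage with random 384-dim embeddings (e.g., from a sentence encoder)
w = contrastive_weights(torch.randn(6, 384), torch.randn(6, 384), tau=0.5)
assert torch.allclose(w.sum(dim=-1), torch.ones(6))
```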
3. Algorithmic Implementation Details
Key RPO instantiations follow algorithmic steps:
- Data Preparation: Preference datasets are mined from human annotation, model completions with correctness filtering, or synthetic “rich” critique–edit pipelines (e.g., VLM-driven image improvement (Zhao et al., 13 Mar 2025)). Hierarchies and cross-prompt pairing are constructed explicitly to maximize contrast diversity.
- Loss Computation: For each preference pair, likelihood ratios are computed; contrastive weights are applied; the aggregate loss is backpropagated to update model parameters.
- Optimization Protocols: Modern RPO implementations rely on parameter-efficient adapters such as QLoRA (Kawakami et al., 25 Apr 2025), single- or multi-epoch training, and quantized low-rank adaptation for tractability with ultra-large models.
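A minimal sketch of such an adapter setup, assuming the Hugging Face `transformers`, `peft`, and `bitsandbytes` libraries; the model identifier and LoRA hyperparameters below are placeholders rather than the configurations used in the cited work:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization of the frozen base weights (QLoRA-style)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base = AutoModelForCausalLM.from_pretrained(
    "your-base-model",            # placeholder model identifier
    quantization_config=bnb_config,
    device_map="auto",
)
base = prepare_model_for_kbit_training(base)

# Low-rank adapters on the attention projections (hyperparameters are illustrative)
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
policy = get_peft_model(base, lora)
policy.print_trainable_parameters()
```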
Implementation typically involves batching, reference model scoring, hyperparameter tuning (notably the scaling coefficient $\beta$ and the weighting of any auxiliary NLL term), and embedding computation for contrastive weights. Convergence is usually achieved in one to a few passes over substantial preference datasets on large-scale GPU hardware (2×A100 80GB for medical LLMs (Kawakami et al., 25 Apr 2025)).
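To make the loss-computation step concrete, the sketch below (an assumed typical implementation, not any paper's reference code) gathers per-sequence log-probabilities from token logits under the policy and the frozen reference, then applies a diagonal (DPO-style) pairing for brevity; the cross-pair weighting of Section 1 would instead use `rpo_loss` and `contrastive_weights` as sketched earlier.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits, labels, mask):
    """Sum of per-token log-probs for each sequence (response tokens only).

    logits: (B, T, V) token logits, labels: (B, T) target token ids,
    mask:   (B, T) with 1 for response tokens and 0 for prompt/padding tokens.
    """
    logps = torch.log_softmax(logits, dim=-1)
    token_logps = logps.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # (B, T)
    return (token_logps * mask).sum(dim=-1)                           # (B,)

# Toy batch: 4 sequences (first 2 "win", last 2 "lose"), 8 tokens, vocab of 50
B, T, V, beta = 4, 8, 50, 0.1
labels = torch.randint(0, V, (B, T))
mask = torch.ones(B, T)

policy_logits = torch.randn(B, T, V, requires_grad=True)  # stand-in for a model forward pass
ref_logits = torch.randn(B, T, V)                         # stand-in for the frozen reference

policy_logps = sequence_logprob(policy_logits, labels, mask)
with torch.no_grad():
    ref_logps = sequence_logprob(ref_logits, labels, mask)

# Diagonal pairing for brevity: win sample i is contrasted only with lose sample i
margin = beta * (policy_logps[:2] - ref_logps[:2]) - beta * (policy_logps[2:] - ref_logps[2:])
loss = -F.logsigmoid(margin).mean()
loss.backward()
```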
4. Applications in Model Alignment and Reasoning Stability
Relative Preference Optimization substantially generalizes and strengthens prior preference-learning schemes across domains:
- Medical LLMs: RPO stabilizes reasoning chains, maintaining answer accuracy even when generation is forced to include stepwise explanations (Kawakami et al., 25 Apr 2025). The method achieves state-of-the-art accuracy (0.868) on IgakuQA, eliminating the 3% degradation observed with CPT-only fine-tuning.
- Language and Summarization: RPO yields consistent improvements in head-to-head win rate versus DPO (+3–6pp) and baseline RLHF (Yin et al., 12 Feb 2024), offering a technique to align models using both paired and unpaired preference data, enabled by embedding-weighted cross-prompt comparisons.
- Retrieval-Augmented Generation (RAG): Retrieval Preference Optimization models relevance explicitly within the RPO loss, overcoming mathematical obstacles blocking RLHF/DPO in knowledge-conflict regimes, boosting accuracy by 4–10 pp over vanilla and adaptive RAG, with no increase in inference overhead (Yan et al., 23 Jan 2025).
- Image Diffusion Models: RPO applies cross-modal CLIP-based weighting to align Stable Diffusion with human preference and style, outperforms both SFT and DPO on metrics like HPSv2 and FID across style domains (Gu et al., 10 Jun 2024), and supports rich synthetic data curation pipelines integrating VLM-based critique and image editing (Zhao et al., 13 Mar 2025).
- Combinatorial Optimization: Preference Optimization stabilizes RL training by transforming scalar rewards into robust pairwise preference signals, preserving exploration and sample efficiency in large-scale COPs (e.g., TSP, CVRP), and allows seamless integration of local search improvements within the training loop (Pan et al., 13 May 2025); a simplified sketch follows this list.
- Model Alignment Frameworks: Reward-aware Preference Optimization unifies DPO, IPO, SimPO, REINFORCE-LOO as special cases, providing principled guidance on objective selection, sample size (), reward model choice, and online versus offline iterative training protocols (Sun et al., 31 Jan 2025).
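The combinatorial-optimization bullet above refers to converting scalar solution costs into pairwise preference signals. The following is a hedged, simplified sketch (the function name is hypothetical, and the actual method of Pan et al. may differ in detail) of turning sampled tour costs into win/lose pairs scored with a pairwise logistic loss over solution log-likelihoods:

```python
import torch
import torch.nn.functional as F

def cost_to_preference_loss(log_likelihoods, costs):
    """Pairwise logistic loss from scalar costs (lower cost = preferred).

    log_likelihoods: (K,) log-probability of each sampled solution under the policy
    costs:           (K,) scalar objective values (e.g., tour lengths)
    """
    # Boolean matrix of ordered pairs (i, j) where solution i is strictly better than j
    better = costs.unsqueeze(1) < costs.unsqueeze(0)                      # (K, K)
    margin = log_likelihoods.unsqueeze(1) - log_likelihoods.unsqueeze(0)  # (K, K)
    # Push up the likelihood margin of better solutions over worse ones
    return -(F.logsigmoid(margin)[better]).mean()

# Toy usage: 6 sampled tours with random log-likelihoods and lengths
logp = torch.randn(6, requires_grad=True)
lengths = torch.rand(6) * 100.0
loss = cost_to_preference_loss(logp, lengths)
loss.backward()
```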
5. Quantitative Outcomes, Limitations, and Ablation Insights
Extensive ablation and benchmark studies underscore core RPO findings:
| Setting | Metric | DPO | RPO | PPO |
|---|---|---|---|---|
| Medical QA (IgakuQA) | Accuracy | 0.868 | 0.868 | - |
| HH Dialogues (Mistral-7B) | Win Rate % | 72.26 | 78.52 | 58.98 |
| Summarization (Mistral-7B) | Win Rate % | 48.83 | 50.39 | 39.84 |
| RAG (PopQA) | Accuracy % | 59.0 | 65.4 | - |
| Stable Diffusion (SDXL) | HPSv2 | 28.082 | 28.658 | - |
| Combinatorial Opt. (TSP) | Gap % | 3.40 | 2.86 | - |
- Stability and Efficiency: RPO, especially when embedding-reweighted and when supplied with rich hierarchical or cross-domain preferences, consistently outperforms DPO, RLHF/PPO, and SFT in both model win rate and generalization, often with a single training epoch.
- No Significant Benefit from Larger $k$: Increasing the number of candidates per prompt ($k$) above 2 does not yield additional gains, simplifying implementation (Sun et al., 31 Jan 2025).
- Iterative Online Training: When a reliable reward model is present, iterative online RPO with backward-KL loss delivers maximal OOD win rates (>93% on AlpacaEval for large models).
- Limitations: RPO’s effectiveness depends on preference data quality and embedding models for weighting. Cross-prompt contrast can inject label noise if prompts are only superficially related. Computational cost can be elevated in high-batch cross-pairings or elaborate curation pipelines.
- Future Directions: Embedding-free/self-supervised contrast scoring, optimal transport–based weighting, preference modeling for ranked or real-valued feedback, and integration with adversarially robust critics and editors offer open research avenues.
6. RPO Compared to RLHF, DPO, and Related Schemes
Relative Preference Optimization offers significant procedural advances over RLHF (Reinforcement Learning from Human Feedback):
- Reward Model Independence: Many RPO schemes (notably DPO extensions) operate solely via likelihood ratios, requiring neither a separately fitted reward model nor policy-gradient (PPO) loops, which leads to stable convergence with reduced variance (Kawakami et al., 25 Apr 2025).
- Hierarchical and Cross-Domain Flexibility: RPO naturally extends to hierarchical (e.g., ground-truth ≻ correct-generated ≻ incorrect-generated) and mixed-modality comparison, outperforming single-level RLHF approaches.
- Integrated Retrieval Awareness: Retrieval Preference Optimization directly incorporates retrieval relevance into model alignment losses, a capability lacking in conventional DPO or RLHF (Yan et al., 23 Jan 2025).
- Unified Framework: Reward-aware Preference Optimization mathematically encompasses DPO, IPO, SimPO, and REINFORCE-LOO, permitting both KL-style and regression-style tuning under an explicit oracle or implicit margin (Sun et al., 31 Jan 2025).
7. Theoretical Properties and Convergence Guarantees
Preference-based RPO algorithms inherit stable training dynamics due to convexity of the log-sigmoid loss in the preference margin, favoring single-epoch or few-epoch convergence. In utility-theoretic settings (GLISp-r), plain preference optimization enjoys global convergence guarantees via periodic pure-exploration steps ensuring density of sampled points (Previtali et al., 2022). In reinforcement learning and combinatorial optimization, the sample-based pairwise likelihood loss mitigates reward vanishing and maintains robust exploration. While most LLM RPO applications lack formal convergence proofs, rapid empirical convergence and stability are consistently reported.
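To make the convexity claim explicit, consider the loss as a function of the scalar preference margin $z$ (the difference of scaled policy/reference log-ratios). A short, standard derivation, not specific to any one cited paper:

$$\ell(z) = -\log \sigma(z), \qquad \ell'(z) = \sigma(z) - 1, \qquad \ell''(z) = \sigma(z)\bigl(1 - \sigma(z)\bigr) > 0 .$$

Since $\ell''(z) > 0$, the loss is strictly convex in the margin, which is an affine function of the log-likelihood ratios; this accounts for the smooth optimization behavior noted above, though convexity in the network parameters themselves is not implied.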
RPO establishes a comprehensive, generalizable method for robust preference learning. Through its unification of prior approaches and its capacity to exploit both intra- and inter-domain preference signals, RPO enables superior model alignment in high-stakes, complex settings spanning medicine, retrieval, multi-modal generation, and large-scale discrete optimization.