Relative Preference Optimization (RPO)

Updated 16 July 2025
  • Relative Preference Optimization (RPO) is an advanced framework that aligns large language models with human preferences by incorporating both paired and semantically diverse comparisons.
  • It constructs a contrast matrix and employs cosine similarity for reweighting, effectively extending traditional DPO methods to capture richer human feedback signals.
  • Empirical results across dialogue and summarization tasks demonstrate that RPO improves model alignment and overall performance compared to standard preference optimization techniques.

Relative Preference Optimization (RPO) is an advanced framework for aligning LLMs with human preferences by extending standard preference learning paradigms such as Direct Preference Optimization (DPO). RPO addresses the limitation of DPO—namely, its exclusive utilization of paired preferences from identical prompts—by additionally incorporating contrastive comparisons across semantically related prompts. This expanded perspective aims to more faithfully approximate the multifaceted and comparative nature of human feedback, resulting in superior alignment of LLMs with nuanced user expectations (2402.10958).

1. Motivation and Theoretical Foundations

Relative Preference Optimization is motivated by the recognition that human learning involves not only matching reactions to strictly identical stimuli, but also comparing responses to akin or thematically related situations. DPO, a prevailing method for preference-based alignment, employs a pairwise objective grounded in the log-probability ratio between a preferred ("win") and a rejected ("loss") response to the same prompt, scaled by a regularization factor β:

$$
L_\text{DPO}(\pi_\theta; \pi_\text{ref}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim D}\left[ \log \sigma\!\left( \beta \left( \log \frac{\pi_\theta(y_w \mid x)}{\pi_\text{ref}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_\text{ref}(y_l \mid x)} \right) \right) \right]
$$
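
For concreteness, a minimal PyTorch sketch of this objective is shown below; the function and argument names are illustrative rather than taken from the paper, and each log-probability is assumed to be summed over the response tokens.

```python
import torch.nn.functional as F

def dpo_loss(policy_win_logps, policy_lose_logps,
             ref_win_logps, ref_lose_logps, beta=0.1):
    """DPO loss from per-sequence log-probabilities (each tensor has shape (batch,))."""
    # Log-probability ratios log(pi_theta / pi_ref) for the win and loss responses.
    win_ratio = policy_win_logps - ref_win_logps
    lose_ratio = policy_lose_logps - ref_lose_logps
    # -E[log sigmoid(beta * (win ratio - loss ratio))]
    return -F.logsigmoid(beta * (win_ratio - lose_ratio)).mean()
```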

However, this methodology neglects the rich spectrum of human judgements that use relative and contextually diverse comparisons. RPO augments DPO by explicitly contrasting responses not only across identical prompts but also across prompts with substantive semantic similarity, leveraging a broader, more informative supervisory signal (2402.10958).

2. Methodological Framework and Algorithm

a. Contrast Matrix Construction

RPO builds a batch-wise contrast matrix $\mathbf{C}$ to encapsulate all possible pairwise comparisons between preferred responses (wins) and non-preferred responses (losses). Let $M$ and $N$ denote the numbers of win and loss samples in the batch, respectively:

  • Paired setting: The matrix is $M \times M$, where diagonal entries correspond to traditional DPO-type comparisons (identical prompts), and off-diagonal entries correspond to cross-prompt comparisons.
  • Unpaired setting: The matrix is $M \times N$, allowing for all win–lose combinations across distinct prompts.

For each $(i, j)$ pair, the relative preference score is computed as:

$$
s_{ij} = \omega_{ij}\,\beta\left[\log \frac{\pi_\theta(y_{w,i} \mid x_{w,i})}{\pi_\text{ref}(y_{w,i} \mid x_{w,i})} - \log \frac{\pi_\theta(y_{l,j} \mid x_{l,j})}{\pi_\text{ref}(y_{l,j} \mid x_{l,j})}\right]
$$

where $\omega_{ij}$ is a contrastive weighting factor quantifying the semantic similarity between the prompts $x_{w,i}$ and $x_{l,j}$.
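
As an illustration, the full score matrix can be formed with a single broadcasted subtraction. The sketch below assumes the per-sequence log-probability ratios have already been computed, and takes the weight matrix $\omega$ as an input (its construction is covered in the next subsection); names are illustrative.

```python
import torch

def contrast_scores(win_logratios, lose_logratios, omega, beta=0.1):
    """Relative preference scores s_ij over the full contrast matrix.

    win_logratios:  shape (M,), log pi_theta(y_w,i|x_w,i) - log pi_ref(y_w,i|x_w,i)
    lose_logratios: shape (N,), log pi_theta(y_l,j|x_l,j) - log pi_ref(y_l,j|x_l,j)
    omega:          shape (M, N), contrastive weights
    """
    # Broadcast so every win is contrasted against every loss, including cross-prompt pairs.
    diff = win_logratios.unsqueeze(1) - lose_logratios.unsqueeze(0)  # (M, N)
    return omega * beta * diff
```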

b. Contrastive Weighting Mechanism

A central innovation of RPO is contrastive reweighting, which modulates the influence of each comparison based on prompt similarity. This is achieved by computing cosine similarity between prompt embeddings:

  • Compute an embedding $f(x)$ for each prompt (using, e.g., Sentence-T5-Large or all-MiniLM-L6-v2).
  • Calculate an unnormalized weight from the cosine distance $d_{\cos}$ between the two embeddings:

$$
\tilde{\tau}_{ij} = \exp\!\left( -\frac{d_{\cos}\big(f(x_{w,i}),\, f(x_{l,j})\big)}{\tau} \right)
$$

  • Normalize within the batch:

$$
\omega_{ij} = \frac{\tilde{\tau}_{ij}}{\sum_{j'} \tilde{\tau}_{ij'}}
$$

A lower cosine distance (greater similarity) results in a higher $\omega_{ij}$, focusing learning on semantically proximal comparisons.
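
A sketch of this reweighting step is shown below, assuming the sentence-transformers package and the cosine-distance and temperature choices described above; the default model name and $\tau$ value are illustrative placeholders, not settings from the paper.

```python
import torch
from sentence_transformers import SentenceTransformer

def contrastive_weights(win_prompts, lose_prompts, tau=0.5,
                        model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """omega_ij from prompt embeddings: cosine distance -> exp(-d/tau) -> row-normalize."""
    encoder = SentenceTransformer(model_name)
    e_w = torch.tensor(encoder.encode(win_prompts, normalize_embeddings=True))   # (M, d)
    e_l = torch.tensor(encoder.encode(lose_prompts, normalize_embeddings=True))  # (N, d)
    cos_sim = e_w @ e_l.T            # (M, N); embeddings are unit-normalized
    dist = 1.0 - cos_sim             # cosine distance: smaller means more similar prompts
    unnorm = torch.exp(-dist / tau)  # similar prompt pairs receive larger weights
    return unnorm / unnorm.sum(dim=1, keepdim=True)  # normalize over losses j for each win i
```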

c. Final Loss Aggregation

The aggregate RPO loss over the entire contrast matrix is

$$
L_\text{RPO} = -\frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \log \sigma(s_{ij})
$$

where $\sigma$ is the sigmoid function. This loss favors increasing the reward for preferred responses and decreasing it for dispreferred ones, on both diagonal and off-diagonal contrast pairs (2402.10958).
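
Given a score matrix such as the one produced by the contrast_scores sketch above, the aggregation itself is a one-liner:

```python
import torch.nn.functional as F

def rpo_loss(scores):
    """Mean of -log sigmoid(s_ij) over all M*N entries of the contrast matrix."""
    return -F.logsigmoid(scores).mean()
```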

d. Algorithmic Outline

The training loop for each batch follows:

  1. Compute prompt embeddings.
  2. For every win–lose pair, calculate cosine similarities.
  3. Derive normalized weights $\omega_{ij}$.
  4. Calculate contrastive scores $s_{ij}$.
  5. Compute the RPO loss.
  6. Update model parameters via gradient descent.

The RPO algorithm can be implemented in existing LLM training infrastructures with minimal overhead beyond the computation of prompt similarities and the additional matrix operations.
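
Putting the pieces together, a single batch update might look like the following sketch. It reuses the hypothetical contrastive_weights, contrast_scores, and rpo_loss helpers from the earlier snippets, and assumes the policy log-probabilities in the batch were computed with gradients enabled while the reference log-probabilities were computed under torch.no_grad(); the batch keys are illustrative.

```python
def rpo_training_step(batch, optimizer, beta=0.1, tau=0.5):
    """One RPO update following steps 1-6 above; batch keys are illustrative."""
    # Steps 1-3: embed prompts, compute cosine distances, derive normalized weights omega_ij.
    omega = contrastive_weights(batch["win_prompts"], batch["lose_prompts"], tau=tau)

    # Step 4: contrastive scores s_ij from policy/reference log-probability ratios.
    win_logratios = batch["policy_win_logps"] - batch["ref_win_logps"]
    lose_logratios = batch["policy_lose_logps"] - batch["ref_lose_logps"]
    scores = contrast_scores(win_logratios, lose_logratios, omega, beta=beta)

    # Steps 5-6: aggregate loss and gradient update.
    loss = rpo_loss(scores)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```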

3. Empirical Validation and Comparative Performance

Extensive experiments on diverse tasks (including OpenAI Summarization, Anthropic Helpful and Harmless (HH) dialogue, and AlpacaEval2.0) validate RPO’s effectiveness:

  • Dialogue (Mistral-7B): DPO baseline win rate ≈ 72.26%; RPO with paired data and embedding-based weighting achieves 78.52%.
  • Summarization: Analogous improvements observed, demonstrating RPO’s benefits across modalities.
  • Generalization: On AlpacaEval2.0 (instruction-following), RPO consistently outperforms SFT, PPO, DPO, IPO, and KTO, indicating broader robustness.

The contrastive weighting mechanism is empirically superior to uniform or diagonal-only weighting: Semantically aligned off-diagonal pairs carry meaningful learning signals, while weighting all pairs equally or focusing only on diagonals narrows the learning scope (2402.10958).

4. Practical Implications and Implementation Considerations

RPO’s methodology introduces practical opportunities and requirements:

  • Scalability: The matrix-based loss increases computation per batch, especially for large $M$ and $N$ and for high-dimensional embedding spaces. Efficient batching and optimized similarity computations are crucial for large-scale training.
  • Batch Construction: Mixed batches of paired and unpaired data are supported, broadening the types of available preference data.
  • Embedding Selection: The choice of embedding model ($f$) affects similarity quality and, by extension, reweighting effectiveness. Lightweight, high-recall models such as all-MiniLM-L6-v2 may be preferred for efficiency.
  • Hyperparameter Tuning: The temperature $\tau$ controls the sensitivity to semantic similarity. Smaller $\tau$ values produce sharper, more peaked reweighting distributions (see the small example after this list); tuning is dataset- and task-dependent.
  • Integration: RPO’s design is compatible with prevalent LLM alignment pipelines and can be integrated into existing preference-based fine-tuning workflows with moderate adjustments.
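
To illustrate the role of $\tau$, the toy example below normalizes weights for three hypothetical cosine distances; the distance values are invented purely for demonstration.

```python
import torch

# Three hypothetical cosine distances between a win prompt and three loss prompts.
distances = torch.tensor([0.05, 0.30, 0.80])

for tau in (0.1, 0.5, 2.0):
    # softmax(-d/tau) is equivalent to normalizing exp(-d/tau), as in the weighting formula.
    weights = torch.softmax(-distances / tau, dim=0)
    print(f"tau={tau}: {[round(w, 3) for w in weights.tolist()]}")

# Small tau concentrates almost all weight on the most similar prompt pair;
# large tau spreads the weight more evenly across all pairs.
```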

5. Relation to Broader Preference Optimization Paradigms

RPO generalizes and is complementary to other optimization schemes:

  • DPO: RPO strictly contains DPO as a special case; restricting the contrast matrix to its diagonal recovers the DPO objective (made explicit after this list).
  • Contrastive Learning: The use of semantic similarity-weighted losses connects to established techniques in contrastive representation and metric learning.
  • Online and Offline Settings: RPO is compatible with both static (pre-collected) preference datasets and dynamic, on-the-fly generation of new contrast pairs.
  • General Applicability: While RPO was developed for textual LLM alignment, the theoretical framework naturally extends to other modalities (e.g., diffusion-based generative models), provided a meaningful similarity metric is available for reweighting.
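
To make the reduction explicit: keeping only the diagonal comparisons (each win contrasted solely with the loss for its own prompt, which shares the same prompt $x_i$ in the paired setting, with $\omega_{ii} = 1$) collapses the RPO loss to

$$
L_\text{RPO}\big|_\text{diag} = -\frac{1}{M} \sum_{i=1}^{M} \log \sigma\!\left( \beta \left[ \log \frac{\pi_\theta(y_{w,i} \mid x_i)}{\pi_\text{ref}(y_{w,i} \mid x_i)} - \log \frac{\pi_\theta(y_{l,i} \mid x_i)}{\pi_\text{ref}(y_{l,i} \mid x_i)} \right] \right)
$$

which is exactly the DPO objective shown above.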

6. Impact and Future Directions

The adoption of RPO marks a significant step toward making LLM alignment more reflective of human comparative reasoning. Empirical results indicate enhancements in both alignment and generalization. Future research directions include:

  • Automated Construction of Semantically Rich Preference Data: Leveraging large, heterogeneous datasets with minimal manual curation.
  • Extension to Other Modalities and Domains: Applying RPO-style optimization to vision-language models, code generation, and multi-agent systems.
  • Advanced Similarity Metrics: Exploring richer embedding strategies and modalities for more nuanced contrastive weighting.
  • Robustness to Noisy Preferences: Integrating robust strategies for noise or label uncertainty, possibly informed by parallel advances in robust preference optimization.

In summary, Relative Preference Optimization provides an extensible and empirically validated extension to existing preference-based model alignment strategies. By incorporating contrast signals across semantically diverse prompt–response pairs and applying principled weighting, RPO more closely mimics human comparative learning, leading to improved model alignment as manifested across multiple real-world tasks (2402.10958).

References (1)

1. Relative Preference Optimization: Enhancing LLM Alignment through Contrasting Responses across Identical and Diverse Prompts. arXiv:2402.10958.