
LiPO-λ: Listwise Preference Optimization

Updated 8 May 2026
  • LiPO-λ is a listwise preference optimization method that leverages LambdaLoss weighting on ranked response sets to produce label-sensitive updates.
  • The algorithm employs a listwise learning-to-rank objective, integrating gain and discount functions to adjust pairwise contributions based on response ranking magnitudes.
  • Empirical evaluations on summarization and dialogue tasks demonstrate that LiPO-λ outperforms methods like DPO and SLiC in both proxy reward and human quality assessments.

LiPO-λ (Lambda-Loss Listwise Preference Optimization) is a listwise policy optimization algorithm developed to align LLMs with rankwise human or AI-generated preference data. Building on the observation that preference feedback in practical LM alignment often consists of ranked lists rather than binary comparisons, LiPO-λ leverages a listwise learning-to-rank (LTR) objective incorporating LambdaLoss weighting, producing listwise- and label-sensitive updates. It generalizes several prominent preference optimization objectives, including DPO and SLiC, and empirically outperforms these on canonical LLM alignment tasks (Liu et al., 2024).

1. Listwise Objective and LambdaLoss Formulation

LiPO-λ treats each datapoint as a prompt $x$ paired with a list of $K$ responses $\mathbf{y} = (y_1, \ldots, y_K)$ and their corresponding scalar preference scores $\boldsymbol\psi = (\psi_1,\ldots,\psi_K)$. For each response, a score is computed:

$$s_i = \beta \log\frac{\pi_\theta(y_i\mid x)}{\pi_{\rm ref}(y_i\mid x)}$$

where $\pi_\theta$ is the trainable policy, $\pi_{\rm ref}$ is the fixed reference (SFT) policy, and $\beta>0$ is a KL-control coefficient.
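The score definition above can be sketched in a few lines. This is a minimal illustration, assuming each response's sequence log-probability under the policy and the reference has already been computed; the function name `implicit_scores` and all numeric values (including `beta=0.1`) are hypothetical.

```python
def implicit_scores(policy_logps, ref_logps, beta):
    """Return s_i = beta * (log pi_theta(y_i|x) - log pi_ref(y_i|x)) per response."""
    if len(policy_logps) != len(ref_logps):
        raise ValueError("policy and reference must score the same K responses")
    return [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]

# Hypothetical summed token log-probabilities for K = 3 responses.
policy_logps = [-12.0, -15.5, -20.0]
ref_logps = [-13.0, -15.0, -18.0]
scores = implicit_scores(policy_logps, ref_logps, beta=0.1)
```

A positive score means the policy assigns the response more probability than the reference does; a negative score means less.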

The core per-example loss, written as a quantity to be minimized, is

$$\ell_{\lambda}(\boldsymbol\psi, \mathbf{s}) = \sum_{i,j:\,\psi_i>\psi_j} \Delta_{i,j}\log\left(1 + e^{-(s_i - s_j)}\right)$$

with Lambda weight

$$\Delta_{i,j} = |G(\psi_i) - G(\psi_j)| \cdot \left| D(\tau(i))^{-1} - D(\tau(j))^{-1} \right|$$

where

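  • $G(\cdot)$ is the gain function, commonly $G(\psi_i) = 2^{\psi_i} - 1$
  • $D(\cdot)$ is the discount function, typically $D(\tau) = \log(1 + \tau)$
  • $\tau(i)$ is the predicted rank of item $y_i$ under model scores $\mathbf{s}$ (sorted descending).

The full objective averages this loss over all prompt–response lists in the dataset $\mathcal{D}$:

$$\mathcal{L}(\theta) = \mathbb{E}_{(x,\mathbf{y},\boldsymbol\psi)\sim\mathcal{D}}\left[\ell_{\lambda}(\boldsymbol\psi, \mathbf{s})\right]$$

This design leverages all $\binom{K}{2}$ response pairs and adapts their contribution via LambdaLoss scaling.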
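The per-example loss can be sketched directly from the definitions of the gain, discount, and predicted rank. The sketch below writes the loss as a quantity to minimize, i.e. the weighted sum of $\log(1 + e^{-(s_i - s_j)})$ over pairs with $\psi_i > \psi_j$. The function names, the natural-log base of the discount, and the example labels and scores are assumptions made for illustration.

```python
import math

def predicted_ranks(scores):
    """tau(i): 1-based rank of item i when scores are sorted in descending order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, item in enumerate(order, start=1):
        ranks[item] = position
    return ranks

def gain(psi_i):
    return 2.0 ** psi_i - 1.0

def discount(rank):
    # Natural log is an assumption; another base rescales every weight equally.
    return math.log(1.0 + rank)

def lambda_weight(psi, ranks, i, j):
    """Delta_{i,j} = |G(psi_i) - G(psi_j)| * |1/D(tau(i)) - 1/D(tau(j))|."""
    gain_gap = abs(gain(psi[i]) - gain(psi[j]))
    discount_gap = abs(1.0 / discount(ranks[i]) - 1.0 / discount(ranks[j]))
    return gain_gap * discount_gap

def lipo_lambda_loss(psi, scores):
    """Weighted pairwise logistic loss over all pairs with psi_i > psi_j."""
    ranks = predicted_ranks(scores)
    total = 0.0
    for i in range(len(psi)):
        for j in range(len(psi)):
            if psi[i] > psi[j]:
                weight = lambda_weight(psi, ranks, i, j)
                total += weight * math.log1p(math.exp(-(scores[i] - scores[j])))
    return total

# Hypothetical labels and two candidate score vectors for K = 3.
psi = [0.9, 0.5, 0.1]
loss_aligned = lipo_lambda_loss(psi, [2.0, 1.0, 0.0])   # scores agree with labels
loss_reversed = lipo_lambda_loss(psi, [0.0, 1.0, 2.0])  # scores invert the labels
```

Scores that order the responses as the labels do give a smaller loss than scores that invert them, which is the behavior the weighting is meant to reward.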

2. LambdaLoss Weighting: Listwise and Label Sensitivity

The “λ” in LiPO-λ specifically refers to the LambdaLoss weighting scheme. Every pair $(y_i, y_j)$ with $\psi_i > \psi_j$ is weighted not uniformly, but by $\Delta_{i,j}$, incorporating:

  • Gain sensitivity: $|G(\psi_i) - G(\psi_j)|$ incorporates the magnitude of preference between responses, in contrast to merely using their ordering.
  • Listwise sensitivity: $|D(\tau(i))^{-1} - D(\tau(j))^{-1}|$ introduces dependence on the full ranking of items as predicted by the current model policy.

Omitting these weights ($\Delta_{i,j} = 1$) reduces the objective to the plain pairwise logistic (Bradley–Terry) loss. With $K = 2$, this yields the DPO$_{\rm BT}$ loss. The label- and listwise-sensitivity are essential for exploiting the structure of ranked preference lists and for more faithful alignment to ranking metrics such as discounted cumulative gain (DCG).
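The reduction to the pairwise case can be checked numerically. The sketch below assumes the usual Bradley–Terry form of the DPO loss, $-\log\sigma(s_w - s_l)$, and shows that it coincides with the unit-weight logistic term. It also illustrates that for $K = 2$ the predicted ranks are always 1 and 2, so the discount factor of the Lambda weight is a constant and only the gain gap varies. All function names and values are hypothetical.

```python
import math

def pairwise_logistic(s_i, s_j, weight=1.0):
    """One term of the listwise loss: weight * log(1 + exp(-(s_i - s_j)))."""
    return weight * math.log1p(math.exp(-(s_i - s_j)))

def dpo_bt_loss(s_w, s_l):
    """Bradley-Terry form: -log sigmoid(s_w - s_l)."""
    sigmoid = 1.0 / (1.0 + math.exp(-(s_w - s_l)))
    return -math.log(sigmoid)

def k2_lambda_weight(psi_w, psi_l):
    """Lambda weight for K = 2 (natural-log discount assumed)."""
    gain_gap = abs((2.0 ** psi_w - 1.0) - (2.0 ** psi_l - 1.0))
    # Ranks are {1, 2} whichever response the model currently prefers.
    discount_gap = abs(1.0 / math.log(2.0) - 1.0 / math.log(3.0))
    return gain_gap * discount_gap

s_w, s_l = 0.3, -0.2  # hypothetical scores for the preferred and dispreferred response
unit_weight_loss = pairwise_logistic(s_w, s_l)
bt_loss = dpo_bt_loss(s_w, s_l)
weight_for_pair = k2_lambda_weight(1.0, 0.0)
```

With two responses, the Lambda-weighted loss is therefore the Bradley–Terry loss scaled by a factor that depends only on the label gap.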

3. Gradient Computation and Optimization

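LiPO-λ's pairwise logistic kernel yields nearly closed-form gradients with respect to model scores once the weights $\Delta_{i,j}$ are treated as constants (they depend on $\mathbf{s}$ only through the piecewise-constant ranks $\tau$). Differentiating each term $\Delta_{i,j}\log\left(1 + e^{-(s_i - s_j)}\right)$ gives, for each $i$,

$$\frac{\partial \ell_{\lambda}}{\partial s_i} = -\sum_{j:\,\psi_i>\psi_j} \Delta_{i,j}\,\sigma(s_j - s_i) + \sum_{j:\,\psi_j>\psi_i} \Delta_{j,i}\,\sigma(s_i - s_j)$$

where $\sigma(x) = 1/(1 + e^{-x})$ is the logistic sigmoid.

The policy parameter gradients follow by the chain rule:

$$\nabla_\theta \ell_{\lambda} = \sum_{i=1}^{K} \frac{\partial \ell_{\lambda}}{\partial s_i}\,\nabla_\theta s_i, \qquad \nabla_\theta s_i = \beta\,\nabla_\theta \log \pi_\theta(y_i\mid x)$$

Optimization proceeds with a stochastic gradient-based optimizer (e.g., Adam or Adafactor). In the plain gradient-descent case the policy update is

$$\theta \leftarrow \theta - \eta\,\nabla_\theta \mathcal{L}$$

where $\mathcal{L}$ is the dataset-averaged loss and $\eta$ is the learning rate.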
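The score gradient can be verified against a finite-difference estimate. The sketch below treats the pair weights as fixed constants (an assumption that matches how rank-dependent weights behave away from ties) and uses the loss written as a quantity to minimize. The weight values, scores, and function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def weighted_loss(weights, scores):
    """Sum of w_ij * log(1 + exp(-(s_i - s_j))) over pairs (i, j) with psi_i > psi_j."""
    return sum(w * math.log1p(math.exp(-(scores[i] - scores[j])))
               for (i, j), w in weights.items())

def score_gradient(weights, scores):
    """Analytic d(loss)/d(s_k) with the weights held constant."""
    grad = [0.0] * len(scores)
    for (i, j), w in weights.items():
        push = w * sigmoid(scores[j] - scores[i])
        grad[i] -= push  # raising the preferred score lowers the loss
        grad[j] += push  # raising the dispreferred score raises the loss
    return grad

def finite_difference(weights, scores, eps=1e-6):
    """Central-difference estimate of the same gradient."""
    grad = []
    for k in range(len(scores)):
        up = list(scores)
        down = list(scores)
        up[k] += eps
        down[k] -= eps
        grad.append((weighted_loss(weights, up) - weighted_loss(weights, down)) / (2 * eps))
    return grad

# Hypothetical fixed weights for the pairs (0,1), (0,2), (1,2) and K = 3 scores.
weights = {(0, 1): 0.24, (0, 2): 0.57, (1, 2): 0.06}
scores = [0.3, -0.1, 0.2]
analytic = score_gradient(weights, scores)
numeric = finite_difference(weights, scores)
```

Because every pair contributes equal and opposite pushes to its two scores, the components of the score gradient sum to zero.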

4. Comparative Analysis with DPO and SLiC

LiPO-λ subsumes earlier objectives as limiting cases:

  • DPO$_{\rm BT}$: Set $K = 2$, $\Delta_{i,j} = 1$ to recover a pairwise logistic loss.
  • SLiC$_{\rm norm}$: For $K = 2$ and a hinge kernel, recovers normalized hinge loss.
  • DPO$_{\rm PL}$ (Plackett–Luce list-MLE): Optimizes

$$\ell_{\rm PL}(\boldsymbol\psi, \mathbf{s}) = -\log \prod_{k=1}^{K} \frac{\exp(s_{\rho(k)})}{\sum_{j=k}^{K} \exp(s_{\rho(j)})}$$

where $\rho(k)$ is the index of the response with the $k$-th highest label. This objective respects only the static label permutation, not label magnitudes or the predicted permutation.
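The Plackett–Luce list-MLE objective can be sketched as follows, which also makes its insensitivity to label magnitudes concrete. This is an illustrative implementation of the standard list-MLE form under the assumption that labels have no ties; the function name and example values are hypothetical.

```python
import math

def list_mle_loss(psi, scores):
    """Negative Plackett-Luce log-likelihood of the label-sorted permutation."""
    order = sorted(range(len(psi)), key=lambda i: -psi[i])  # best label first
    loss = 0.0
    for k in range(len(order)):
        tail = [scores[i] for i in order[k:]]
        peak = max(tail)
        # Numerically stable log-sum-exp over the items not yet placed.
        log_norm = peak + math.log(sum(math.exp(t - peak) for t in tail))
        loss -= scores[order[k]] - log_norm
    return loss

scores = [0.5, -0.2, 0.1]  # hypothetical model scores
loss_wide_labels = list_mle_loss([3.0, 2.0, 1.0], scores)
loss_narrow_labels = list_mle_loss([0.9, 0.2, 0.1], scores)
```

Both label vectors induce the same ordering, so the two losses are identical even though the label gaps differ, whereas a gain-weighted objective would distinguish them.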

By contrast, LiPO-λ:

  • Uses all $\binom{K}{2}$ pairs with non-uniform, listwise weights $\Delta_{i,j}$.
  • Encodes both label-magnitude and predicted-rank (permutation) sensitivity.
  • Retains a smooth logistic kernel, in contrast to the listwise hinge in SLiC.
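The difference between the two kernels can be illustrated by comparing their slopes as a function of the score margin $s_i - s_j$. The hinge form below, $\max(0, \delta - (s_i - s_j))$ with $\delta = 1$, is an assumed generic hinge used only for illustration; the function names are hypothetical.

```python
import math

def logistic_kernel(margin):
    """log(1 + exp(-margin)): smooth and strictly positive for every margin."""
    return math.log1p(math.exp(-margin))

def logistic_slope(margin):
    """Derivative of the logistic kernel with respect to the margin."""
    return -1.0 / (1.0 + math.exp(margin))

def hinge_kernel(margin, delta=1.0):
    """max(0, delta - margin): exactly zero once the margin exceeds delta."""
    return max(0.0, delta - margin)

def hinge_slope(margin, delta=1.0):
    """Subgradient of the hinge kernel with respect to the margin."""
    return -1.0 if margin < delta else 0.0

margins = [-1.0, 0.0, 0.5, 2.0]
logistic_slopes = [logistic_slope(m) for m in margins]
hinge_slopes = [hinge_slope(m) for m in margins]
```

The hinge slope switches off abruptly once a pair is separated by the margin, while the logistic slope decays smoothly and never reaches zero.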

The following table summarizes these distinctions:

Method | Pairwise/Listwise | Label Sensitivity | Kernel Type
DPO$_{\rm BT}$ | Pairwise ($K = 2$) | No | Logistic (BT)
SLiC$_{\rm norm}$ | Pairwise ($K = 2$) | No | Hinge
DPO$_{\rm PL}$ | Listwise | Ordering only | List-MLE (PL)
LiPO-λ | Listwise | Yes (magnitude) | Logistic (RankNet)

5. Empirical Performance

LiPO-λ was evaluated on two public LM alignment tasks:

  • Reddit TL;DR summarization
  • AnthropicHH dialogue

Models were fine-tuned from a T5-large (770M) SFT baseline. For each prompt, $K$ response candidates were sampled using temperature and top-$k$ sampling, and all response pairs were labeled using a T5-XXL reward model. The main training hyperparameters are the batch size, the learning rate, and $\beta$.

Automatic evaluation against the reward model (“proxy reward”) and via PaLM 2–IT side-by-side (“AutoSxS”) was complemented by human side-by-side and pointwise quality assessments.

Reported Automatic Metrics (Proxy Reward and AutoSxS, Table 1; T5-large policy)

Method | TL;DR (Proxy) | HH (Proxy) | TL;DR (AutoSxS) | HH (AutoSxS)
DPO$_{\rm BT}$ | 88.52% | 91.11% | 67.09% | 44.80%
DPO$_{\rm PL}$ | 88.27% | 90.61% | 67.23% | 43.25%
LiPO-λ | 90.60% | 92.60% | 68.06% | 47.90%

With a T5-XXL policy, LiPO-λ led by approximately 1 percentage point on both metrics (Table 2).

Human Side-by-Side (Table 3)

For TL;DR, LiPO-λ was preferred 40% of the time (compared to 19%/16% for baselines); for HH, LiPO-λ attained 27% preference (20%/20% for baselines), with higher mean pointwise quality ratings.

This suggests that listwise and label-sensitive objectives enable more effective use of listwise preference feedback, particularly as $K$ increases.

6. Implementation and Training Schema

The LiPO-λ pipeline follows the procedure outlined in Algorithm 1: for each prompt, sample $K$ responses, label them with the reward model to obtain $\boldsymbol\psi$, compute the scores $s_i$ and the Lambda weights $\Delta_{i,j}$ under the current policy, and take a gradient step on the listwise loss.

Key hyperparameters are the optimizer (Adafactor/Adam), learning rate, batch size, $\beta$, the list size $K$, the sampling temperature, and top_k.
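The training procedure can be illustrated with a toy loop. This is a sketch under strong simplifying assumptions, not the procedure of Algorithm 1: the "policy" is just one free log-probability per response (no language model), plain gradient descent replaces Adafactor/Adam, the discount uses the natural log, and all names and values (`beta=0.1`, `lr=1.0`, `steps=200`, the labels and reference log-probabilities) are hypothetical.

```python
import math

def ranks_desc(scores):
    """1-based predicted ranks under descending score order."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, item in enumerate(order, start=1):
        ranks[item] = position
    return ranks

def pair_weights(psi, scores):
    """Lambda weights Delta_{i,j} for every pair with psi_i > psi_j."""
    ranks = ranks_desc(scores)
    weights = {}
    for i in range(len(psi)):
        for j in range(len(psi)):
            if psi[i] > psi[j]:
                gain_gap = abs(2.0 ** psi[i] - 2.0 ** psi[j])
                disc_gap = abs(1.0 / math.log(1 + ranks[i]) - 1.0 / math.log(1 + ranks[j]))
                weights[(i, j)] = gain_gap * disc_gap
    return weights

def loss_and_grad(psi, scores):
    """Listwise loss and its score gradient, with weights held constant."""
    loss, grad = 0.0, [0.0] * len(scores)
    for (i, j), w in pair_weights(psi, scores).items():
        loss += w * math.log1p(math.exp(-(scores[i] - scores[j])))
        push = w / (1.0 + math.exp(scores[i] - scores[j]))  # w * sigmoid(s_j - s_i)
        grad[i] -= push
        grad[j] += push
    return loss, grad

def train(psi, ref_logps, beta=0.1, lr=1.0, steps=200):
    theta = list(ref_logps)  # toy policy: starts equal to the reference
    history = []
    for _ in range(steps):
        scores = [beta * (t - r) for t, r in zip(theta, ref_logps)]
        loss, grad = loss_and_grad(psi, scores)
        history.append(loss)
        # Chain rule: d(s_i)/d(theta_i) = beta in this toy parameterization.
        theta = [t - lr * beta * g for t, g in zip(theta, grad)]
    final_scores = [beta * (t - r) for t, r in zip(theta, ref_logps)]
    return final_scores, history

psi = [0.9, 0.5, 0.1]              # hypothetical reward-model labels
ref_logps = [-14.0, -12.0, -13.0]  # hypothetical reference log-probabilities
final_scores, history = train(psi, ref_logps)
```

In a real setup the per-response log-probabilities would come from the language model and the parameter gradient would be obtained by automatic differentiation; the toy only shows how scores, weights, and the update fit together.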

7. Significance and Utilization

LiPO-λ augments standard pairwise preference optimization with listwise-aware LambdaLoss weights, enabling smooth optimization and effective learning from ranked preference lists. It provides a principled framework for mapping LM alignment to LTR objectives, formally subsumes important special cases (DPO, SLiC), and empirically delivers consistent gains across alignment benchmarks (Liu et al., 2024). A plausible implication is that as ranked preference data becomes more prevalent, listwise objectives such as LiPO-λ represent a robust methodological direction for preference-based LM alignment.
