LiPO-λ: Listwise Preference Optimization

Updated 8 May 2026

LiPO-λ is a listwise preference optimization method that leverages LambdaLoss weighting on ranked response sets to produce label-sensitive updates.
The algorithm employs a listwise learning-to-rank objective whose gain and discount functions scale each pairwise contribution by the label gap and the predicted rank positions of the two responses.
Empirical evaluations on summarization and dialogue tasks demonstrate that LiPO-λ outperforms methods like DPO and SLiC in both proxy reward and human quality assessments.

LiPO-λ (Lambda-Loss Listwise Preference Optimization) is a listwise policy optimization algorithm developed to align large language models (LLMs) with ranked human- or AI-generated preference data. Building on the observation that preference feedback in practical LM alignment often consists of ranked lists rather than binary comparisons, LiPO-λ uses a listwise learning-to-rank (LTR) objective with LambdaLoss weighting, producing listwise- and label-sensitive updates. It generalizes several prominent preference optimization objectives, including DPO and SLiC, and empirically outperforms them on canonical LLM alignment tasks (Liu et al., 2024).

1. Listwise Objective and LambdaLoss Formulation

LiPO-λ treats each datapoint as a prompt $x$ paired with a list of $K$ responses $\mathbf{y} = (y_1, \ldots, y_K)$ and their corresponding scalar preference scores $\boldsymbol\psi = (\psi_1,\ldots,\psi_K)$. For each response, a score is computed: $s_i = \beta \log\frac{\pi_\theta(y_i\mid x)}{\pi_{\rm ref}(y_i\mid x)}$, where $\pi_\theta$ is the trainable policy, $\pi_{\rm ref}$ is the fixed reference (SFT) policy, and $\beta>0$ is a KL-control coefficient.
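As a small illustration of the score definition above, the sketch below computes $s_i$ from sequence log-probabilities. The function name `implicit_scores` and the numeric log-probabilities are hypothetical, chosen only for the example.

```python
def implicit_scores(policy_logps, ref_logps, beta=0.1):
    """Return s_i = beta * (log pi_theta(y_i|x) - log pi_ref(y_i|x)) per response."""
    if len(policy_logps) != len(ref_logps):
        raise ValueError("need one reference log-prob per policy log-prob")
    return [beta * (p - r) for p, r in zip(policy_logps, ref_logps)]

# Hypothetical sequence log-probabilities for K = 3 responses to one prompt.
policy_logps = [-12.0, -15.0, -9.0]
ref_logps = [-13.0, -14.0, -9.0]
scores = implicit_scores(policy_logps, ref_logps, beta=0.5)
print(scores)  # [0.5, -0.5, 0.0]
```

A positive score means the policy assigns the response more probability than the reference does; a score of zero means the two agree.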

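The core per-example loss, written in the form that is minimized, is: $\ell_{\lambda}(\boldsymbol\psi, \mathbf{s}) = \sum_{i,j:\,\psi_i>\psi_j} \Delta_{i,j}\log\left(1 + e^{-(s_i - s_j)}\right) = -\sum_{i,j:\,\psi_i>\psi_j} \Delta_{i,j}\log\sigma(s_i - s_j)$, where $\sigma(z) = 1/(1+e^{-z})$, with Lambda weight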

$\Delta_{i,j} = |G(\psi_i) - G(\psi_j)| \cdot \left| D(\tau(i))^{-1} - D(\tau(j))^{-1} \right|$

where

$G(\psi_i)$ (gain function), commonly $G(\psi_i) = 2^{\psi_i} - 1$
$D(\tau(i))$ (discount function), typically $D(\tau(i)) = \log(1 + \tau(i))$
$\tau(i)$ is the predicted rank of item $i$ under model scores $\mathbf{s}$ (sorted descending).

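The full objective averages this loss over all prompt–response lists in the dataset $\mathcal{D}$: $\mathcal{L}(\theta) = \mathbb{E}_{(x,\mathbf{y},\boldsymbol\psi)\sim\mathcal{D}}\left[\ell_{\lambda}(\boldsymbol\psi, \mathbf{s})\right]$. This design leverages all $\binom{K}{2}$ response pairs and adapts their contribution via LambdaLoss scaling.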
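The per-example loss can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses the gain $2^{\psi}-1$, a natural-log discount $\log(1+\tau)$, 1-based ranks, and the minimized form $\sum \Delta_{i,j}\log(1+e^{-(s_i-s_j)})$. All function names are hypothetical.

```python
import math

def gain(psi):
    # Gain G(psi) = 2^psi - 1.
    return 2.0 ** psi - 1.0

def discount(rank):
    # Discount D(tau) = log(1 + tau); natural log is an assumption here.
    return math.log(1.0 + rank)

def predicted_ranks(scores):
    """1-based rank of each item when scores are sorted descending."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks

def lambda_weight(psi, ranks, i, j):
    # Delta_ij = |G_i - G_j| * |1/D(tau_i) - 1/D(tau_j)|
    gain_gap = abs(gain(psi[i]) - gain(psi[j]))
    rank_gap = abs(1.0 / discount(ranks[i]) - 1.0 / discount(ranks[j]))
    return gain_gap * rank_gap

def lipo_lambda_loss(psi, scores):
    """Weighted pairwise logistic loss over all pairs with psi_i > psi_j."""
    ranks = predicted_ranks(scores)
    total = 0.0
    for i in range(len(psi)):
        for j in range(len(psi)):
            if psi[i] > psi[j]:
                margin = scores[i] - scores[j]
                total += lambda_weight(psi, ranks, i, j) * math.log1p(math.exp(-margin))
    return total

psi = [1.0, 0.5, 0.0]              # label scores, best response first
aligned = [2.0, 1.0, 0.0]          # model scores that agree with the labels
misordered = [0.0, 1.0, 2.0]       # model scores that reverse the labels
print(lipo_lambda_loss(psi, aligned), lipo_lambda_loss(psi, misordered))
```

Scores that agree with the label ordering give the smaller loss, and the pair with the largest label gap and rank gap dominates the sum.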

2. LambdaLoss Weighting: Listwise and Label Sensitivity

The “λ” in LiPO-λ specifically refers to the LambdaLoss weighting scheme. Every pair $(i,j)$ with $\psi_i > \psi_j$ is weighted not uniformly, but by $\Delta_{i,j}$, incorporating:

Gain sensitivity: $|G(\psi_i) - G(\psi_j)|$ incorporates the magnitude of preference between responses, in contrast to merely using their ordering.
Listwise sensitivity: $\left|D(\tau(i))^{-1} - D(\tau(j))^{-1}\right|$ introduces dependence on the full ranking of items as predicted by the current model policy.

Omitting these weights ($\Delta_{i,j} = 1$) reduces the objective to the plain pairwise logistic (Bradley–Terry) loss. With $K = 2$, this yields the DPO$_{\rm BT}$ loss. The label- and listwise-sensitivity are essential for exploiting the structure of $K$-way preference lists and for more faithful alignment to ranking metrics such as discounted cumulative gain (DCG).
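The reduction described above can be checked numerically. The sketch below, with hypothetical function names and made-up log-probabilities, compares a unit-weight pairwise logistic term for $K=2$ against the familiar DPO expression $-\log\sigma(\beta[\cdot])$ written directly in terms of log-probability ratios.

```python
import math

def pairwise_logistic(s_w, s_l, weight=1.0):
    # One term of the weighted pairwise logistic loss for a (winner, loser) pair.
    return weight * math.log1p(math.exp(-(s_w - s_l)))

def dpo_bt_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    # -log sigmoid(beta * [(log-ratio of winner) - (log-ratio of loser)])
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

beta = 0.5
logp_w, logp_l = -10.0, -11.0          # hypothetical policy log-probs
ref_logp_w, ref_logp_l = -10.5, -10.0  # hypothetical reference log-probs

s_w = beta * (logp_w - ref_logp_w)
s_l = beta * (logp_l - ref_logp_l)
unit_weight_term = pairwise_logistic(s_w, s_l, weight=1.0)
dpo_term = dpo_bt_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)
print(unit_weight_term, dpo_term)
```

The two values agree up to floating-point error, which is the sense in which the unit-weight, two-response case recovers the Bradley–Terry form.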

3. Gradient Computation and Optimization

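LiPO-λ's pairwise logistic kernel yields nearly closed-form gradients with respect to model scores, provided the weights $\Delta_{i,j}$ are treated as constants (they depend on the scores only through the non-differentiable ranks $\tau$). For each pair $(i,j)$ with $\psi_i > \psi_j$, the minimized term $\ell_{i,j} = \Delta_{i,j}\log\left(1 + e^{-(s_i - s_j)}\right)$ contributes

$\frac{\partial \ell_{i,j}}{\partial s_i} = -\Delta_{i,j}\,\sigma(s_j - s_i), \qquad \frac{\partial \ell_{i,j}}{\partial s_j} = +\Delta_{i,j}\,\sigma(s_j - s_i)$

where $\sigma(z) = 1/(1+e^{-z})$ is the logistic sigmoid.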

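The policy parameter gradients follow by the chain rule: $\nabla_\theta \ell_\lambda = \sum_{i=1}^{K} \frac{\partial \ell_\lambda}{\partial s_i}\,\nabla_\theta s_i = \beta \sum_{i=1}^{K} \frac{\partial \ell_\lambda}{\partial s_i}\,\nabla_\theta \log \pi_\theta(y_i\mid x)$. Optimization proceeds via stochastic gradient descent or an adaptive variant (e.g., Adam or Adafactor), with policy updates of the form $\theta \leftarrow \theta - \eta\,\nabla_\theta \mathcal{L}(\theta)$, where $\mathcal{L}(\theta)$ is the dataset-averaged loss and $\eta$ is the learning rate.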
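A quick way to sanity-check the score gradients is a finite-difference comparison. The sketch below assumes the weights $\Delta_{i,j}$ are held fixed (passed in as a hypothetical constant matrix `delta`) and that the minimized pair term is $\Delta_{i,j}\log(1+e^{-(s_i-s_j)})$; the function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def weighted_loss(psi, scores, delta):
    # Sum of delta_ij * log(1 + exp(-(s_i - s_j))) over pairs with psi_i > psi_j.
    k = len(psi)
    return sum(delta[i][j] * math.log1p(math.exp(-(scores[i] - scores[j])))
               for i in range(k) for j in range(k) if psi[i] > psi[j])

def score_gradient(psi, scores, delta):
    # Analytic gradient: each pair pushes s_i up and s_j down by the same amount.
    grad = [0.0] * len(scores)
    for i in range(len(psi)):
        for j in range(len(psi)):
            if psi[i] > psi[j]:
                pull = delta[i][j] * sigmoid(scores[j] - scores[i])
                grad[i] -= pull
                grad[j] += pull
    return grad

def numeric_gradient(psi, scores, delta, eps=1e-6):
    # Central finite differences, one coordinate at a time.
    grad = []
    for m in range(len(scores)):
        up = list(scores); up[m] += eps
        down = list(scores); down[m] -= eps
        grad.append((weighted_loss(psi, up, delta) - weighted_loss(psi, down, delta)) / (2 * eps))
    return grad

psi = [1.0, 0.5, 0.0]
scores = [0.2, 0.7, -0.1]
delta = [[0.0, 0.5, 1.0], [0.5, 0.0, 0.3], [1.0, 0.3, 0.0]]  # hypothetical fixed weights
analytic = score_gradient(psi, scores, delta)
numeric = numeric_gradient(psi, scores, delta)
print(analytic, numeric)
```

Because every pair contributes equal and opposite amounts to its two scores, the analytic gradient sums to zero across the list.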

4. Comparative Analysis with DPO and SLiC

LiPO-λ subsumes earlier objectives as limiting cases:

DPO$_{\rm BT}$: Set $K = 2$, $\Delta_{i,j} = 1$ to recover a pairwise logistic loss.
SLiC$_{\rm norm}$: For $K = 2$ and a hinge kernel, recovers normalized hinge loss.
DPO$_{\rm PL}$ (Plackett–Luce list-MLE): Optimizes

$\ell_{\rm PL}(\boldsymbol\psi, \mathbf{s}) = -\log \prod_{k=1}^{K} \frac{\exp\left(s_{\rho(k)}\right)}{\sum_{j=k}^{K} \exp\left(s_{\rho(j)}\right)}$

where $\rho(k)$ is the index of the response ranked $k$-th by the labels $\boldsymbol\psi$, but only respects the static label permutation, not label magnitudes or predicted permutation.
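To make the last point concrete, here is a minimal sketch of a Plackett–Luce list-MLE loss. The function name `list_mle_loss` is hypothetical, and the example labels are made up; the loss uses the labels only to derive an ordering.

```python
import math

def list_mle_loss(psi, scores):
    """Negative Plackett-Luce log-likelihood of the label-induced permutation."""
    order = sorted(range(len(psi)), key=lambda i: -psi[i])  # best label first
    loss = 0.0
    for k in range(len(order)):
        tail = [scores[i] for i in order[k:]]
        peak = max(tail)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = peak + math.log(sum(math.exp(s - peak) for s in tail))
        loss -= scores[order[k]] - log_denominator
    return loss

scores = [0.4, -0.2, 0.1]
wide_labels = [3.0, 2.0, 1.0]      # large gaps between labels
narrow_labels = [0.9, 0.2, 0.1]    # same ordering, very different gaps
print(list_mle_loss(wide_labels, scores), list_mle_loss(narrow_labels, scores))
```

Both label vectors produce the same loss because only their ordering enters, whereas a gain-weighted objective would treat them differently.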

By contrast, LiPO-λ:

Uses all $\binom{K}{2}$ pairs with non-uniform, listwise $\Delta_{i,j}$.
Encodes both label-magnitude and predicted-rank (permutation) sensitivity.
Retains a smooth logistic kernel, in contrast to the listwise hinge in SLiC.

The following table summarizes these distinctions:

Method	Pairwise/Listwise	Label Sensitivity	Kernel Type
DPO$_{\rm BT}$	Pairwise ($K = 2$)	No	Logistic (BT)
SLiC$_{\rm norm}$	Pairwise ($K = 2$)	No	Hinge
DPO$_{\rm PL}$	Listwise	Ordering only	List-MLE (PL)
LiPO-λ	Listwise ($K \ge 2$)	Yes (magnitude)	Logistic (RankNet)
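The kernel column can be illustrated by evaluating both kernels on the same score margin $s_i - s_j$. This sketch assumes a hinge of the form $\max(0, \delta - \text{margin})$ with $\delta = 1$, which is one common way to write a normalized hinge; the function names are hypothetical.

```python
import math

def logistic_kernel(margin):
    # Smooth: strictly positive with a nonzero slope at every margin.
    return math.log1p(math.exp(-margin))

def hinge_kernel(margin, delta=1.0):
    # Piecewise linear: exactly zero once the margin exceeds delta (assumed 1.0).
    return max(0.0, delta - margin)

for margin in (-1.0, 0.0, 0.5, 2.0):
    print(margin, logistic_kernel(margin), hinge_kernel(margin))
```

Once a pair is separated by more than the hinge threshold it stops contributing any gradient under the hinge, while the logistic kernel keeps a small, smoothly decaying signal.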

5. Empirical Performance

LiPO-λ was evaluated on two public LM alignment tasks:

Reddit TL;DR summarization
AnthropicHH dialogue

Models were fine-tuned from a T5-large (770M) SFT baseline. For each prompt, $K$ response candidates were sampled (temperature sampling with top-$k$ truncation), and all $\binom{K}{2}$ pairs were labeled using a T5-XXL reward model. Training was conducted with a fixed batch size, learning rate, and $\beta$.
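The text above does not spell out how pairwise reward-model judgments become the per-response scores $\psi_i$. One plausible aggregation, shown here purely as an assumption, is each response's average win probability against the other candidates. The function name and the win-probability matrix are hypothetical.

```python
def scores_from_win_matrix(win_prob):
    """psi_i = mean over j != i of P(response i beats response j). Assumed scheme."""
    k = len(win_prob)
    if k < 2:
        raise ValueError("need at least two responses")
    return [sum(win_prob[i][j] for j in range(k) if j != i) / (k - 1)
            for i in range(k)]

# Hypothetical pairwise win probabilities for K = 3 sampled responses;
# win_prob[i][j] is the judged probability that response i beats response j.
win_prob = [
    [0.5, 0.8, 0.9],
    [0.2, 0.5, 0.6],
    [0.1, 0.4, 0.5],
]
psi = scores_from_win_matrix(win_prob)
print(psi)
```

Scores built this way lie in $[0, 1]$ and preserve how decisively one response beats another, which is the information the gain term consumes.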

Automatic evaluation against the reward model (“proxy reward”) and via PaLM 2–IT side-by-side (“AutoSxS”) was complemented by human side-by-side and pointwise quality assessments.

Reported Automatic Metrics (Proxy Reward and AutoSxS, Table 1; T5-large policy)

Method	TL;DR (Proxy)	HH (Proxy)	TL;DR (AutoSxS)	HH (AutoSxS)
DPO$_{\rm BT}$	88.52%	91.11%	67.09%	44.80%
DPO$_{\rm PL}$	88.27%	90.61%	67.23%	43.25%
LiPO-λ	90.60%	92.60%	68.06%	47.90%

With a T5-XXL policy, LiPO-λ led by approximately 1 percentage point on both metrics (Table 2).

Human Side-by-Side (Table 3)

For TL;DR, LiPO-λ was preferred 40% of the time (compared to 19%/16% for baselines); for HH, LiPO-λ attained 27% preference (20%/20% for baselines), with higher mean pointwise quality ratings.

This suggests that listwise and label-sensitive objectives enable more effective use of listwise preference feedback, particularly as $K$ increases.

6. Implementation and Training Schema

The LiPO-λ pipeline follows the procedure outlined in Algorithm 1: for each batch of prompts with ranked response lists, compute the scores $s_i$ under the current and reference policies, form the Lambda weights $\Delta_{i,j}$ from the labels and predicted ranks, and take a gradient step on the weighted pairwise logistic loss.

Key hyperparameters are: optimizer (Adafactor or Adam), learning rate $\eta$, batch size, $\beta$, list size $K$, sampling temperature, and top-$k$.
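The training loop can be sketched on a toy problem. Everything here is an illustrative assumption rather than the procedure from the paper: the "policy" is a softmax over just three candidate responses with logits `theta`, the reference policy is uniform, the weights $\Delta_{i,j}$ are recomputed each step and treated as constants, and plain gradient descent stands in for Adam or Adafactor.

```python
import math

def softmax(theta):
    peak = max(theta)
    exps = [math.exp(t - peak) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def ranks_desc(scores):
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0] * len(scores)
    for position, i in enumerate(order, start=1):
        ranks[i] = position
    return ranks

def lambda_weights(psi, scores):
    # Delta_ij = |G_i - G_j| * |1/D(tau_i) - 1/D(tau_j)| with G = 2^psi - 1, D = log(1 + tau).
    ranks = ranks_desc(scores)
    gains = [2.0 ** p - 1.0 for p in psi]
    inv_disc = [1.0 / math.log(1.0 + r) for r in ranks]
    k = len(psi)
    return [[abs(gains[i] - gains[j]) * abs(inv_disc[i] - inv_disc[j])
             for j in range(k)] for i in range(k)]

def train_step(theta, ref_logp, psi, beta, lr):
    k = len(psi)
    pi = softmax(theta)
    scores = [beta * (math.log(p) - r) for p, r in zip(pi, ref_logp)]
    delta = lambda_weights(psi, scores)  # held constant for this step
    loss, g = 0.0, [0.0] * k             # g[i] = d loss / d s_i
    for i in range(k):
        for j in range(k):
            if psi[i] > psi[j]:
                margin = scores[i] - scores[j]
                loss += delta[i][j] * math.log1p(math.exp(-margin))
                pull = delta[i][j] / (1.0 + math.exp(margin))  # Delta * sigmoid(s_j - s_i)
                g[i] -= pull
                g[j] += pull
    # Chain rule: d s_i / d theta_m = beta * (1[i == m] - pi_m) for a softmax policy.
    grad_theta = [beta * sum(g[i] * ((1.0 if i == m else 0.0) - pi[m]) for i in range(k))
                  for m in range(k)]
    new_theta = [t - lr * gt for t, gt in zip(theta, grad_theta)]
    return new_theta, loss, scores

psi = [1.0, 0.6, 0.1]                    # hypothetical label scores
ref_logp = [math.log(1.0 / 3.0)] * 3     # uniform reference policy
theta = [-1.0, 0.0, 1.0]                 # start by preferring the worst response
for step in range(200):
    theta, loss, scores = train_step(theta, ref_logp, psi, beta=1.0, lr=0.5)
print(loss, scores)
```

In a real setup the scores would come from sequence log-probabilities of a language model and the gradient from automatic differentiation; the toy version only shows how labels, predicted ranks, and the logistic kernel combine in one update.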

7. Significance and Utilization

LiPO-λ augments standard pairwise preference optimization with listwise-aware LambdaLoss weights, enabling smooth optimization and effective learning from $K$-way preference lists. It provides a principled framework for mapping LM alignment to LTR objectives, formally subsumes important special cases (DPO, SLiC), and empirically delivers consistent gains across alignment benchmarks (Liu et al., 2024). A plausible implication is that as ranked preference data becomes more prevalent, listwise objectives such as LiPO-λ represent a robust methodological direction for preference-based LM alignment.


References

Liu et al. (2024). LiPO: Listwise Preference Optimization through Learning-to-Rank.
