
ListMLE Loss for Ranking Models

Updated 12 March 2026
  • ListMLE Loss is a listwise surrogate loss that uses the Plackett–Luce model to optimize full ranking permutations.
  • It is mathematically principled and convex for linear scoring functions, and is widely applied to search result ranking and recommendation systems.
  • Extensions such as weighted and margin-based variants enhance performance by addressing application-specific metrics and ranking challenges.

ListMLE Loss is a listwise surrogate loss for learning-to-rank, built on the Plackett–Luce probability model for permutations. Its principal application is to train machine learning models to predict rankings over items in a list (e.g., search results, sentences, portfolio assets), directly optimizing the likelihood of ground-truth permutations observed in training data. The formulation is mathematically grounded, convex (for linear models), and provides a smooth, differentiable proxy for sorting-based metrics such as NDCG, making it a widely used listwise loss in state-of-the-art ranking systems (Xia et al., 2019, Jain et al., 2017).

1. Mathematical Foundations: Plackett–Luce Model and ListMLE Loss

Given a list of $n$ items indexed by $\{1,\dots,n\}$, with each item $i$ assigned a real-valued score $s_i$ (typically $s_i = f(x_i)$ for an input feature vector $x_i$ and learned function $f$), the Plackett–Luce (PL) model defines a probability distribution over permutations $y$ of these $n$ items.

For a ground-truth permutation $y$, the PL probability is

$$P(y \mid s) = \prod_{i=1}^{n} \frac{\psi(s_{y(i)})}{\sum_{k=i}^{n} \psi(s_{y(k)})},$$

where $\psi(\cdot)$ is a strictly positive transformation, e.g., $\psi(s) = e^s$ (Zhang et al., 2021, Xia et al., 2019). The ListMLE loss is the negative log-likelihood of this probability:

$$L_{\mathrm{ListMLE}}(f; x, y) = -\log P(y \mid x; f) = -\sum_{i=1}^{n} \bigg[\log \psi\big(f(x_{y(i)})\big) - \log \sum_{k=i}^{n} \psi\big(f(x_{y(k)})\big)\bigg].$$

This construction ranks the item assigned to position $y(1)$ highest and proceeds recursively down the list, always normalizing over the remaining unranked items.

The loss is "listwise" because it is defined over full permutations rather than pairs or individual items, and is a smooth surrogate for permutation-level accuracy (Jain et al., 2017).
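As a concrete illustration, the recursive decomposition above maps directly onto a few lines of tensor code. The following is a minimal PyTorch sketch (not from the cited papers), assuming $\psi(s) = e^s$; the function name `listmle_loss` and the example data are illustrative.

```python
import torch

def listmle_loss(scores: torch.Tensor, perm: torch.Tensor) -> torch.Tensor:
    """Negative PL log-likelihood -log P(y|s) for a single list.

    scores: (n,) model scores s_i = f(x_i)
    perm:   (n,) ground-truth permutation y, best item first
    """
    # Reorder scores into the ground-truth ranking y(1), ..., y(n).
    s = scores[perm]
    # log sum_{k=i}^{n} exp(s_{y(k)}) for each position i, computed as a
    # reversed cumulative log-sum-exp (numerically stable suffix sums).
    suffix_lse = torch.logcumsumexp(s.flip(0), dim=0).flip(0)
    # -log P(y|s) = sum_i [log denominator_i - s_{y(i)}]
    return (suffix_lse - s).sum()

# Example: five items whose ground-truth order is by descending relevance.
relevance = torch.tensor([3.0, 1.0, 2.0, 0.0, 2.5])
perm = torch.argsort(relevance, descending=True)
scores = torch.randn(5, requires_grad=True)
listmle_loss(scores, perm).backward()  # gradients flow to the scorer
```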

2. Properties and Theoretical Guarantees

ListMLE defines a proper probability distribution over permutations via the Plackett–Luce model and is thus a principled maximum-likelihood estimator for ranking tasks (Xia et al., 2019). For linear scoring functions, the negative log-likelihood is convex, ensuring a unique global optimum under $\ell_2$ regularization (Xia et al., 2019).

The original ListMLE is shift-invariant under $f \mapsto f + c$ if and only if $\psi(s) = e^s$ (Zhang et al., 2021). For other choices (e.g., sigmoid), this invariance does not hold. Extensions like the generalized ListFold loss achieve shift invariance for arbitrary $\psi$ by defining the loss in terms of score differences.
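This invariance is easy to verify numerically. Below is a small illustrative sketch (not from the cited papers) comparing $\psi = \exp$ against $\psi = \mathrm{sigmoid}$:

```python
import torch

def listmle_psi(scores: torch.Tensor, psi) -> torch.Tensor:
    # ListMLE NLL for a generic positive transform psi; scores are assumed
    # already in ground-truth order (illustrative; unstable for large scores).
    vals = psi(scores)
    suffix = vals.flip(0).cumsum(0).flip(0)  # sum_{k >= i} psi(s_k)
    return (suffix.log() - vals.log()).sum()

s = torch.randn(5)
for name, psi in [("exp", torch.exp), ("sigmoid", torch.sigmoid)]:
    base, shifted = listmle_psi(s, psi), listmle_psi(s + 3.0, psi)
    print(name, torch.isclose(base, shifted, atol=1e-5).item())
# exp -> True (shift-invariant); sigmoid -> False
```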

Under mild conditions, maximizing the PL likelihood is consistent with optimizing top-K ranking quality measures (e.g., NDCG) (Xia et al., 2019).

3. Algorithmic Implementation and Optimization

For each query, training consists of scoring all candidate items, building the PL probability as above, and minimizing the sum of losses across queries. Efficient computation leverages suffix sums and vectorized log-sum-exp operations; per-query cost is $O(n)$ with a proper implementation (Xia et al., 2019). Gradient computation with respect to the scores $s_i$ involves accumulating softmax weights over every suffix that still contains item $i$:

$$\frac{\partial L}{\partial s_u} = -1 + \sum_{j=1}^{p} \frac{e^{s_u}}{\sum_{k=j}^{n} e^{s_{y(k)}}},$$

where $u$ appears in position $p$ (Jain et al., 2017, Xia et al., 2019).
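The formula can be checked against automatic differentiation. A small sketch, assuming $\psi = \exp$ and, for simplicity, the identity permutation:

```python
import torch

n = 4
scores = torch.randn(n, requires_grad=True)

# ListMLE loss under the identity permutation y(i) = i.
suffix_lse = torch.logcumsumexp(scores.flip(0), dim=0).flip(0)
(suffix_lse - scores).sum().backward()

# Closed form: dL/ds_u = -1 + sum_{j=1}^{p} e^{s_u} / sum_{k=j}^{n} e^{s_{y(k)}},
# with p the 1-based position of item u; under the identity permutation
# the inner sum runs over j = 0, ..., u in 0-indexed terms.
s = scores.detach()
manual = torch.stack([
    -1 + (s[u].exp() / torch.stack([s[j:].exp().sum() for j in range(u + 1)])).sum()
    for u in range(n)
])
print(torch.allclose(scores.grad, manual, atol=1e-5))  # True
```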

Training proceeds via gradient-based optimizers (SGD, Adam). For boosting variants (e.g., PLRank), the negative gradient with respect to each item's score is used as the pseudo-response (Xia et al., 2019).
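In the boosting setting this amounts to differentiating once and negating; a schematic sketch (PLRank's actual implementation details are in Xia et al., 2019):

```python
import torch

# Boosting-style pseudo-responses: the negated per-item gradient of the
# ListMLE loss at the current scores (schematic; identity permutation).
scores = torch.randn(6, requires_grad=True)
suffix_lse = torch.logcumsumexp(scores.flip(0), dim=0).flip(0)
(suffix_lse - scores).sum().backward()
pseudo_response = -scores.grad  # targets for fitting the next weak learner
```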

4. Variants and Extensions

Several extensions to the canonical ListMLE exist:

  • Weighted ListMLE: Each log-likelihood term is weighted according to an external score (e.g., engagement, relevance gain). In "Rank-to-engage" (Jain et al., 2017), $L_{\mathrm{wListMLE}}(g) = -\sum_{i=1}^N s^{(i)} \log P(\pi^{(i)} \mid \mathbf{X}^{(i)}; g)$, with $s^{(i)}$ an observed engagement metric. In relative depth estimation, weights combine a gain function $G(s_{y(i)}) = 2^{s_{y(i)}} - 1$ and a position-dependent discount $D(i) = 1/\log(i+1)$ (Mertan et al., 2020); see the sketch after this list.
  • Margin-based ListMLE: Augments the original loss with margin penalties. For each selection step, additional log terms penalize any incorrect pick whose probability exceeds a margin $\gamma$. The margin variant is

$$\tilde f_i(j) = \log F_{i,j}(j) + \sum_{k=j+1}^{n_i-1} \log\big(\gamma - F_{i,j}(k)\big)$$

with

$$F_{i,j}(k) = \frac{e^{z_{i, o^*_{i,k}}}}{\sum_{l=j}^{n_i} e^{z_{i, o^*_{i,l}}}}$$

for a BERT-based model score $z_{i,k}$ (Zhu et al., 2021).

  • Generalized Listwise Losses: Extensions such as the ListFold loss in long-short portfolio construction pick pairs instead of single items at each permutation stage, generalizing the PL model (Zhang et al., 2021). This recovers ListMLE in the single-item case.
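The weighted variant referenced above can be layered onto the basic loss as follows. This is a minimal PyTorch sketch, assuming the gain/discount weights multiply each per-position negative log-likelihood term and that graded relevance labels supply the gain; `weighted_listmle` and its arguments are illustrative, not the cited papers' code.

```python
import torch

def weighted_listmle(scores: torch.Tensor, perm: torch.Tensor,
                     relevance: torch.Tensor) -> torch.Tensor:
    """ListMLE with per-position gain and discount weights (sketch)."""
    s = scores[perm]          # scores in ground-truth order
    rel = relevance[perm]     # graded relevance of the item at each position
    gain = 2.0 ** rel - 1.0   # G = 2^relevance - 1
    pos = torch.arange(1, len(s) + 1, dtype=s.dtype)
    discount = 1.0 / torch.log(pos + 1.0)   # D(i) = 1 / log(i + 1)
    suffix_lse = torch.logcumsumexp(s.flip(0), dim=0).flip(0)
    # Weight each negative log-likelihood term by gain * discount, so
    # mistakes near the top of the list are penalized more heavily.
    return (gain * discount * (suffix_lse - s)).sum()
```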

5. Empirical Applications

ListMLE has been applied across information retrieval, financial modeling, sentence ordering, depth estimation, and recommendation tasks:

  • Document and Query Ranking: Used on large-scale datasets such as Yahoo LTR 2010 and Microsoft MSLR-WEB30K, where ListMLE-based models match or exceed state-of-the-art systems using only listwise information (Xia et al., 2019).
  • Sentence Ordering: In BERT4SO, ListMLE and its margin-based extension are used for neural sentence ordering. Margin-based ListMLE yields better convergence and accuracy on small datasets, with comparable performance to unmodified ListMLE on large benchmarks (Zhu et al., 2021).
  • Relative Depth Estimation: Weighted ListMLE enables pixel-wise ordering for depth from single images. Weighted formulations penalize top-of-list errors more via gain and discount weighting, outperforming pairwise ordinal losses in mean average precision metrics (Mertan et al., 2020).
  • Long-Short Portfolio Construction: In cross-sectional finance, ListMLE is generalized to pairwise Plackett–Luce losses (ListFold), enabling models to optimize both extremities of the ranking for robust long-short strategies (Zhang et al., 2021).

6. Key Limitations and Considerations

ListMLE requires ground-truth permutations (ties are typically broken at random or by enumerating consistent orderings); it does not enumerate all $n!$ possible permutations per training example, but instead evaluates only the observed permutation via the recursive PL decomposition (Zhu et al., 2021, Xia et al., 2019).

Its lack of shift-invariance (except under exponential $\psi$) may restrict some applications unless generalized losses are used (Zhang et al., 2021). In datasets where only relative or implicit preference data is observed rather than full ground-truth orderings, additional modeling of observation processes or usage of weakly supervised approaches may be necessary (Jain et al., 2017).

Computationally, ListMLE is scalable due to linear per-query complexity, efficient batching, and numerically stable implementations exploiting the log-sum-exp trick. Weighted and margin variants introduce minimal extra cost (Mertan et al., 2020, Zhu et al., 2021).

7. Comparative Table: Standard ListMLE and Selected Variants

| Variant | Weighting/Modification | Primary Application |
|---|---|---|
| Standard ListMLE | None | Generic ranking, document retrieval |
| Weighted ListMLE | Sample- or position-dependent weights | Engagement maximization, depth estimation |
| Margin-based ListMLE | Per-step margin penalty | Sentence ordering (BERT4SO) |
| Generalized Pairwise (ListFold) | Pair/pairwise-difference weights | Long-short financial strategies |

Standard ListMLE is a smooth, probabilistically principled loss, while its variants enable extra expressivity by emphasizing importance, margin, or structural properties suited to specific application needs. Each arises from the core PL likelihood, adapted for scenario-specific user objectives or constraints.


References: (Zhang et al., 2021, Zhu et al., 2021, Mertan et al., 2020, Xia et al., 2019, Jain et al., 2017)
