ListMLE Loss for Ranking Models
- ListMLE Loss is a listwise surrogate loss that uses the Plackett–Luce model to optimize full ranking permutations.
- It is mathematically principled and convex for linear models, making it effective for search result ranking and recommendation systems.
- Extensions such as weighted and margin-based variants enhance performance by addressing application-specific metrics and ranking challenges.
ListMLE Loss is a listwise surrogate loss for learning-to-rank, built on the Plackett–Luce probability model for permutations. Its principal application is to train machine learning models to predict rankings over items in a list (e.g., search results, sentences, portfolio assets), directly optimizing the likelihood of the ground-truth permutations observed in training data. The formulation is mathematically grounded, convex for linear scoring functions, and provides a smooth, differentiable proxy for sorting-based metrics such as NDCG, making it one of the most widely used listwise losses in ranking systems (Xia et al., 2019, Jain et al., 2017).
1. Mathematical Foundations: Plackett–Luce Model and ListMLE Loss
Given a list of $n$ items indexed by $i = 1, \dots, n$, with each item assigned a real-valued score $s_i = g(x_i)$ (for an input feature vector $x_i$ and learned scoring function $g$), the Plackett–Luce (PL) model defines a probability distribution over permutations of these items.
For a ground-truth permutation $\pi$, the PL probability is
$$P(\pi \mid s) = \prod_{k=1}^{n} \frac{\phi(s_{\pi(k)})}{\sum_{j=k}^{n} \phi(s_{\pi(j)})},$$
where $\phi$ is a strictly positive transformation, e.g., $\phi(s) = \exp(s)$ (Zhang et al., 2021, Xia et al., 2019). The ListMLE loss is the negative log-likelihood of this probability:
$$L_{\mathrm{ListMLE}}(g; x, \pi) = -\log P(\pi \mid s) = \sum_{k=1}^{n} \left[ \log \sum_{j=k}^{n} \phi(s_{\pi(j)}) - \log \phi(s_{\pi(k)}) \right].$$
This construction ranks the item assigned to position 1 highest, proceeding recursively down the list, always normalizing over the remaining unranked items.
The loss is "listwise" because it is defined over full permutations rather than pairs or individual items, and is a smooth surrogate for permutation-level accuracy (Jain et al., 2017).
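The recursive PL decomposition above can be sketched in a few lines of NumPy (a minimal illustration; the function name and use of NumPy are my own, not from any cited codebase):

```python
import numpy as np

def listmle_loss(scores, pi):
    """Negative PL log-likelihood of permutation pi under the given scores.

    scores : (n,) real-valued model scores s_i
    pi     : (n,) ground-truth permutation; pi[k] is the index of the item
             ranked at position k (best first)
    Uses phi = exp and the log-sum-exp trick for numerical stability.
    """
    s = np.asarray(scores, dtype=float)[np.asarray(pi)]  # scores in ranked order
    loss = 0.0
    for k in range(len(s)):
        tail = s[k:]                      # scores of items not yet ranked
        m = tail.max()
        loss += m + np.log(np.exp(tail - m).sum()) - s[k]
    return loss
```

As expected, the loss is smaller when the permutation agrees with the score ordering, and for uniform scores every permutation is equally likely, giving a loss of $\log n!$.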
2. Properties and Theoretical Guarantees
ListMLE defines a proper probability distribution over permutations via the Plackett–Luce model and is thus a principled maximum-likelihood estimator for ranking tasks (Xia et al., 2019). For linear scoring functions, the negative log-likelihood is convex, ensuring a unique global optimum under regularization (Xia et al., 2019).
The original ListMLE is shift-invariant under $s \mapsto s + c$ if and only if $\phi = \exp$ (Zhang et al., 2021). For other choices of $\phi$ (e.g., sigmoid), this invariance does not hold. Extensions like the generalized ListFold loss achieve shift invariance for arbitrary $\phi$ by defining the loss in terms of score differences.
Under mild conditions, maximizing the PL likelihood is consistent with optimizing top-K ranking quality measures (e.g., NDCG) (Xia et al., 2019).
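The shift-invariance property can be checked numerically. Below is a small illustration (the `pl_nll` helper is a minimal re-implementation for this purpose, not code from any cited work): shifting all scores by a constant leaves the $\phi = \exp$ loss unchanged but alters the sigmoid-based loss.

```python
import numpy as np

def pl_nll(scores, pi, phi=np.exp):
    """Negative PL log-likelihood for an arbitrary positive transform phi."""
    s = np.asarray(scores, dtype=float)[np.asarray(pi)]
    v = phi(s)
    suffix = np.cumsum(v[::-1])[::-1]       # suffix[k] = sum_{j >= k} phi(s_j)
    return float(np.sum(np.log(suffix) - np.log(v)))

scores, pi = np.array([0.4, -0.7, 1.3]), [2, 0, 1]
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Adding a constant to every score leaves the exp-based loss unchanged...
shift_exp = pl_nll(scores + 5.0, pi) - pl_nll(scores, pi)
# ...but changes the sigmoid-based loss.
shift_sig = pl_nll(scores + 5.0, pi, phi=sigmoid) - pl_nll(scores, pi, phi=sigmoid)
```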
3. Algorithmic Implementation and Optimization
For each query, training consists of scoring all candidate items, building the PL probability as above, and minimizing the sum of losses across queries. Efficient computation leverages suffix sums and vectorized log-sum-exp operations; with proper implementation the per-query cost is linear in the list length, $O(n)$, after scoring (Xia et al., 2019). Gradient computation with respect to the scores involves accumulating softmax weights across the suffix lists that still contain item $i$; for $\phi = \exp$,
$$\frac{\partial L}{\partial s_{\pi(k)}} = \sum_{m=1}^{k} \frac{\exp(s_{\pi(k)})}{\sum_{j=m}^{n} \exp(s_{\pi(j)})} - 1,$$
where item $\pi(k)$ appears in position $k$ of the ground-truth permutation (Jain et al., 2017, Xia et al., 2019).
Training proceeds via gradient-based optimizers (SGD, Adam). For boosting variants (e.g., PLRank), the negative gradient with respect to each item's score is used as the pseudo-response (Xia et al., 2019).
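The suffix-sum gradient accumulation above can be sketched in $O(n)$ per list as follows (a sketch assuming $\phi = \exp$; the function name is my own). The key trick is that the inner sums over $m \le k$ are cumulative sums of reciprocal suffix sums:

```python
import numpy as np

def listmle_grad(scores, pi):
    """Analytic gradient of the ListMLE loss (phi = exp) w.r.t. the scores.

    dL/ds_{pi[k]} = sum_{m <= k} [softmax weight of pi[k] over the suffix
    {pi[m], ..., pi[n-1]}] - 1.  Suffix sums give O(n) cost per list.
    """
    pi = np.asarray(pi)
    s = np.asarray(scores, dtype=float)[pi]   # scores in ranked order
    e = np.exp(s - s.max())                   # stabilized phi(s); the shift
                                              # cancels in every ratio below
    suffix = np.cumsum(e[::-1])[::-1]         # suffix[k] = sum_{j >= k} e[j]
    inv_cum = np.cumsum(1.0 / suffix)         # sum_{m <= k} 1 / suffix[m]
    g_sorted = e * inv_cum - 1.0              # gradient at ranked positions
    grad = np.empty_like(g_sorted)
    grad[pi] = g_sorted                       # scatter back to item order
    return grad
```

A useful sanity check: the gradient components sum to zero, reflecting the shift invariance of the $\phi = \exp$ loss, and the analytic values match finite differences of the negative log-likelihood.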
4. Variants and Extensions
Several extensions to the canonical ListMLE exist:
- Weighted ListMLE: Each log-likelihood term is weighted according to an external score (e.g., engagement, relevance gain). In "Rank-to-engage" (Jain et al., 2017), $L_{\mathrm{wListMLE}}(g) = -\sum_{i=1}^{N} s^{(i)} \log P(\pi^{(i)} \mid X^{(i)}; g)$, with $s^{(i)}$ an observed engagement metric. In relative depth estimation, weights combine a gain function and a position-dependent discount (Mertan et al., 2020).
- Margin-based ListMLE: Augments the original loss with margin penalties. At each selection step, additional log terms penalize any incorrect candidate whose selection probability comes within a margin $\delta$ of the ground-truth item's, encouraging clearer separation between correct and incorrect picks; the scores themselves are produced by a BERT-based model (Zhu et al., 2021).
- Generalized Listwise Losses: Extensions such as the ListFold loss in long-short portfolio construction pick pairs instead of single items at each permutation stage, generalizing the PL model (Zhang et al., 2021). This recovers ListMLE in the single-item case.
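The weighted variant is a small modification of the base loss: each list's negative log-likelihood is simply scaled by its external weight before summation. A sketch following the "Rank-to-engage" formulation (function name and the `(scores, pi)` batch layout are my own):

```python
import numpy as np

def weighted_listmle(batch, weights):
    """Weighted ListMLE over a batch of ranked lists.

    batch   : list of (scores, pi) pairs, one per query
    weights : per-list external weights s^(i), e.g. an observed
              engagement metric (Jain et al., 2017)
    Each list's PL negative log-likelihood is scaled by its weight.
    """
    total = 0.0
    for (scores, pi), w in zip(batch, weights):
        s = np.asarray(scores, dtype=float)[np.asarray(pi)]
        nll = sum(np.log(np.exp(s[k:] - s[k:].max()).sum()) + s[k:].max() - s[k]
                  for k in range(len(s)))
        total += w * nll
    return total
```

A zero weight removes a list's contribution entirely, and scaling a weight scales that list's loss proportionally, which is how engagement-style weights steer training toward high-value lists.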
5. Empirical Applications
ListMLE has been applied across information retrieval, financial modeling, sentence ordering, depth estimation, and recommendation tasks:
- Document and Query Ranking: Used on large-scale datasets such as Yahoo LTR 2010 and Microsoft 30K, where ListMLE-based models match or exceed state-of-the-art systems using only listwise information (Xia et al., 2019).
- Sentence Ordering: In BERT4SO, ListMLE and its margin-based extension are used for neural sentence ordering. Margin-based ListMLE yields better convergence and accuracy on small datasets, with comparable performance to unmodified ListMLE on large benchmarks (Zhu et al., 2021).
- Relative Depth Estimation: Weighted ListMLE enables pixel-wise ordering for depth from single images. Weighted formulations penalize top-of-list errors more via gain and discount weighting, outperforming pairwise ordinal losses in mean average precision metrics (Mertan et al., 2020).
- Long-Short Portfolio Construction: In cross-sectional finance, ListMLE is generalized to pairwise Plackett–Luce losses (ListFold), enabling models to optimize both extremities of the ranking for robust long-short strategies (Zhang et al., 2021).
6. Key Limitations and Considerations
ListMLE requires ground-truth permutations (ties are typically broken at random or by enumerating the tied orderings). It does not enumerate all possible permutations per training example; instead, it evaluates only the observed permutation via the recursive PL decomposition (Zhu et al., 2021, Xia et al., 2019).
Its lack of shift invariance (except under exponential $\phi$) may restrict some applications unless generalized losses are used (Zhang et al., 2021). In datasets where only relative or implicit preference data is observed rather than full ground-truth orderings, additional modeling of the observation process or weakly supervised approaches may be necessary (Jain et al., 2017).
Computationally, ListMLE is scalable: per-query cost is linear in list length, lists batch efficiently, and numerically stable implementations exploit the log-sum-exp trick. The weighted and margin variants introduce minimal extra cost (Mertan et al., 2020, Zhu et al., 2021).
7. Comparative Table: Standard ListMLE and Selected Variants
| Variant | Weighting/Modification | Primary Application |
|---|---|---|
| Standard ListMLE | None | Generic ranking, document retrieval |
| Weighted ListMLE | Sample or position-dependent weights | Engagement maximization, depth est. |
| Margin-based ListMLE | Per-step margin penalty | Sentence ordering (BERT4SO) |
| Generalized Pairwise (ListFold) | Pair/pairwise-difference weights | Long-short financial strategies |
Standard ListMLE is a smooth, probabilistically principled loss; its variants add expressivity by emphasizing sample importance, margins, or structural properties suited to specific applications. Each arises from the core PL likelihood, adapted to scenario-specific objectives or constraints.
References: (Zhang et al., 2021, Zhu et al., 2021, Mertan et al., 2020, Xia et al., 2019, Jain et al., 2017)