Plackett-Luce Distillation (PLD)

Updated 20 November 2025
  • The paper demonstrates that PLD transfers the full teacher ranking with confidence weighting, achieving improved Top-1 accuracy over KD and DIST approaches.
  • PLD is a knowledge distillation framework that unifies cross-entropy loss and list-wise ranking through a convex, translation-invariant surrogate loss.
  • Empirical evaluations reveal that PLD consistently enhances performance in both homogeneous and heterogeneous distillation setups without extra hyperparameter tuning.

Plackett-Luce Distillation (PLD) is a knowledge distillation framework that recasts model compression as a list-wise ranking problem under the Plackett-Luce (PL) probability model. In PLD, the compact student network is trained to replicate not only the predictive behavior but also the full confidence-weighted ranking of its larger teacher network, as determined by the teacher’s logits. PLD unifies cross-entropy and list-wise ranking, constructing a convex, translation-invariant surrogate loss that efficiently transfers the teacher’s complete output ordering and relative confidences—without requiring the standard cross-entropy/distillation mix-weight tuning. Empirical evaluations on standard image classification benchmarks demonstrate that PLD consistently improves student Top-1 accuracy over both classic Kullback–Leibler (KD) and advanced correlation-based (DIST) distillation approaches (Bassam et al., 14 Jun 2025).

1. Choice-Theoretic Foundation and the Plackett-Luce Model

The PLD framework adopts a choice-theoretic perspective grounded in the Plackett-Luce (PL) model. In PL, each class $i$ is assigned a positive "worth" $w_i = e^{s_i}$, where $s_i$ is the logit for class $i$. The probability of generating a full class ranking $\pi = (\pi_1, \ldots, \pi_C)$ is given by:

$$P_{\mathrm{PL}}(\pi \mid s) = \prod_{k=1}^{C} \frac{\exp(s_{\pi_k})}{\sum_{l=k}^{C} \exp(s_{\pi_l})}$$

This probability model is invariant to translation of logits (Luce’s Choice Axiom). In contrast, classic cross-entropy loss only enforces the true class’s dominance at the top (i.e., at step $k = 1$), ignoring the order and confidence structure of the other classes. PLD leverages the PL model to transfer the complete teacher ranking to the student.
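
As an illustrative aside (not from the paper), the PL log-probability of a ranking can be computed with a reversed cumulative log-sum-exp; the toy check below also confirms the translation invariance noted above. The function name and example values are assumptions made for this sketch.

```python
import torch

def pl_log_prob(logits: torch.Tensor, ranking: torch.Tensor) -> torch.Tensor:
    """Log-probability of a full ranking under the Plackett-Luce model.

    logits:  (C,) real-valued scores s_i
    ranking: (C,) permutation of class indices, most-preferred first
    """
    s = logits[ranking]                                    # scores in ranked order
    # log of the suffix sums sum_{l >= k} exp(s_l), computed stably
    suffix_logsumexp = torch.logcumsumexp(s.flip(0), dim=0).flip(0)
    return (s - suffix_logsumexp).sum()

# Toy check: PL probabilities are unchanged when a constant is added to all logits.
s = torch.tensor([2.0, 0.5, -1.0])
pi = torch.tensor([0, 1, 2])
print(pl_log_prob(s, pi), pl_log_prob(s + 7.0, pi))        # equal up to floating-point error
```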

2. The PLD Loss: Construction and Special Cases

PLD constructs a "teacher-optimal" permutation $\pi^*$ by placing the ground-truth label first, followed by the remaining labels sorted in descending teacher logit order:

$$\pi^* = \left(y, \; \mathrm{argsort}(t) \setminus \{y\}\right)$$
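
A short sketch of this construction for a single example (the helper name and toy values are illustrative, not from the paper):

```python
import torch

def teacher_optimal_perm(t: torch.Tensor, y: int) -> torch.Tensor:
    """Ground-truth class first, remaining classes by descending teacher logit t."""
    order = torch.argsort(t, descending=True)
    rest = order[order != y]                  # drop the true label from the sorted order
    return torch.cat([torch.tensor([y]), rest])

t = torch.tensor([1.2, 3.5, 0.3, 2.8])        # toy teacher logits, C = 4
print(teacher_optimal_perm(t, y=2))           # tensor([2, 1, 3, 0])
```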

The unweighted surrogate, ListMLE, minimizes the negative log-probability of $\pi^*$ under the student’s logits:

$$\mathcal{L}_{\mathrm{ListMLE}}(s; \pi^*) = -\sum_{k=1}^{C} \log \frac{\exp(s_{\pi^*_k})}{\sum_{l=k}^{C} \exp(s_{\pi^*_l})}$$

PLD introduces confidence weighting: each position $k$ in $\pi^*$ is weighted by the teacher’s softmax mass, $\alpha_k = q^T_{\pi^*_k}$, where

$$q^T_i = \frac{\exp(t_i/\tau_T)}{\sum_j \exp(t_j/\tau_T)}$$

The PLD loss is then defined as:

$$\boxed{\, \mathcal{L}_{\mathrm{PLD}}(s, t; y) = \sum_{k=1}^{C} \alpha_k \left[ -s_{\pi^*_k} + \log \sum_{l=k}^{C} e^{s_{\pi^*_l}} \right] \,}$$

Convexity is preserved by the non-negative weighting and the convexity of each summand. Special cases include standard cross-entropy when only $\alpha_1$ is nonzero, and ListMLE under uniform weights.
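
For example, setting $\alpha_1 = 1$ and $\alpha_k = 0$ for $k > 1$ leaves only the $k = 1$ term; since $\pi^*_1 = y$ and the inner sum at $k = 1$ runs over all $C$ classes,

$$\mathcal{L}_{\mathrm{PLD}}(s, t; y) = -s_y + \log \sum_{j=1}^{C} e^{s_j} = -\log \frac{e^{s_y}}{\sum_{j} e^{s_j}},$$

which is exactly the standard cross-entropy loss, while setting all $\alpha_k$ equal recovers $\mathcal{L}_{\mathrm{ListMLE}}$ up to a constant factor.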

3. Algorithmic Implementation

The algorithmic workflow for PLD on a minibatch of $N$ examples involves the following steps:

  1. Compute teacher logits $t \in \mathbb{R}^{N \times C}$ and student logits $s \in \mathbb{R}^{N \times C}$.
  2. For each example:
    • Compute the teacher softmax scores $q^T_{n,i}$ at temperature $\tau_T$.
    • Construct the permutation $\pi^*_n$: true label first, then the remaining classes by descending teacher logit.
    • Gather the student logits $s_{\pi^*}$.
    • Compute the suffix log-sum-exp terms $\ell_k = \log \sum_{l=k}^{C} e^{s_{\pi^*_l}}$ (a reversed cumulative log-sum-exp).
    • Compute the per-position loss $(\ell_k - s_{\pi^*_k})$ and weight it by $\alpha_k$.
  3. Sum the weighted losses over positions, average over the $N$ examples, and backpropagate.

The additional computation is dominated by sorting at $O(C \log C)$ per example. On datasets such as ImageNet ($C = 1000$), batched GPU implementations render this overhead negligible.
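
A batched PyTorch sketch of this workflow is shown below. It is an illustration under the definitions above rather than the authors' reference implementation; the function name `pld_loss` and the max-logit trick used to place the true label first are choices made here.

```python
import torch
import torch.nn.functional as F

def pld_loss(s: torch.Tensor, t: torch.Tensor, y: torch.Tensor,
             tau_t: float = 1.0) -> torch.Tensor:
    """Batched Plackett-Luce Distillation loss (illustrative sketch).

    s, t : (N, C) student and teacher logits
    y    : (N,)  integer ground-truth labels
    """
    N, C = s.shape
    t = t.detach()                                         # teacher only provides targets

    # Steps 1-2: teacher softmax weights and teacher-optimal permutation pi*.
    q = F.softmax(t / tau_t, dim=1)                        # q^T_{n,i}
    keys = t.clone()
    keys[torch.arange(N), y] = keys.max() + 1.0            # force the true label to sort first
    perm = torch.argsort(keys, dim=1, descending=True)     # (N, C), perm[:, 0] == y

    # Gather student logits and teacher weights in pi* order.
    s_perm = torch.gather(s, 1, perm)                      # s_{pi*_k}
    alpha = torch.gather(q, 1, perm)                       # alpha_k = q^T_{pi*_k}

    # Suffix log-sum-exp: l_k = log sum_{l >= k} exp(s_{pi*_l}).
    suffix_lse = torch.logcumsumexp(s_perm.flip(1), dim=1).flip(1)

    # Step 3: confidence-weighted positional losses, summed over k, averaged over N.
    return (alpha * (suffix_lse - s_perm)).sum(dim=1).mean()
```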

4. Empirical Results on Standard Benchmarks

Experiments conducted on ImageNet-1K with students (ResNet-50, ViT-Small) distilled from various teachers (ResNet, MobileNet-v4, ViT) over 100 epochs, using identical optimization and data pipelines for KD, DIST, and PLD, show the following Top-1 improvements:

  • Homogeneous student/teacher architectures: PLD improves Top-1 by an average of +0.42% over DIST (Huang et al., 2022), and +1.04% over KD (Hinton et al., 2015). For instance, in the ResNet-152→ResNet-50 case, KD achieves 76.80%, DIST 76.60%, and PLD 77.30%.
  • Heterogeneous distillation setups: PLD yields gains of +0.48% vs DIST and +1.09% vs KD on average. The largest single gain vs KD is +1.55% for MobileNet-v4 Conv-Large, and +0.70% vs DIST for ResNet-152.
  • Extended training (100→300 epochs): PLD retains its empirical gains over KD and DIST, with margins in the long schedule similar to, or slightly exceeding, those observed in the short schedule (Bassam et al., 14 Jun 2025).

5. Comparison with Previous Distillation Methods

| Aspect | KD (Hinton et al., 2015) | DIST (Huang et al., 2022) | PLD (Bassam et al., 14 Jun 2025) |
|---|---|---|---|
| Loss Structure | CE + distillation (KL divergence) | CE + Pearson correlation on logits | Single list-wise ranking loss |
| Tuning Required | Requires $\alpha$, $\tau$ | Requires a CE-vs-distillation mix-weight | No auxiliary mix-weight |
| Rank Transfer | Matches marginal probabilities | Matches intra-/inter-class correlations | Transfers teacher’s full ranking |
| Convexity | Yes (per term) | Yes (correlation/CE) | Yes (full loss) |
| Translation-Invariant | Yes for CE, by construction | Yes (correlation on logits) | Yes (PL property) |
| Hyperparameters | $\alpha$, $\tau$ | CE-vs-correlation weight | Single temperature $\tau_T$ |

PLD enforces the true label’s dominance (subsuming cross-entropy) and additionally injects the full teacher ordering, confidence-weighted at each rank, through a single convex, translation-invariant term. Unlike prior methods, PLD requires no auxiliary mixing weight and is robust to the choice of softmax temperature within $[0.5, 1.5]$.

6. Practical Considerations and Extensions

Optimal performance is observed with teacher-softmax temperature $\tau_T \approx 1.0$, with little sensitivity across $[0.5, 1.5]$. PLD does not introduce a cross-entropy mixing coefficient, has manageable computational overhead, and is compatible with standard optimizers (e.g., LAMB, AdamW, Adan). The framework allows for further extension:

  • Curriculum weighting: Annealing $\alpha_k$ across epochs to emphasize either the top-1 label or the full ranking as training progresses (a simple schedule is sketched after this list).
  • Label set adaptation: For tasks with mismatched or partial label sets, restrict the PL permutation to shared classes.
  • Task generalization: The PLD framework applies to structured output domains, such as sequence modeling or reinforcement learning, wherever a PL loss is appropriate.
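
As a purely hypothetical illustration of the curriculum-weighting idea (the source does not prescribe a specific schedule), the position weights $\alpha_k$ could be annealed linearly from a top-1-only weighting toward the full teacher-softmax weighting:

```python
import torch

def annealed_alpha(alpha: torch.Tensor, epoch: int, total_epochs: int) -> torch.Tensor:
    """Hypothetical curriculum schedule for the PLD position weights.

    alpha : (N, C) teacher softmax mass gathered in pi* order (alpha_k above)
    Early epochs emphasize the true label (cross-entropy-like); later epochs
    move toward the full confidence-weighted ranking.
    """
    lam = min(epoch / max(total_epochs - 1, 1), 1.0)       # anneal 0 -> 1 over training
    top1_only = torch.zeros_like(alpha)
    top1_only[:, 0] = 1.0                                  # all mass on position k = 1 (true label)
    return (1.0 - lam) * top1_only + lam * alpha
```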

7. Context and Significance

Plackett-Luce Distillation provides a theoretically principled, efficient, and empirically validated mechanism for transferring the entirety of a teacher network’s predictive knowledge to a student. By framing distillation as a list-wise, confidence-weighted ranking task, PLD systematically leverages more nuanced teacher information than both marginal probability-matching and correlation-based approaches. The convexity, translation invariance, and lack of hyperparameter tuning requirements facilitate both practical adoption and theoretical analysis. These properties position PLD as a general-purpose distillation objective, especially relevant in domains where comprehensive teacher guidance yields tangible improvements over marginal or pairwise matching strategies (Bassam et al., 14 Jun 2025).
