Plackett-Luce Distillation (PLD)
- The paper demonstrates that PLD transfers the full teacher ranking with confidence weighting, achieving improved Top-1 accuracy over KD and DIST approaches.
- PLD is a knowledge distillation framework that unifies cross-entropy loss and list-wise ranking through a convex, translation-invariant surrogate loss.
- Empirical evaluations reveal that PLD consistently enhances performance in both homogeneous and heterogeneous distillation setups without extra hyperparameter tuning.
Plackett-Luce Distillation (PLD) is a knowledge distillation framework that recasts model compression as a list-wise ranking problem under the Plackett-Luce (PL) probability model. In PLD, the compact student network is trained to replicate not only the predictive behavior but also the full confidence-weighted ranking of its larger teacher network, as determined by the teacher’s logits. PLD unifies cross-entropy and list-wise ranking, constructing a convex, translation-invariant surrogate loss that efficiently transfers the teacher’s complete output ordering and relative confidences—without requiring the standard cross-entropy/distillation mix-weight tuning. Empirical evaluations on standard image classification benchmarks demonstrate that PLD consistently improves student Top-1 accuracy over both classic Kullback–Leibler (KD) and advanced correlation-based (DIST) distillation approaches (Bassam et al., 14 Jun 2025).
1. Choice-Theoretic Foundation and the Plackett-Luce Model
The PLD framework adopts a choice-theoretic perspective grounded in the Plackett-Luce (PL) model. In PL, each class $c$ is assigned a positive "worth" $w_c = \exp(z_c)$, where $z_c$ is the logit for class $c$. The probability of generating a full class ranking $\pi = (\pi(1), \dots, \pi(C))$ is given by:

$$P(\pi \mid z) \;=\; \prod_{k=1}^{C} \frac{\exp\!\big(z_{\pi(k)}\big)}{\sum_{j=k}^{C} \exp\!\big(z_{\pi(j)}\big)}.$$
This probability model is invariant to translation of the logits (Luce’s Choice Axiom). In contrast, classic cross-entropy loss only enforces the true class’s dominance at the top (i.e., at step $k = 1$), ignoring the order and confidence structure of the other classes. PLD leverages the PL model to transfer the complete teacher ranking to the student.
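For concreteness, the factorization above can be evaluated directly from a logit vector. The following toy example (the three-class logits and the ranking are illustrative, not taken from the paper) sketches this in Python:

```python
import math

def pl_ranking_probability(logits, ranking):
    """Probability of a full ranking under the Plackett-Luce model,
    with per-class worths w_c = exp(z_c)."""
    prob = 1.0
    remaining = list(ranking)
    for cls in ranking:
        num = math.exp(logits[cls])
        den = sum(math.exp(logits[c]) for c in remaining)
        prob *= num / den          # class `cls` is chosen from the remaining pool
        remaining.remove(cls)      # and then removed from it
    return prob

# Three hypothetical classes with logits z = (2.0, 0.5, -1.0):
print(pl_ranking_probability([2.0, 0.5, -1.0], [0, 1, 2]))  # ~0.64
```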
2. The PLD Loss: Construction and Special Cases
PLD constructs a "teacher-optimal" permutation by placing the ground-truth label first, followed by the remaining labels sorted in descending teacher logit order:
The unweighted surrogate, ListMLE, minimizes the negative log-probability of under the student’s logits:
PLD introduces confidence-weighting: each position in is weighted by the teacher’s softmax mass, , where
The PLD loss is then defined as:
Convexity is preserved by the non-negative weighting and the convexity of each summand. Special cases include the reduction to standard cross-entropy when only is nonzero and ListMLE under uniform weights.
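To make the cross-entropy special case explicit (a one-line check using the definitions above): with $w_1 = 1$ and $w_{k>1} = 0$,

$$\left.\mathcal{L}_{\mathrm{PLD}}\right|_{w_1 = 1,\; w_{k>1} = 0} \;=\; -\log \frac{\exp\!\big(z^{s}_{y}\big)}{\sum_{j=1}^{C} \exp\!\big(z^{s}_{j}\big)} \;=\; \mathcal{L}_{\mathrm{CE}}\big(z^{s}, y\big),$$

since $\pi^{\star}(1) = y$ and the $k = 1$ denominator runs over all $C$ classes; with uniform weights $w_k \equiv 1$, the sum over all positions is exactly the ListMLE objective.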
3. Algorithmic Implementation
The algorithmic workflow for PLD on a minibatch of $B$ examples involves the following steps:
- Compute teacher logits $z^{t}$ and student logits $z^{s}$.
- For each example:
  - Compute teacher softmax scores $p^{t} = \operatorname{softmax}\!\big(z^{t}/\tau\big)$.
  - Construct the permutation $\pi^{\star}$: true label first, then the remaining classes by descending teacher logit.
  - Gather the student logits $z^{s}_{\pi^{\star}(k)}$ in permutation order.
  - Compute the log-cumulative sums $\log \sum_{j=k}^{C} \exp\!\big(z^{s}_{\pi^{\star}(j)}\big)$ (a prefix scan over the reversed permuted logits).
  - Compute the per-position loss $\ell_k = \log \sum_{j=k}^{C} \exp\!\big(z^{s}_{\pi^{\star}(j)}\big) - z^{s}_{\pi^{\star}(k)}$ and weight it by $w_k = p^{t}_{\pi^{\star}(k)}$.
- Sum the weighted losses and average over the $B$ examples. Backpropagate.
The additional computation is dominated by sorting, at $O(C \log C)$ per example. On datasets such as ImageNet ($C = 1000$), batched GPU implementations render this overhead negligible.
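A minimal batched PyTorch sketch of these steps is given below. Variable names and implementation details (e.g., using the unnormalized teacher softmax mass as the per-position weight) are assumptions on our part and may differ from the paper’s reference implementation:

```python
import torch
import torch.nn.functional as F

def pld_loss(student_logits: torch.Tensor,
             teacher_logits: torch.Tensor,
             targets: torch.Tensor,
             tau: float = 1.0) -> torch.Tensor:
    """Confidence-weighted list-wise (Plackett-Luce) distillation loss.

    student_logits, teacher_logits: (B, C); targets: (B,) class indices.
    """
    B, C = student_logits.shape

    # Teacher confidence weights from a temperature-scaled softmax.
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)               # (B, C)

    # Teacher-optimal permutation: true label first, then the remaining
    # classes in descending order of teacher logit.
    order = teacher_logits.argsort(dim=-1, descending=True)           # (B, C)
    keep = order != targets.unsqueeze(1)                              # drop the true label
    perm = torch.cat([targets.unsqueeze(1),
                      order[keep].view(B, C - 1)], dim=1)             # (B, C)

    # Student logits and teacher weights gathered in permutation order.
    z_s = student_logits.gather(1, perm)                              # (B, C)
    w = p_teacher.gather(1, perm)                                     # (B, C)

    # Suffix log-sum-exp: log sum_{j >= k} exp(z_s[:, j]).
    suffix_lse = torch.logcumsumexp(z_s.flip(-1), dim=-1).flip(-1)    # (B, C)

    # Weighted negative log-likelihood of the permutation under the PL model.
    per_position = w * (suffix_lse - z_s)                             # (B, C)
    return per_position.sum(dim=-1).mean()
```

In a training loop one would call something like `loss = pld_loss(student(x), teacher(x).detach(), y, tau=1.0)` and backpropagate as usual (names here are illustrative).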
4. Empirical Results on Standard Benchmarks
Experiments conducted on ImageNet-1K with students (ResNet-50, ViT-Small) distilled from various teachers (ResNet, MobileNet-v4, ViT) over 100 epochs, using identical optimization and data pipelines for KD, DIST, and PLD, show the following Top-1 improvements:
- Homogeneous student/teacher architectures: PLD improves Top-1 by an average of +0.42% over DIST (Huang et al., 2022), and +1.04% over KD (Hinton et al., 2015). For instance, in the ResNet-152→ResNet-50 case, KD achieves 76.80%, DIST 76.60%, and PLD 77.30%.
- Heterogeneous distillation setups: PLD yields gains of +0.48% vs DIST and +1.09% vs KD on average. The largest single gain vs KD is +1.55% for MobileNet-v4 Conv-Large, and +0.70% vs DIST for ResNet-152.
- Extended training (100→300 epochs): PLD retains its empirical gains, with improvements over KD and DIST in the longer schedule similar to, or slightly exceeding, those observed in the shorter one (Bassam et al., 14 Jun 2025).
5. Comparison with Previous Distillation Methods
| Aspect | KD (Hinton et al., 2015) | DIST (Huang et al., 2022) | PLD (Bassam et al., 14 Jun 2025) |
|---|---|---|---|
| Loss Structure | CE + distill (KL divergence) | CE + Pearson-correlation on logits | Single list-wise ranking loss |
| Tuning Required | Requires mix-weight $\alpha$ and temperature $\tau$ | Requires a mix-weight, CE vs distillation | No auxiliary mix-weight |
| Rank Transfer | Matches marginal probabilities | Matches intra/inter-class correlations | Transfers teacher’s full ranking |
| Convexity | Yes (per term) | Yes (correlation/CE) | Yes (full loss) |
| Translation-Invariant | Yes for CE, by construction | Yes (correlation on logits) | Yes (PL property) |
| Hyperparameters | $\alpha$, $\tau$ | CE-vs-correlation weight | Single temperature $\tau$ |
PLD enforces the true label’s dominance (subsuming cross-entropy) and additionally injects the full teacher ordering, confidence-weighted at each rank, through a single convex, translation-invariant term. Unlike prior methods, PLD requires no auxiliary mixing weight and is robust to the choice of softmax temperature within the range examined in the paper.
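The translation invariance noted in the table follows directly from the PL factorization: adding a constant $c$ to every logit cancels term by term,

$$\frac{\exp\!\big(z_{\pi(k)} + c\big)}{\sum_{j=k}^{C} \exp\!\big(z_{\pi(j)} + c\big)} \;=\; \frac{e^{c}\,\exp\!\big(z_{\pi(k)}\big)}{e^{c}\sum_{j=k}^{C} \exp\!\big(z_{\pi(j)}\big)} \;=\; \frac{\exp\!\big(z_{\pi(k)}\big)}{\sum_{j=k}^{C} \exp\!\big(z_{\pi(j)}\big)},$$

so both the PL ranking probability and the PLD loss are unchanged.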
6. Practical Considerations and Extensions
Optimal performance is observed at a moderate teacher-softmax temperature, with little sensitivity across the range tested. PLD does not introduce a cross-entropy mixing coefficient, has manageable computational overhead, and is compatible with standard optimizers (e.g., LAMB, AdamW, Adan). The framework allows for further extension:
- Curriculum weighting: annealing the position weights $w_k$ across epochs to emphasize either the top-1 label or the full ranking as training progresses (see the sketch after this list).
- Label set adaptation: For tasks with mismatched or partial label sets, restrict the PL permutation to shared classes.
- Task generalization: The PLD framework applies to structured output domains, such as sequence modeling or reinforcement learning, wherever a PL loss is appropriate.
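One possible instantiation of the curriculum-weighting idea (purely illustrative; the paper does not prescribe a specific schedule) interpolates between cross-entropy-style weights and the teacher’s confidence weights over the course of training:

```python
import torch

def curriculum_weights(p_teacher_perm: torch.Tensor,
                       epoch: int,
                       total_epochs: int) -> torch.Tensor:
    """Hypothetical annealing schedule: begin near plain cross-entropy
    (all weight on position 1, the true label) and move toward the teacher's
    confidence weights (full PLD) as training progresses.

    p_teacher_perm: teacher softmax mass gathered in permutation order, (B, C).
    """
    ce_weights = torch.zeros_like(p_teacher_perm)
    ce_weights[:, 0] = 1.0                           # the true label sits at position 1
    alpha = epoch / max(total_epochs - 1, 1)         # anneals 0 -> 1 over training
    return (1.0 - alpha) * ce_weights + alpha * p_teacher_perm
```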
7. Context and Significance
Plackett-Luce Distillation provides a theoretically principled, efficient, and empirically validated mechanism for transferring the entirety of a teacher network’s predictive knowledge to a student. By framing distillation as a list-wise, confidence-weighted ranking task, PLD systematically leverages more nuanced teacher information than both marginal probability-matching and correlation-based approaches. The convexity, translation invariance, and lack of hyperparameter tuning requirements facilitate both practical adoption and theoretical analysis. These properties position PLD as a general-purpose distillation objective, especially relevant in domains where comprehensive teacher guidance yields tangible improvements over marginal or pairwise matching strategies (Bassam et al., 14 Jun 2025).