Progressive Self-Knowledge Distillation (PS-KD)

Updated 25 December 2025
  • PS-KD is a regularization method that adapts training targets by gradually blending one-hot labels with the model's prior predictions to reduce overfitting and overconfidence.
  • It employs a linear schedule to transition from hard targets to soft targets, effectively incorporating implicit hard-example mining via gradient rescaling.
  • Empirical results across image classification, object detection, and machine translation tasks indicate notable improvements in accuracy, calibration, and ranking quality.

Progressive Self-Knowledge Distillation (PS-KD) is a regularization method for supervised deep learning models that adaptively blends one-hot ground truth targets with the model’s own past predictions. It progressively distills a network's own outputs to soften hard targets over the course of training, resulting in improved generalization, enhanced calibration, and superior ranking quality across a range of tasks including image classification, object detection, and machine translation. PS-KD is model-agnostic, easily combinable with other regularization methods, and requires only a single hyperparameter to control the mixing schedule (Kim et al., 2020).

1. Motivation and Background

Supervised deep neural networks trained with hard one-hot targets are susceptible to overconfidence and overfitting, particularly in settings lacking data diversity or with overparameterized architectures. Conventional label smoothing uniformly softens targets but applies a static transformation, which may conflict with adaptive or sample-dependent regularizers such as CutMix. PS-KD is introduced to address these limitations by allowing the model to act as its own teacher, progressively using its prior predictions to refine training targets as learning progresses. This provides strong regularization without requiring auxiliary teacher models, additional architecture changes, or hand-crafted difficulty weighting.

2. Adaptive Target Refinement in PS-KD

For an input $x$ with one-hot ground-truth label $y \in \{e_1, \ldots, e_K\}$, the model's softmax output at training epoch $t$ is $p_t(x) = (p_{t,1}, \ldots, p_{t,K})$. The target refinement mechanism is formalized as:

$$y_t(x) = (1-\lambda_t) \cdot y + \lambda_t \cdot p_{t-1}(x)$$

with $\lambda_t \in [0,1]$ controlling the trust placed in the previous epoch's predictions $p_{t-1}(x)$. $\lambda_t$ typically follows a linear schedule:

$$\lambda_t = \lambda_T \cdot \frac{t}{T}$$

where $T$ is the total number of training epochs and $\lambda_T$ is the final interpolation weight. This produces a gradual transition from hard, one-hot targets to a blend that becomes increasingly influenced by the model's own predictive distribution as training converges.
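
As a concrete illustration, the schedule and the target blend above can be written in a few lines of PyTorch. This is a minimal sketch under assumed tensor shapes, not the authors' reference implementation; the helper names (lambda_at_epoch, pskd_targets) are introduced here purely for illustration.

```python
import torch

def lambda_at_epoch(t: int, total_epochs: int, lambda_final: float) -> float:
    """Linear schedule: lambda_t = lambda_T * t / T."""
    return lambda_final * t / total_epochs

def pskd_targets(onehot: torch.Tensor, prev_probs: torch.Tensor, lam: float) -> torch.Tensor:
    """Blend one-hot labels with the previous epoch's softmax outputs.

    onehot:     (batch, K) one-hot ground-truth labels y
    prev_probs: (batch, K) cached predictions p_{t-1}(x)
    """
    return (1.0 - lam) * onehot + lam * prev_probs

# Example: at epoch t = 30 of T = 300 with lambda_T = 0.8, lambda_t = 0.08,
# so the refined targets are still dominated by the one-hot ground truth.
```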

3. Loss Function and Optimization

At each epoch $t$, the PS-KD loss is:

$$L_{KD,t}(x, y) = H(y_t(x), p_t(x)) = -\sum_{i=1}^{K} y_{t,i}(x) \log p_{t,i}(x)$$

Substituting for $y_t(x)$, the loss becomes:

$$L_{KD,t} = -\sum_{i=1}^{K} \left[(1-\lambda_t)\, y_i + \lambda_t\, p_{t-1,i}\right] \log p_{t,i}$$

Optimization proceeds using standard stochastic gradient descent (SGD) or other optimizers. The algorithm operates as follows:

  • For each epoch, compute the current $\lambda_t$.
  • For each training sample, build the refined target $y_t(x)$ by linearly combining the ground truth $y$ with the cached prediction $p_{t-1}(x)$.
  • Perform the forward pass, compute the loss, and update the model weights.
  • Update the per-sample cache to store $p_t(x)$ for use in the next epoch.

Training may store the entire cache in memory (or on disk, if resources are constrained), and only a single additional hyperparameter, $\lambda_T$, is introduced. No modifications to the model architecture and no auxiliary networks are necessary.
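
One possible realization of this loop in PyTorch is sketched below. It is a hedged illustration rather than the paper's reference code: the toy data, the cache initialized to one-hot labels, and the per-batch cache update are assumptions made here, and a production implementation would follow the original paper for details such as augmentation and cache initialization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Toy setup (assumed for illustration): N samples, D features, K classes, T epochs.
N, D, K, T, lambda_T = 1000, 20, 10, 30, 0.8
X, y = torch.randn(N, D), torch.randint(0, K, (N,))
loader = DataLoader(TensorDataset(torch.arange(N), X, y), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, K))
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Per-sample cache of the previous epoch's softmax outputs, initialized to one-hot
# labels so that the first epoch reduces to ordinary cross-entropy (an assumption).
pred_cache = F.one_hot(y, K).float()

for t in range(1, T + 1):
    lam = lambda_T * t / T                                    # linear schedule lambda_t
    for idx, xb, yb in loader:
        onehot = F.one_hot(yb, K).float()
        soft_targets = (1 - lam) * onehot + lam * pred_cache[idx]

        logits = model(xb)
        log_probs = F.log_softmax(logits, dim=1)
        loss = -(soft_targets * log_probs).sum(dim=1).mean()  # soft-target cross-entropy

        opt.zero_grad()
        loss.backward()
        opt.step()

        # Cache p_t(x) for use as the teacher signal in the next epoch.
        pred_cache[idx] = F.softmax(logits.detach(), dim=1)
```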

4. Implicit Hard-Example Mining via Gradient Rescaling

PS-KD automatically implements a hard-example mining effect by rescaling the contribution of each example to the gradient. The per-class logit gradient at epoch $t$ is:

$$\frac{\partial L_{KD,t}}{\partial z_i} = (1-\lambda_t)(p_{t,i} - y_i) + \lambda_t(p_{t,i} - p_{t-1,i})$$

For the ground-truth class ($GT$), the gradient is rescaled relative to the standard cross-entropy gradient by the factor:

$$r_t = 1 - \lambda_t \cdot \frac{1 - p_{t-1,GT}}{1 - p_{t,GT}}$$
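
This factor follows directly from the logit gradient above; for the ground-truth class ($y_{GT} = 1$), the derivation (spelled out here for completeness) is

$$\frac{\partial L_{KD,t}}{\partial z_{GT}} = (1-\lambda_t)(p_{t,GT} - 1) + \lambda_t(p_{t,GT} - p_{t-1,GT}) = (p_{t,GT} - 1) + \lambda_t(1 - p_{t-1,GT}) = r_t \cdot (p_{t,GT} - 1),$$

i.e., the standard cross-entropy gradient $(p_{t,GT} - 1)$ scaled by $r_t$. As an illustrative numerical example (values invented here), with $\lambda_t = 0.5$ an easy example with $(p_{t-1,GT}, p_{t,GT}) = (0.85, 0.9)$ gives $r_t = 0.25$, while a harder example with $(0.4, 0.5)$ gives $r_t = 0.4$ and thus retains a larger fraction of its gradient.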

Harder examples (with smaller $p_{t-1,GT}$) experience smaller reductions in gradient magnitude and are thus emphasized. This mechanism ensures that as $\lambda_t$ increases, focus shifts toward examples the model classifies with less confidence, directly integrating hard-example mining without explicit scheduling or additional sample weighting.

5. Compatibility with Existing Regularization Techniques

PS-KD is compatible with, and complementary to, a broad spectrum of existing regularization methods, including:

  • Data augmentations (e.g., Cutout, Mixup, CutMix, AugMix)
  • Architectural regularizers (e.g., dropout, weight decay, batch normalization)
  • Label-level smoothing techniques (e.g., Label Smoothing, DisturbLabel)

PS-KD can be applied exclusively, to a subset of a minibatch, or in tandem with other methods (e.g., half CutMix, half PS-KD). Empirically, combinations such as CutMix+PS-KD outperform either method used in isolation across standard datasets and architectures.
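
As a hedged sketch of the half-and-half strategy (the exact split, the simplified CutMix box sampling, and the helper names below are assumptions for illustration, not the authors' recipe), one can build CutMix-mixed targets for one half of a batch and PS-KD targets for the other:

```python
import numpy as np
import torch
import torch.nn.functional as F

def rand_bbox(h: int, w: int, lam: float):
    """Sample a CutMix-style box whose area is roughly (1 - lam) of the image."""
    cut = np.sqrt(1.0 - lam)
    ch, cw = int(h * cut), int(w * cut)
    cy, cx = np.random.randint(h), np.random.randint(w)
    return (np.clip(cy - ch // 2, 0, h), np.clip(cy + ch // 2, 0, h),
            np.clip(cx - cw // 2, 0, w), np.clip(cx + cw // 2, 0, w))

def half_cutmix_half_pskd(images, labels, prev_probs, lam_pskd, num_classes):
    """CutMix the first half of the batch; apply PS-KD targets to the second half."""
    half = images.size(0) // 2
    onehot = F.one_hot(labels, num_classes).float()
    targets = onehot.clone()

    # CutMix half: paste a patch from a shuffled partner and mix one-hot targets by area.
    lam_cm = float(np.random.beta(1.0, 1.0))
    perm = torch.randperm(half)
    y1, y2, x1, x2 = rand_bbox(images.size(2), images.size(3), lam_cm)
    images[:half, :, y1:y2, x1:x2] = images[:half][perm][:, :, y1:y2, x1:x2]
    area = (y2 - y1) * (x2 - x1) / (images.size(2) * images.size(3))
    targets[:half] = (1 - area) * onehot[:half] + area * onehot[:half][perm]

    # PS-KD half: blend one-hot labels with cached past predictions.
    targets[half:] = (1 - lam_pskd) * onehot[half:] + lam_pskd * prev_probs[half:]
    return images, targets
```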

6. Empirical Performance Across Benchmarks

The application of PS-KD yields consistent improvements in accuracy, calibration, and ranking metrics for various computer vision and language tasks:

| Task (Model) | Metric | Baseline | + PS-KD | + PS-KD & CutMix |
| --- | --- | --- | --- | --- |
| CIFAR-100 (ResNet, etc.) | Top-1 error | baseline | −1–2% vs. baseline | additional −0.5–1% |
| CIFAR-100 (ResNet, etc.) | ECE | 10–12% | <5% (often <2%) | |
| CIFAR-100 (ResNet, etc.) | AURC | baseline | 10–20% lower | |
| ImageNet (ResNet-152) | Top-1 error | 22.19% | 21.41% | 20.76% |
| ImageNet (ResNet-152) | ECE | baseline | −2–3 pts | <1% |
| PASCAL VOC (Faster R-CNN) | mAP | 78.3% | 79.5% | 79.7% |
| IWSLT15 EN→DE | BLEU | 28.5 | 30.0 | |
| IWSLT15 DE→EN | BLEU | 34.6 | 36.2 | |
| Multi30K DE→EN | BLEU | 29.0 | 32.3 | |
ECE (Expected Calibration Error) and AURC (Area Under the Risk-Coverage curve) decrease substantially, indicating improved confidence calibration and ranking performance. In summary, PS-KD provides a lightweight and effective alternative to traditional knowledge distillation frameworks and complements both architectural and augmentation-based regularizers (Kim et al., 2020).

7. Implementation Considerations and Hyperparameterization

PS-KD requires minimal implementation overhead. The primary consideration is providing per-sample access to $p_{t-1}(x)$. This can be addressed either by keeping two model snapshots in GPU memory (one acting as the teacher, one as the student) or by caching per-sample predictions in memory or on disk. In practice, only the previous soft prediction for each sample $x$ needs to be stored, and memory constraints can be mitigated by recomputing or discarding older predictions as necessary.
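
A minimal sketch of the snapshot-based alternative (an assumed structure, not the paper's reference code) keeps a frozen copy of the previous epoch's weights as the teacher and computes $p_{t-1}(x)$ on the fly:

```python
import copy
import torch
import torch.nn.functional as F

def make_teacher(model: torch.nn.Module) -> torch.nn.Module:
    """Freeze a copy of the current weights to serve as the next epoch's teacher."""
    teacher = copy.deepcopy(model).eval()
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

def pskd_loss(model, teacher, x, y, lam, num_classes):
    """Soft-target cross-entropy against (1 - lam) * one-hot + lam * teacher softmax."""
    with torch.no_grad():
        prev_probs = F.softmax(teacher(x), dim=1)  # p_{t-1}(x), recomputed on the fly
    targets = (1 - lam) * F.one_hot(y, num_classes).float() + lam * prev_probs
    log_probs = F.log_softmax(model(x), dim=1)
    return -(targets * log_probs).sum(dim=1).mean()

# At the end of each epoch: teacher = make_teacher(model)
```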

The algorithm introduces a single additional hyperparameter, $\lambda_T$, which controls the final target blend ratio. This parameter is easy to tune; $\lambda_t$ typically follows a linear schedule that reaches $\lambda_T$ at the final epoch. No further changes to the model, optimizer, or training pipeline are required beyond a standard supervised workflow.

PS-KD is thus positioned as an accessible, generic, and robust method for enhanced network regularization across vision and language domains (Kim et al., 2020).

References

  • Kim, K., Ji, B., Yoon, D., and Hwang, S. (2020). Self-Knowledge Distillation with Progressive Refinement of Targets. arXiv preprint.
