Progressive Self-Knowledge Distillation (PS-KD)
- PS-KD is a regularization method that adapts training targets by gradually blending one-hot labels with the model's prior predictions to reduce overfitting and overconfidence.
- It employs a linear schedule to transition from hard targets to soft targets, effectively incorporating implicit hard-example mining via gradient rescaling.
- Empirical results across image classification, object detection, and machine translation tasks indicate notable improvements in accuracy, calibration, and ranking quality.
Progressive Self-Knowledge Distillation (PS-KD) is a regularization method for supervised deep learning models that adaptively blends one-hot ground truth targets with the model’s own past predictions. It progressively distills a network's own outputs to soften hard targets over the course of training, resulting in improved generalization, enhanced calibration, and superior ranking quality across a range of tasks including image classification, object detection, and machine translation. PS-KD is model-agnostic, easily combinable with other regularization methods, and requires only a single hyperparameter to control the mixing schedule (Kim et al., 2020).
1. Motivation and Background
Supervised deep neural networks trained with hard one-hot targets are susceptible to overconfidence and overfitting, particularly in settings lacking data diversity or with overparameterized architectures. Conventional label smoothing uniformly softens targets but applies a static transformation, which may conflict with adaptive or sample-dependent regularizers such as CutMix. PS-KD is introduced to address these limitations by allowing the model to act as its own teacher, progressively using its prior predictions to refine training targets as learning progresses. This provides strong regularization without requiring auxiliary teacher models, additional architecture changes, or hand-crafted difficulty weighting.
2. Adaptive Target Refinement in PS-KD
For an input $x$ with one-hot ground-truth label $y$, let $P_t(x)$ denote the model's softmax output at training epoch $t$. The target refinement mechanism is formalized as

$$
\tilde{y}_t = (1 - \alpha_t)\,y + \alpha_t\,P_{t-1}(x),
$$

with $\alpha_t \in [0, 1]$ controlling the trust placed in the previous epoch's predictions $P_{t-1}(x)$. $\alpha_t$ typically follows a linear schedule,

$$
\alpha_t = \alpha_T \cdot \frac{t}{T},
$$

where $T$ is the total number of training epochs and $\alpha_T$ is the final interpolation weight. This produces a gradual transition from hard, one-hot targets to a blend that becomes increasingly influenced by the model's own predictive distribution as training converges.
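As a concrete illustration, the following minimal Python/PyTorch snippet (not from the paper; the function name and the default $\alpha_T = 0.8$ are illustrative assumptions) computes the refined target for a single sample:

```python
import torch

def refined_target(one_hot, prev_pred, epoch, total_epochs, alpha_T=0.8):
    """Blend the hard label with last epoch's softmax output (PS-KD target)."""
    alpha_t = alpha_T * epoch / total_epochs          # linear schedule
    return (1.0 - alpha_t) * one_hot + alpha_t * prev_pred

# Toy 3-class example: hard label is class 0; the previous-epoch prediction is fairly confident.
y = torch.tensor([1.0, 0.0, 0.0])
p_prev = torch.tensor([0.7, 0.2, 0.1])
print(refined_target(y, p_prev, epoch=50, total_epochs=100))
# alpha_t = 0.4 -> tensor([0.8800, 0.0800, 0.0400])
```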
3. Loss Function and Optimization
At each epoch $t$, the PS-KD loss is the cross-entropy $H$ between the refined target and the current prediction:

$$
\mathcal{L}_{\mathrm{KD},t}(x, y) = H\!\left(\tilde{y}_t,\; P_t(x)\right), \qquad H(q, p) = -\sum_i q_i \log p_i.
$$

Substituting $(1 - \alpha_t)\,y + \alpha_t\,P_{t-1}(x)$ for $\tilde{y}_t$, the loss becomes

$$
\mathcal{L}_{\mathrm{KD},t}(x, y) = H\!\left((1 - \alpha_t)\,y + \alpha_t\,P_{t-1}(x),\; P_t(x)\right).
$$
Optimization proceeds using standard stochastic gradient descent (SGD) or other optimizers. The algorithm operates as follows:
- For each epoch $t$, compute the current $\alpha_t$.
- For each training sample, build the refined target $\tilde{y}_t$ by linearly combining the ground truth $y$ and the cached prediction $P_{t-1}(x)$.
- Perform the forward pass, compute the loss, and update the model weights.
- Update the cache per sample to store $P_t(x)$ for use in the next epoch.
Training may store the entire prediction cache in memory (or on disk, if resources are constrained), and only a single additional hyperparameter, $\alpha_T$, is introduced. No modifications to the model architecture or auxiliary networks are necessary.
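The loop above can be realized with very little code. Below is a minimal PyTorch-style sketch under stated assumptions (all names are illustrative, not the authors' reference implementation): the data loader yields dataset indices alongside inputs and labels so cached predictions can be looked up, and the cache is kept as a CPU tensor.

```python
import torch
import torch.nn.functional as F

def train_ps_kd(model, train_loader, optimizer, num_epochs, num_classes,
                num_samples, alpha_T=0.8, device="cuda"):
    # Per-sample cache of last epoch's softmax outputs (kept on CPU to save GPU memory).
    pred_cache = torch.zeros(num_samples, num_classes)

    for epoch in range(1, num_epochs + 1):
        alpha_t = alpha_T * epoch / num_epochs  # linear schedule for the blend weight

        for inputs, labels, sample_idx in train_loader:  # loader must also yield indices
            inputs, labels = inputs.to(device), labels.to(device)
            one_hot = F.one_hot(labels, num_classes).float()

            if epoch == 1:
                targets = one_hot  # no cached predictions yet: plain hard targets
            else:
                prev_pred = pred_cache[sample_idx].to(device)
                targets = (1.0 - alpha_t) * one_hot + alpha_t * prev_pred

            logits = model(inputs)
            log_probs = F.log_softmax(logits, dim=1)
            # Cross-entropy against the (soft) refined targets.
            loss = -(targets * log_probs).sum(dim=1).mean()

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Cache current predictions for use as targets in the next epoch.
            pred_cache[sample_idx] = F.softmax(logits, dim=1).detach().cpu()
```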
4. Implicit Hard-Example Mining via Gradient Rescaling
PS-KD automatically implements a hard-example mining effect by rescaling the contribution of each example to the gradient. The gradient of the loss with respect to the logit $z_i$ at epoch $t$ is

$$
\frac{\partial \mathcal{L}_{\mathrm{KD},t}}{\partial z_i} = P_{t,i}(x) - (1 - \alpha_t)\,y_i - \alpha_t\,P_{t-1,i}(x),
$$

where $P_{t,i}(x)$ denotes the $i$-th component of $P_t(x)$. For the ground-truth class ($y_{GT} = 1$), the gradient magnitude is rescaled relative to standard cross-entropy with hard targets by

$$
1 - \alpha_t\,\frac{1 - P_{t-1,GT}(x)}{1 - P_{t,GT}(x)}.
$$

Harder examples (those with smaller ground-truth confidence $P_{t,GT}(x)$) experience smaller reductions in gradient magnitude and are thus emphasized relative to confidently classified examples. This mechanism ensures that as $\alpha_t$ increases, focus shifts toward examples the model classifies with less confidence, directly integrating hard-example mining without explicit scheduling or additional sample weighting.
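The identities above are easy to verify numerically. The short PyTorch autograd check below (an illustration, not from the paper; the class index, $\alpha_t$ value, and random predictions are arbitrary assumptions) confirms that the gradient with respect to the logits equals $P_t(x) - \tilde{y}_t$ and that the ground-truth logit gradient equals the hard-target gradient magnitude scaled by the factor above:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_classes, alpha_t, gt = 5, 0.4, 2                        # arbitrary choices
logits = torch.randn(num_classes, requires_grad=True)        # current-epoch logits z
p_prev = F.softmax(torch.randn(num_classes), dim=0)          # cached prediction P_{t-1}(x)
one_hot = F.one_hot(torch.tensor(gt), num_classes).float()

soft_target = (1 - alpha_t) * one_hot + alpha_t * p_prev     # refined PS-KD target
loss = -(soft_target * F.log_softmax(logits, dim=0)).sum()   # cross-entropy with soft target
loss.backward()

p_t = F.softmax(logits.detach(), dim=0)                      # current prediction P_t(x)
# Gradient w.r.t. logits equals P_t(x) minus the refined target.
assert torch.allclose(logits.grad, p_t - soft_target, atol=1e-6)

# Ground-truth logit gradient = rescaling factor times the hard-target gradient magnitude.
factor = 1 - alpha_t * (1 - p_prev[gt]) / (1 - p_t[gt])
assert torch.allclose(-logits.grad[gt], factor * (1 - p_t[gt]), atol=1e-6)
print(factor.item())
```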
5. Compatibility with Existing Regularization Techniques
PS-KD is compatible and additive with a broad spectrum of existing regularization methods, including:
- Data augmentations (e.g., Cutout, Mixup, CutMix, AugMix)
- Architectural regularizers (e.g., dropout, weight decay, batch normalization)
- Label-level smoothing techniques (e.g., Label Smoothing, DisturbLabel)
PS-KD can be applied exclusively, to a subset of a minibatch, or in tandem with other methods (e.g., half CutMix, half PS-KD). Empirically, combinations such as CutMix+PS-KD outperform either method used in isolation across standard datasets and architectures.
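As one concrete (hypothetical) way to combine the two, the sketch below applies standard CutMix box mixing to a batch and mixes the PS-KD refined targets with the same ratio; the helper names `rand_bbox` and `cutmix_ps_kd` and the Beta(1, 1) sampling are assumptions for illustration, not the paper's reference code:

```python
import numpy as np
import torch

def rand_bbox(h, w, lam):
    """Sample a CutMix box whose area fraction is roughly (1 - lam)."""
    cut_ratio = np.sqrt(1.0 - lam)
    cut_h, cut_w = int(h * cut_ratio), int(w * cut_ratio)
    cy, cx = np.random.randint(h), np.random.randint(w)
    y1, y2 = np.clip(cy - cut_h // 2, 0, h), np.clip(cy + cut_h // 2, 0, h)
    x1, x2 = np.clip(cx - cut_w // 2, 0, w), np.clip(cx + cut_w // 2, 0, w)
    return y1, y2, x1, x2

def cutmix_ps_kd(inputs, soft_targets, beta=1.0):
    """Mix a batch in place with CutMix; soft_targets are the PS-KD refined targets."""
    lam = np.random.beta(beta, beta)
    perm = torch.randperm(inputs.size(0))
    y1, y2, x1, x2 = rand_bbox(inputs.size(2), inputs.size(3), lam)
    inputs[:, :, y1:y2, x1:x2] = inputs[perm, :, y1:y2, x1:x2]
    # Recompute lam from the actual box so the target mix matches the pixel mix.
    lam = 1 - (y2 - y1) * (x2 - x1) / (inputs.size(2) * inputs.size(3))
    mixed_targets = lam * soft_targets + (1 - lam) * soft_targets[perm]
    return inputs, mixed_targets
```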
6. Empirical Performance Across Benchmarks
The application of PS-KD yields consistent improvements in accuracy, calibration, and ranking metrics for various computer vision and language tasks:
| Task & Model | Metric | Baseline | + PS-KD | + PS-KD & CutMix |
|---|---|---|---|---|
| CIFAR-100 (ResNet, etc.) | Top-1 Err | baseline | −1–2% vs. baseline | additional −0.5–1% |
| | ECE | 10–12% | <5% (often <2%) | |
| | AURC | baseline | 10–20% ↓ | |
| ImageNet (ResNet-152) | Top-1 Err | 22.19% | 21.41% | 20.76% |
| | ECE | baseline | −2–3 pts | <1% |
| PASCAL VOC (Faster R-CNN) | mAP | 78.3% | 79.5% | 79.7% |
| MT (IWSLT15, Multi30K) | EN→DE BLEU | 28.5 | 30.0 | |
| | DE→EN BLEU | 34.6 | 36.2 | |
| | Multi30K DE→EN BLEU | 29.0 | 32.3 | |
ECE (Expected Calibration Error) and AURC (Area Under the Risk-Coverage curve) decrease substantially, indicating better-calibrated confidence estimates and improved ranking of predictions by confidence. In summary, PS-KD provides a lightweight and effective alternative to traditional knowledge distillation frameworks and complements both architectural and augmentation-based regularizers (Kim et al., 2020).
7. Implementation Considerations and Hyperparameterization
PS-KD requires minimal implementation overhead. The primary consideration is maintaining a per-sample cache of $P_{t-1}(x)$. This can be addressed either by keeping two model snapshots in GPU memory (the previous-epoch model acting as teacher, the current model being trained) and recomputing $P_{t-1}(x)$ on the fly, or by storing the predictions in memory or on disk. In practice, only $P_{t-1}(x)$ (the previous softmax output for each sample $x$) must be stored; memory constraints can be mitigated by recomputing or discarding older predictions as necessary.
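For the disk-backed option, a memory-mapped array is one simple realization. The sketch below is an assumption-laden illustration (the file path, float16 storage, and class name are arbitrary choices), not the authors' implementation:

```python
import numpy as np
import torch

class PredictionCache:
    """Disk-backed per-sample cache of previous-epoch softmax outputs."""

    def __init__(self, num_samples, num_classes, path="ps_kd_cache.dat"):
        # float16 halves the footprint; "w+" creates (or overwrites) the backing file.
        self.buf = np.memmap(path, dtype=np.float16, mode="w+",
                             shape=(num_samples, num_classes))

    def update(self, sample_idx, probs):
        # sample_idx: 1-D LongTensor of dataset indices; probs: softmax outputs (any device).
        self.buf[sample_idx.cpu().numpy()] = probs.detach().cpu().numpy().astype(np.float16)

    def fetch(self, sample_idx, device):
        # Fancy indexing on a memmap returns an in-memory copy.
        prev = self.buf[sample_idx.cpu().numpy()].astype(np.float32)
        return torch.from_numpy(prev).to(device)
```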
The algorithm introduces a single additional hyperparameter, $\alpha_T$, controlling the final target blend ratio. This parameter is easy to tune, and $\alpha_t$ typically follows a linear schedule that reaches $\alpha_T$ at the final epoch. No additional changes to the model, optimizer, or computational pipeline are required beyond standard supervised workflows.
PS-KD is thus positioned as an accessible, generic, and robust method for enhanced network regularization across vision and language domains (Kim et al., 2020).