Progressive Self-Knowledge Distillation (PS-KD)

Updated 25 December 2025
  • PS-KD is a regularization method that adapts training targets by gradually blending one-hot labels with the model's prior predictions to reduce overfitting and overconfidence.
  • It employs a linear schedule to transition from hard targets to soft targets, effectively incorporating implicit hard-example mining via gradient rescaling.
  • Empirical results across image classification, object detection, and machine translation tasks indicate notable improvements in accuracy, calibration, and ranking quality.

Progressive Self-Knowledge Distillation (PS-KD) is a regularization method for supervised deep learning models that adaptively blends one-hot ground truth targets with the model’s own past predictions. It progressively distills a network's own outputs to soften hard targets over the course of training, resulting in improved generalization, enhanced calibration, and superior ranking quality across a range of tasks including image classification, object detection, and machine translation. PS-KD is model-agnostic, easily combinable with other regularization methods, and requires only a single hyperparameter to control the mixing schedule (Kim et al., 2020).

1. Motivation and Background

Supervised deep neural networks trained with hard one-hot targets are susceptible to overconfidence and overfitting, particularly in settings lacking data diversity or with overparameterized architectures. Conventional label smoothing uniformly softens targets but applies a static transformation, which may conflict with adaptive or sample-dependent regularizers such as CutMix. PS-KD is introduced to address these limitations by allowing the model to act as its own teacher, progressively using its prior predictions to refine training targets as learning progresses. This provides strong regularization without requiring auxiliary teacher models, additional architecture changes, or hand-crafted difficulty weighting.

2. Adaptive Target Refinement in PS-KD

For an input $x$ with one-hot ground-truth label $y \in \{e_1, \ldots, e_K\}$, the model's softmax output at training epoch $t$ is $p_t(x) = (p_{t,1}, \ldots, p_{t,K})$. The target refinement mechanism is formalized as:

$$y_t(x) = (1-\lambda_t) \cdot y + \lambda_t \cdot p_{t-1}(x)$$

with $\lambda_t \in [0,1]$ controlling the trust placed in the previous epoch's predictions $p_{t-1}(x)$. $\lambda_t$ typically follows a linear schedule:

$$\lambda_t = \lambda_T \cdot \frac{t}{T}$$

where $T$ is the total number of training epochs and $\lambda_T$ is the final interpolation weight. This produces a gradual transition from hard, one-hot targets to a blend that becomes increasingly influenced by the model's own predictive distribution as training converges.
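
As a concrete illustration, the schedule and the target blend above can be written in a few lines of PyTorch. This is a minimal sketch under assumed tensor shapes, not the authors' reference implementation; the helper names (lambda_at_epoch, pskd_targets) are introduced here purely for illustration.

```python
import torch

def lambda_at_epoch(t: int, total_epochs: int, lambda_final: float) -> float:
    """Linear schedule: lambda_t = lambda_T * t / T."""
    return lambda_final * t / total_epochs

def pskd_targets(onehot: torch.Tensor, prev_probs: torch.Tensor, lam: float) -> torch.Tensor:
    """Blend one-hot labels with the previous epoch's softmax outputs.

    onehot:     (batch, K) one-hot ground-truth labels y
    prev_probs: (batch, K) cached predictions p_{t-1}(x)
    """
    return (1.0 - lam) * onehot + lam * prev_probs

# Example: at epoch t = 30 of T = 300 with lambda_T = 0.8, lambda_t = 0.08,
# so the refined targets are still dominated by the one-hot ground truth.
```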

3. Loss Function and Optimization

At each epoch $t$, the PS-KD loss is:

$$L_{KD,t}(x, y) = H(y_t(x), p_t(x)) = -\sum_{i=1}^{K} y_{t,i}(x) \log p_{t,i}(x)$$

Substituting for $y_t(x)$, the loss becomes:

$$L_{KD,t} = -\sum_{i=1}^{K} \left[(1-\lambda_t)\, y_i + \lambda_t\, p_{t-1,i}\right] \log p_{t,i}$$

Optimization proceeds using standard stochastic gradient descent (SGD) or other optimizers. The algorithm operates as follows:

  • For each epoch, compute the current $\lambda_t$.
  • For each training sample, build the refined target $y_t(x)$ by linearly combining the ground truth $y$ with the cached prediction $p_{t-1}(x)$.
  • Perform the forward pass, compute the loss, and update the model weights.
  • Update the per-sample cache to store $p_t(x)$ for use in the next epoch.

Training may store the entire cache in memory (or on disk, if resources are constrained), and only a single additional hyperparameter, $\lambda_T$, is introduced. No modifications to the model architecture and no auxiliary networks are necessary.
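
One possible realization of this loop in PyTorch is sketched below. It is a hedged illustration rather than the paper's reference code: the toy data, the cache initialized to one-hot labels, and the per-batch cache update are assumptions made here, and a production implementation would follow the original paper for details such as augmentation and cache initialization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Toy setup (assumed for illustration): N samples, D features, K classes, T epochs.
N, D, K, T, lambda_T = 1000, 20, 10, 30, 0.8
X, y = torch.randn(N, D), torch.randint(0, K, (N,))
loader = DataLoader(TensorDataset(torch.arange(N), X, y), batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(D, 64), nn.ReLU(), nn.Linear(64, K))
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Per-sample cache of the previous epoch's softmax outputs, initialized to one-hot
# labels so that the first epoch reduces to ordinary cross-entropy (an assumption).
pred_cache = F.one_hot(y, K).float()

for t in range(1, T + 1):
    lam = lambda_T * t / T                                    # linear schedule lambda_t
    for idx, xb, yb in loader:
        onehot = F.one_hot(yb, K).float()
        soft_targets = (1 - lam) * onehot + lam * pred_cache[idx]

        logits = model(xb)
        log_probs = F.log_softmax(logits, dim=1)
        loss = -(soft_targets * log_probs).sum(dim=1).mean()  # soft-target cross-entropy

        opt.zero_grad()
        loss.backward()
        opt.step()

        # Cache p_t(x) for use as the teacher signal in the next epoch.
        pred_cache[idx] = F.softmax(logits.detach(), dim=1)
```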

4. Implicit Hard-Example Mining via Gradient Rescaling

PS-KD automatically implements a hard-example mining effect by rescaling the contribution of each example to the gradient. The per-class logit gradient at epoch $t$ is:

$$\frac{\partial L_{KD,t}}{\partial z_i} = (1-\lambda_t)(p_{t,i} - y_i) + \lambda_t(p_{t,i} - p_{t-1,i})$$

For the ground-truth class ($GT$), the gradient is rescaled relative to the standard cross-entropy gradient by the factor:

$$r_t = 1 - \lambda_t \cdot \frac{1 - p_{t-1,GT}}{1 - p_{t,GT}}$$
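
This factor follows directly from the logit gradient above; for the ground-truth class ($y_{GT} = 1$), the derivation (spelled out here for completeness) is

$$\frac{\partial L_{KD,t}}{\partial z_{GT}} = (1-\lambda_t)(p_{t,GT} - 1) + \lambda_t(p_{t,GT} - p_{t-1,GT}) = (p_{t,GT} - 1) + \lambda_t(1 - p_{t-1,GT}) = r_t \cdot (p_{t,GT} - 1),$$

i.e., the standard cross-entropy gradient $(p_{t,GT} - 1)$ scaled by $r_t$. As an illustrative numerical example (values invented here), with $\lambda_t = 0.5$ an easy example with $(p_{t-1,GT}, p_{t,GT}) = (0.85, 0.9)$ gives $r_t = 0.25$, while a harder example with $(0.4, 0.5)$ gives $r_t = 0.4$ and thus retains a larger fraction of its gradient.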

Harder examples (with smaller $p_{t-1,GT}$) experience smaller reductions in gradient magnitude and are thus emphasized. This mechanism ensures that as $\lambda_t$ increases, focus shifts toward examples the model classifies with less confidence, directly integrating hard-example mining without explicit scheduling or additional sample weighting.

5. Compatibility with Existing Regularization Techniques

PS-KD is compatible with, and complementary to, a broad spectrum of existing regularization methods, including:

  • Data augmentations (e.g., Cutout, Mixup, CutMix, AugMix)
  • Architectural regularizers (e.g., dropout, weight decay, batch normalization)
  • Label-level smoothing techniques (e.g., Label Smoothing, DisturbLabel)

PS-KD can be applied exclusively, to a subset of a minibatch, or in tandem with other methods (e.g., half CutMix, half PS-KD). Empirically, combinations such as CutMix+PS-KD outperform either method used in isolation across standard datasets and architectures.
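
As a hedged sketch of the half-and-half strategy (the exact split, the simplified CutMix box sampling, and the helper names below are assumptions for illustration, not the authors' recipe), one can build CutMix-mixed targets for one half of a batch and PS-KD targets for the other:

```python
import numpy as np
import torch
import torch.nn.functional as F

def rand_bbox(h: int, w: int, lam: float):
    """Sample a CutMix-style box whose area is roughly (1 - lam) of the image."""
    cut = np.sqrt(1.0 - lam)
    ch, cw = int(h * cut), int(w * cut)
    cy, cx = np.random.randint(h), np.random.randint(w)
    return (np.clip(cy - ch // 2, 0, h), np.clip(cy + ch // 2, 0, h),
            np.clip(cx - cw // 2, 0, w), np.clip(cx + cw // 2, 0, w))

def half_cutmix_half_pskd(images, labels, prev_probs, lam_pskd, num_classes):
    """CutMix the first half of the batch; apply PS-KD targets to the second half."""
    half = images.size(0) // 2
    onehot = F.one_hot(labels, num_classes).float()
    targets = onehot.clone()

    # CutMix half: paste a patch from a shuffled partner and mix one-hot targets by area.
    lam_cm = float(np.random.beta(1.0, 1.0))
    perm = torch.randperm(half)
    y1, y2, x1, x2 = rand_bbox(images.size(2), images.size(3), lam_cm)
    images[:half, :, y1:y2, x1:x2] = images[:half][perm][:, :, y1:y2, x1:x2]
    area = (y2 - y1) * (x2 - x1) / (images.size(2) * images.size(3))
    targets[:half] = (1 - area) * onehot[:half] + area * onehot[:half][perm]

    # PS-KD half: blend one-hot labels with cached past predictions.
    targets[half:] = (1 - lam_pskd) * onehot[half:] + lam_pskd * prev_probs[half:]
    return images, targets
```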

6. Empirical Performance Across Benchmarks

The application of PS-KD yields consistent improvements in accuracy, calibration, and ranking metrics for various computer vision and language tasks:

| Task (Model) | Metric | Baseline | + PS-KD | + PS-KD & CutMix |
| --- | --- | --- | --- | --- |
| CIFAR-100 (ResNet, etc.) | Top-1 error | baseline | −1–2% vs. baseline | additional −0.5–1% |
| CIFAR-100 (ResNet, etc.) | ECE | 10–12% | <5% (often <2%) | |
| CIFAR-100 (ResNet, etc.) | AURC | baseline | 10–20% lower | |
| ImageNet (ResNet-152) | Top-1 error | 22.19% | 21.41% | 20.76% |
| ImageNet (ResNet-152) | ECE | baseline | −2–3 pts | <1% |
| PASCAL VOC (Faster R-CNN) | mAP | 78.3% | 79.5% | 79.7% |
| IWSLT15 EN→DE | BLEU | 28.5 | 30.0 | |
| IWSLT15 DE→EN | BLEU | 34.6 | 36.2 | |
| Multi30K DE→EN | BLEU | 29.0 | 32.3 | |
ECE (Expected Calibration Error) and AURC (Area Under the Risk-Coverage curve) decrease substantially, indicating improved confidence calibration and ranking performance. In summary, PS-KD provides a lightweight and effective alternative to traditional knowledge distillation frameworks and complements both architectural and augmentation-based regularizers (Kim et al., 2020).

7. Implementation Considerations and Hyperparameterization

PS-KD requires minimal implementation overhead. The primary consideration is providing per-sample access to $p_{t-1}(x)$. This can be addressed either by keeping two model snapshots in GPU memory (one acting as the teacher, one as the student) or by caching per-sample predictions in memory or on disk. In practice, only the previous soft prediction for each sample $x$ needs to be stored, and memory constraints can be mitigated by recomputing or discarding older predictions as necessary.
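
A minimal sketch of the snapshot-based alternative (an assumed structure, not the paper's reference code) keeps a frozen copy of the previous epoch's weights as the teacher and computes $p_{t-1}(x)$ on the fly:

```python
import copy
import torch
import torch.nn.functional as F

def make_teacher(model: torch.nn.Module) -> torch.nn.Module:
    """Freeze a copy of the current weights to serve as the next epoch's teacher."""
    teacher = copy.deepcopy(model).eval()
    for p in teacher.parameters():
        p.requires_grad_(False)
    return teacher

def pskd_loss(model, teacher, x, y, lam, num_classes):
    """Soft-target cross-entropy against (1 - lam) * one-hot + lam * teacher softmax."""
    with torch.no_grad():
        prev_probs = F.softmax(teacher(x), dim=1)  # p_{t-1}(x), recomputed on the fly
    targets = (1 - lam) * F.one_hot(y, num_classes).float() + lam * prev_probs
    log_probs = F.log_softmax(model(x), dim=1)
    return -(targets * log_probs).sum(dim=1).mean()

# At the end of each epoch: teacher = make_teacher(model)
```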

The algorithm introduces a single additional hyperparameter, $\lambda_T$, which controls the final target blend ratio. This parameter is easy to tune; $\lambda_t$ typically follows a linear schedule that reaches $\lambda_T$ at the final epoch. No further changes to the model, optimizer, or training pipeline are required beyond a standard supervised workflow.

PS-KD is thus positioned as an accessible, generic, and robust method for enhanced network regularization across vision and language domains (Kim et al., 2020).

References

  • Kim, K., Ji, B., Yoon, D., and Hwang, S. (2020). Self-Knowledge Distillation with Progressive Refinement of Targets. arXiv preprint.
