
Logit-Swap Self-Distillation (SLD)

Updated 2 December 2025
  • The paper introduces SLD as a novel approach that swaps logits when the true class is under-ranked to correct misaligned teacher predictions.
  • SLD employs a bi-level teacher alignment mechanism by leveraging both the original teacher and a pseudo-teacher derived from the student for robust distillation.
  • Empirical results on CIFAR-100 and ImageNet demonstrate consistent accuracy improvements over standard KD methods with minimal computational overhead.

Logit-Swap Self-Distillation (SLD) is a knowledge distillation methodology designed to address erroneous knowledge transfer arising in standard logit-based distillation. Conventional frameworks train a compact student model by aligning its output distribution with that of a larger teacher, but if the teacher assigns excessive confidence to the wrong class, this alignment can propagate systematic mistakes. SLD employs swapped logit processing: whenever the true class logit is under-ranked, logits are exchanged to ensure the true class receives the highest score, thereby correcting the ranking while preserving original logit structure. This approach simultaneously leverages the original teacher and a pseudo-teacher derived from the student itself, forming a bi-level distillation scheme. Empirical results indicate robust improvements over previous logit-based and feature-based distillation baselines across diverse architectures and datasets (Limantoro et al., 27 Apr 2025).

1. Theoretical Foundations and Motivation

SLD is motivated by two primary deficiencies in vanilla knowledge distillation (KD): error propagation when the teacher predicts incorrectly, and the inadequacy of existing fixes, which either distort the logit distribution or lose informative non-target context. In classical logit KD, the loss minimized is $\mathrm{KL}(\mathrm{softmax}(z^T/T)\,\|\,\mathrm{softmax}(z^S/T))$, where $z^T$ and $z^S$ denote teacher and student logits, respectively. Erroneous KD occurs when $\arg\max_j z^T_j \neq t$, resulting in the student being misled by the teacher's "dark knowledge." Attempts to repair this, such as adding a constant to the true-class logit, either force unnaturally peaky distributions or degrade valuable secondary information. SLD is constructed on two assumptions: (1) incorrect prediction arises when the true-class logit is not maximal, and (2) there is no optimal additive correction to the true-class logit that preserves the integrity of the original distribution.
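For reference, a minimal PyTorch sketch of this vanilla logit-KD objective follows (a sketch only, written to mirror the formula as stated; the common $T^2$ gradient-scaling factor is deliberately omitted):

    import torch
    import torch.nn.functional as F

    def vanilla_kd_loss(z_s, z_t, T=4.0):
        # KL(softmax(z_t / T) || softmax(z_s / T)), averaged over the batch.
        p_t = F.softmax(z_t / T, dim=1)           # teacher soft targets
        log_p_s = F.log_softmax(z_s / T, dim=1)   # student log-probabilities
        # F.kl_div expects log-probabilities as input and probabilities as target.
        return F.kl_div(log_p_s, p_t, reduction="batchmean")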

The principle underpinning SLD is the logit swap: if the true class is under-ranked, exchange its logit with that of the highest non-target class. This adjustment ensures correct ordering, preserves “natural” logit capacity, and maintains the distribution over non-target classes.
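As a concrete illustration, the sketch below applies the swap to a single logit vector with made-up values (class indices and numbers are illustrative only); swapping two logits before the softmax is equivalent to swapping the two corresponding probabilities afterwards.

    import torch

    def swap_true_class(z, t):
        # If the true-class logit is not the maximum, exchange it with the top logit.
        z = z.clone()
        n = z.argmax().item()
        if n != t:
            z[t], z[n] = z[n].item(), z[t].item()
        return z

    z = torch.tensor([2.1, 0.3, 3.0, -0.5])   # true class t=0 is under-ranked (class 2 wins)
    print(swap_true_class(z, t=0))            # prints the swapped vector: [3.0, 0.3, 2.1, -0.5]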

2. Formalization of Logit-Swap Operators

Let $z^T, z^S \in \mathbb{R}^C$ denote teacher and student logits, with ground-truth index $t$. For a set of temperatures $\{T_1,\ldots,T_K\}$, multi-temperature softmax distributions $p_j(z;T_k)=\exp(z_j/T_k)/\sum_{c=1}^C \exp(z_c/T_k)$ are constructed ("prediction augmentation," with $K=6$ in the experiments).
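A minimal sketch of this prediction-augmentation step; the specific temperature values below are placeholders, since only $K=6$ is stated here.

    import torch
    import torch.nn.functional as F

    def prediction_augmentation(z, temperatures=(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)):
        # Return one softmax distribution p(z; T_k) per temperature (K = 6 here).
        return [F.softmax(z / T, dim=-1) for T in temperatures]

    z_t = torch.randn(8, 100)                  # e.g. a batch of CIFAR-100 teacher logits
    p_list = prediction_augmentation(z_t)      # list of K tensors, each of shape (8, 100)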

Teacher-Swap

For each $T_k$, compute $p^T_k=\mathrm{softmax}(z^T;T_k)$. If $\arg\max_j p^T_{k,j}\neq t$, let $n=\arg\max_{j\neq t} p^T_{k,j}$ and swap the entries at $t$ and $n$: set $\tilde{p}^T_{k,t}=p^T_{k,n}$, $\tilde{p}^T_{k,n}=p^T_{k,t}$, and $\tilde{p}^T_{k,j}=p^T_{k,j}$ for $j\notin\{t,n\}$. The teacher-swap loss is

$$L_{TS} = \sum_{k=1}^{K} \mathrm{KL}\big(\tilde{p}^T_k \,\big\|\, p^S_k\big).$$
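A batched PyTorch sketch of $L_{TS}$ under these definitions (a sketch only; the temperature values are placeholders, and the swap is applied to the per-temperature probabilities as defined above):

    import torch
    import torch.nn.functional as F

    def swap_rows(p, targets):
        # Per sample: if the predicted class differs from the target, swap the
        # probabilities of the target class and the top-ranked class.
        p = p.clone()
        top = p.argmax(dim=1)
        rows = torch.nonzero(top != targets, as_tuple=True)[0]   # samples needing a swap
        p_true = p[rows, targets[rows]].clone()
        p[rows, targets[rows]] = p[rows, top[rows]]
        p[rows, top[rows]] = p_true
        return p

    def teacher_swap_loss(z_s, z_t, targets, temperatures=(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)):
        # L_TS = sum_k KL( swapped softmax(z_t; T_k) || softmax(z_s; T_k) ).
        loss = 0.0
        for T in temperatures:
            p_t = swap_rows(F.softmax(z_t / T, dim=1), targets)
            log_p_s = F.log_softmax(z_s / T, dim=1)
            loss = loss + F.kl_div(log_p_s, p_t, reduction="batchmean")
        return loss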

Student-Swap (Pseudo-Teacher)

Analogously, for each $T_k$ compute $p^S_k=\mathrm{softmax}(z^S;T_k)$ and swap if $\arg\max_j p^S_{k,j}\neq t$. The student-swap loss is

$$L_{SS} = \sum_{k=1}^{K} \mathrm{KL}\big(\tilde{p}^S_k \,\big\|\, p^S_k\big).$$

$L_{SS}$ acts as an alignment between the student and its self-corrected version, functioning as "pseudo-teacher" supervision.
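The pseudo-teacher term can be sketched analogously, reusing the swap_rows helper from the teacher-swap sketch above; detaching the swapped distribution so it acts as a fixed target is an assumption of this sketch.

    import torch
    import torch.nn.functional as F

    def student_swap_loss(z_s, targets, temperatures=(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)):
        # L_SS = sum_k KL( swapped softmax(z_s; T_k) || softmax(z_s; T_k) ),
        # where the swapped copy is treated as a fixed "pseudo-teacher" target.
        loss = 0.0
        for T in temperatures:
            log_p_s = F.log_softmax(z_s / T, dim=1)
            pseudo = swap_rows(log_p_s.exp(), targets).detach()  # swap_rows: see teacher-swap sketch
            loss = loss + F.kl_div(log_p_s, pseudo, reduction="batchmean")
        return loss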

3. Bi-level Teacher Alignment Mechanism

SLD orchestrates two concurrent teaching signals: (1) from the original teacher with swapped (where necessary) logits ($L_{TS}$), and (2) from the student's own swapped pseudo-teacher ($L_{SS}$). Both signals operate strictly on per-instance logits, requiring no extra feature-level modules. The aggregate distillation loss is scheduled as

$$L_\mathrm{distill}(t) = L_{TS}(t) + \alpha(t)\,L_{SS}(t),$$

where $\alpha(t)=0$ for epochs $t\le \gamma$ and $\alpha(t)=1$ thereafter. The student's overall training loss combines cross-entropy on hard labels with the weighted distillation loss.
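A minimal sketch of this schedule, assuming the teacher_swap_loss and student_swap_loss helpers sketched in Section 2 and the warm-up epoch gamma described in Section 4:

    def alpha(epoch, gamma):
        # alpha(t) = 0 during warm-up (epoch <= gamma), 1 afterwards.
        return 0.0 if epoch <= gamma else 1.0

    def distill_loss(z_s, z_t, targets, epoch, gamma):
        # L_distill(t) = L_TS(t) + alpha(t) * L_SS(t)
        return teacher_swap_loss(z_s, z_t, targets) \
               + alpha(epoch, gamma) * student_swap_loss(z_s, targets)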

4. Loss Scheduling and Training Dynamics

During early epochs, student logits are typically unstable; immediate application of $L_{SS}$ can dominate gradient dynamics detrimentally. SLD mitigates this via gating: $L_{SS}=0$ for $t\le \gamma$ (a warm-up of $\gamma=150$ epochs on CIFAR-100 and $\gamma=30$ on ImageNet), and the self-distillation signal is activated only after this period. This staged introduction ensures stability and mutually consistent teacher/pseudo-teacher alignment.

This approach is crucial; naïve simultaneous use of $L_{SS}$ from initialization degrades accuracy by more than 5% in heterogeneous scenarios. Furthermore, ablation studies reveal that removing either prediction augmentation or loss scheduling substantially compromises performance (a 2–5% drop).

5. Implementation Workflow

The SLD algorithm, as described in PyTorch-style pseudo-code, comprises the following steps (a minimal sketch follows the list):

  1. Clone the teacher ($z^T$) and student ($z^S$) logits.
  2. Identify the highest non-target logit index ($n_T$ for the teacher, $n_S$ for the student); if the true-class logit is not maximal, swap the values at these indices.
  3. For each temperature $T_k$, compute softmax distributions for the original and swapped logits. Accumulate KL divergence losses for the teacher swap ($L_{TS}$) and, after the warm-up epoch, for the student swap ($L_{SS}$).
  4. Combine KL losses with cross-entropy and backpropagate.
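Putting the pieces together, a hedged sketch of one training step following the workflow above; it reuses the teacher_swap_loss and student_swap_loss sketches from Section 2, and the model names, kd_weight knob, and optimizer handling are illustrative assumptions rather than details from the paper.

    import torch
    import torch.nn.functional as F

    def train_step(student, teacher, optimizer, images, targets,
                   epoch, gamma=150, kd_weight=1.0):
        # One SLD update: cross-entropy on hard labels + teacher-swap KD,
        # with the student-swap (pseudo-teacher) term gated by the warm-up epoch
        # (gamma=150 corresponds to the CIFAR-100 setting in Section 4).
        student.train()
        teacher.eval()
        with torch.no_grad():
            z_t = teacher(images)                 # frozen teacher logits
        z_s = student(images)                     # student logits

        loss = F.cross_entropy(z_s, targets)
        loss = loss + kd_weight * teacher_swap_loss(z_s, z_t, targets)
        if epoch > gamma:                         # activate L_SS only after warm-up
            loss = loss + kd_weight * student_swap_loss(z_s, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()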

The following table summarizes operational distinctions among baseline KD variants and SLD:

Method | Logit Manipulation | Self-Distillation | Feature Modules
KD (vanilla) | None | No | No
LS-MLKD | Multi-temperature, LS | No | No
SLD | Teacher + Student Swap | Yes | No

LS-MLKD: Label Smoothing Multi-Temperature KD.

6. Empirical Evaluation and Benchmarks

Experiments are conducted on CIFAR-100 (100 classes) and ImageNet-1K (1000 classes), in both homogeneous (e.g., ResNet → ResNet) and heterogeneous (e.g., ResNet → MobileNet or ShuffleNet) teacher-student configurations. Baselines include KD, DKD, MLKD, LS-MLKD, and the feature-distillation methods CRD, OFD, ReviewKD, and CAT-KD.

Results Overview

  • CIFAR-100 Homogeneous (mean over six pairs): KD: 73.09%, DKD: 74.69%, MLKD: 75.09%, LS-MLKD: 75.44%, SLD: 75.64%.
  • CIFAR-100 Heterogeneous (mean over five pairs): KD: 71.60%, DKD: 74.69%, MLKD: 74.93%, LS-MLKD: 75.15%, SLD: 75.24%.
  • ImageNet-1K:
    • ResNet34 → ResNet18: KD: Top-1/Top-5 = 70.66/89.88; LS-MLKD: 72.08/90.74; SLD: 72.15/90.90.
    • ResNet50 → MobileNetV1: KD: 68.58/88.98; LS-MLKD: 73.22/91.59; SLD: 73.27/91.65.

Training efficiency is at near parity with vanilla KD: SLD incurs zero extra parameters, and per-batch time (CIFAR-100, Res32×4 → Res8×4) is 19 ms vs. 18 ms for KD, 20 ms for DKD, and 25 ms for MLKD.

Ablation confirms performance drops of 2–5% when prediction augmentation or $L_{SS}$ is omitted. Restricting swaps based on the gap between the true-class logit and the top logit does not yield improvements beyond the simple $\arg\max \neq t$ rule.

7. Significance, Limitations, and Extension Prospects

SLD addresses essential vulnerabilities in logit-based distillation, correcting misclassified knowledge transfer through a minimal, distribution-preserving swap operation. The introduction of bi-level supervision, loss scheduling, and prediction augmentation results in consistent empirical superiority with negligible computational overhead. No hyperparameter tuning of swapping thresholds or extra capacity is required.

This suggests that SLD offers a principled mechanism suited to plug-and-play adoption in existing KD pipelines across diverse architectures and tasks. Limitations include potential misalignments when logit distributions are highly ambiguous or when multiple classes share the maximal probability. A plausible implication is that further investigation may yield refinements for multi-class swaps or adaptation to non-classification tasks.

For rigorous implementation and comparative experimentation, see (Limantoro et al., 27 Apr 2025).
