
SE-KD: Entropy-Guided Distillation

Updated 4 February 2026
  • SE-KD is a knowledge distillation paradigm that uses the student model's entropy to selectively guide teacher supervision in sequence modeling tasks.
  • It reduces computational and memory overhead by focusing on high-uncertainty positions, achieving significant efficiency gains such as a 28% drop in GPU memory usage and a 5.76× speedup.
  • The approach extends to multi-axis sparsification across samples, positions, and classes, ensuring robust performance across diverse benchmarks.

Student-Entropy-Guided Position Selection (SE-KD) is a knowledge distillation paradigm that optimizes supervision in sequence modeling tasks by selecting a subset of training positions based on the entropy of the student model’s predictions. It has emerged as a high-efficiency alternative to dense knowledge distillation for both LLMs and, in analogous formulations, spatial context models for image and video compression. By leveraging the student’s own uncertainty as a selection criterion, SE-KD delivers substantial reductions in computational and memory overhead while retaining or improving downstream task performance (Tavor et al., 1 Feb 2026, Tong et al., 3 Aug 2025).

1. Formal Foundations and Student Entropy as Importance Metric

Student-entropy-guided position selection is rooted in the measurement of prediction uncertainty. Given a training sequence $x=(x_1,\dots,x_L)$, at each step $t$ the student model outputs a next-token distribution $q_t(v) = q(v \mid x_{\leq t})$, $v \in \mathcal{V}$. The token-level Shannon entropy is

$$H(q_t) = -\sum_{v \in \mathcal{V}} q_t(v) \log q_t(v)$$

This entropy quantifies the uncertainty of the student's prediction at position $t$. SE-KD uses these entropy scores, often normalized, to select a fraction $k$ of positions per sequence where the student is least certain, hypothesizing that teacher supervision at high-entropy positions is most valuable for improving the student (Tavor et al., 1 Feb 2026).
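As a concrete illustration, the per-position entropy can be computed from raw logits with a numerically stable softmax. This is a minimal sketch, not the paper's implementation; the function name and array layout are our assumptions:

```python
import numpy as np

def token_entropy(logits):
    """Per-position Shannon entropy of the student's softmax distribution.

    logits: array of shape [L, V] (sequence positions x vocabulary).
    Returns an array of shape [L] with entropies in nats.
    """
    # Numerically stable softmax: subtract the per-position max.
    z = logits - logits.max(axis=-1, keepdims=True)
    q = np.exp(z)
    q /= q.sum(axis=-1, keepdims=True)
    # H(q_t) = -sum_v q_t(v) log q_t(v); clipping avoids log(0).
    return -(q * np.log(np.clip(q, 1e-12, None))).sum(axis=-1)
```

A uniform distribution over $V$ classes attains the maximum entropy $\log V$, while a sharply peaked distribution scores near zero, which is exactly the ordering SE-KD exploits.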

2. Algorithmic Procedure for Position Selection

The SE-KD algorithm operates in the following steps within each training batch:

  1. Run a student model forward pass to obtain logits and compute per-position entropies $H_t^{(i)}$.
  2. For each sequence $i$, sort positions $t$ by entropy and mark the top $k_i = \lceil k \cdot (L_i-1) \rceil$ positions as supervised ($m_t^{(i)}=1$).
  3. Optionally, apply sample and class sparsification.
  4. Compute teacher outputs selectively for the marked positions and restricted class set where applicable.
  5. Backpropagate the knowledge distillation (KD) loss only on selected tokens and classes.

Pseudocode:

for i in batch:
    H = entropy(student_logits[i])        # per-position entropies H_t
    idx = argsort(H, descending=True)     # rank positions by uncertainty
    m = zeros_like(H); m[idx[:k_i]] = 1   # supervision mask over top-k_i positions
    # compute teacher outputs/predictions only at positions with m_t == 1
    # compute the KD loss and update on the selected positions (and classes)

Memory and compute cost are dominated by the fraction $k$, and chunked entropy computation ensures that no full $[B, L, V]$ tensor is materialized at once (Tavor et al., 1 Feb 2026).
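The chunking trick can be sketched as follows: process the length axis in slices so that only a $[B, \text{chunk}, V]$ softmax exists at any moment. This is our illustrative sketch of the memory-saving idea, not the paper's code:

```python
import numpy as np

def chunked_entropy(logits, chunk=128):
    """Per-position entropies computed slice-by-slice along the length
    axis, so the softmax over the full [B, L, V] tensor is never held
    in memory at once."""
    B, L, V = logits.shape
    H = np.empty((B, L))
    for s in range(0, L, chunk):
        block = logits[:, s:s + chunk, :]            # [B, <=chunk, V]
        z = block - block.max(axis=-1, keepdims=True)
        q = np.exp(z)
        q /= q.sum(axis=-1, keepdims=True)
        H[:, s:s + chunk] = -(q * np.log(q + 1e-12)).sum(axis=-1)
    return H
```

The result is identical to the unchunked computation; only the peak activation footprint changes, from $O(BLV)$ to $O(B \cdot \text{chunk} \cdot V)$.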

3. Multi-Axis Extensions: Sample, Position, and Class Sparsification

SE-KD can be extended to three orthogonal axes, a variant referred to as SE-KD³ˣ:

  • Sample-axis: sequences are filtered by their average entropy, retaining the top $\ell\%$.
  • Position-axis: within each sequence, the top $k\%$ highest-entropy positions are selected.
  • Class-axis: for each chosen (sequence, position) pair, $U$ classes are sampled from the teacher's distribution, forming a truncated supervision set.
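One plausible realization of the class axis is to draw $U$ distinct classes in proportion to the teacher's probabilities. The exact sampling scheme (with or without replacement, any temperature) is an assumption here, not taken from the source:

```python
import numpy as np

def sample_class_set(teacher_probs, U, rng=None):
    """Draw U distinct classes for one (sequence, position) pair in
    proportion to the teacher's distribution. Hypothetical sketch: the
    paper's actual class-sampling procedure may differ."""
    rng = rng or np.random.default_rng()
    V = teacher_probs.shape[-1]
    return rng.choice(V, size=U, replace=False, p=teacher_probs)
```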

The combined selective KD loss is
$$\mathcal{L} = \frac{1}{\sum_i s_i}\sum_{i=1}^B s_i\, \frac{1}{\sum_t m_t^{(i)}} \sum_{t=1}^{L_i-1} m_t^{(i)}\, \mathrm{KL}_{\mathcal{C}_t^{(i)}}\!\bigl(p_t^{(i)} \,\Vert\, q_t^{(i)}\bigr)$$
where $s_i \in \{0,1\}$ is the sample mask, $m_t^{(i)} \in \{0,1\}$ the position mask, and $\mathcal{C}_t^{(i)}$ the sampled class set. This extension allows fine-grained control over computational efficiency across the dataset and the model's output space (Tavor et al., 1 Feb 2026).
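The masked objective above can be sketched directly in code. For simplicity this sketch computes full-vocabulary KL, whereas class-axis truncation would restrict the last axis to $\mathcal{C}_t^{(i)}$; the function name and signature are our assumptions:

```python
import numpy as np

def selective_kd_loss(p, q, m, s, eps=1e-12):
    """Selective KD loss: KL(p || q) averaged over kept positions per
    sequence, then over kept sequences.

    p, q : [B, L, V] teacher / student probabilities.
    m    : [B, L] 0/1 position mask;  s : [B] 0/1 sample mask.
    """
    kl = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)     # [B, L]
    per_seq = (m * kl).sum(axis=-1) / np.maximum(m.sum(axis=-1), 1)  # position mean
    return (s * per_seq).sum() / np.maximum(s.sum(), 1)              # sample mean
```

Positions with $m_t^{(i)}=0$ and sequences with $s_i=0$ contribute nothing to the loss, so neither teacher outputs nor gradients are needed there.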

4. Hyperparameters, Practical Tuning, and Ablation Results

Key hyperparameters for SE-KD tuning include:

  • $k$ (position budget): optimal values cluster at $k=20\%$, achieving peak accuracy with most of the efficiency gains; performance remains robust for $k$ as low as 1%.
  • $\ell$ (sample budget): typically set at 20%, with minimal effect on accuracy and linear runtime improvements.
  • $U$ (class samples): $U=64$ provides a favorable bias-variance tradeoff for class-axis sampling.
  • $\lambda$ (KD loss weight): pure KL ($\lambda=1$) is favored for efficient gradient backpropagation.
  • Curriculum, window, and queue parameters (baseline comparisons): explicit curriculum shifting and global-percentile baselines do not outperform top-$k$ entropy selection.
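Collected as a configuration sketch (the key names are illustrative, not the paper's actual configuration schema), the reported settings are:

```python
# Reported ablation settings; key names are our own, not the paper's.
sekd_config = {
    "k": 0.20,        # position budget: top-20% highest-entropy positions
    "ell": 0.20,      # sample budget: top-20% of sequences by mean entropy
    "U": 64,          # classes sampled per selected (sequence, position)
    "lambda_kd": 1.0  # pure KL distillation loss
}
```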

Empirical ablations show that student entropy is a strong selection signal for position importance, outperforming teacher entropy and other metrics. Top-20% SE-KD exceeds dense supervision (full KD) in both accuracy and compute/memory efficiency (Tavor et al., 1 Feb 2026).

5. Implementation Complexity and Resource Utilization

SE-KD provides substantial reductions in computational and memory requirements versus dense KD. For batch size $B=2$, sequence length $L=512$, and vocabulary size $V=100\text{K}$ with $k=20\%$:

  • Student peak memory drops from 15.88 GB to 11.42 GB (−28.1%)
  • Teacher peak memory drops by 9.4%
  • Total GPU memory reduction of 18.3%
  • Wall time for 80M tokens: full KD = 22 h 52 m; SE-KD (end-to-end) = 8 h 46 m; with sample selection and offline caching = 3 h 58 m (5.76× speedup)
  • Offline cache storage for 100B tokens: full KD = 10,000 TB; RS-KD ($U=64$) = 19.2 TB; SE-KD ($\ell=20\%$) = 3.84 TB (an 80% reduction over RS-KD and more than three orders of magnitude over dense KD) (Tavor et al., 1 Feb 2026)
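The cache sizes are consistent with back-of-envelope arithmetic under assumed per-entry storage costs (roughly 1 byte per quantized logit for dense caching, and 3 bytes per stored class entry for the sampled variants; these byte counts are our assumptions, not stated in the source):

```python
TOKENS = 100e9   # 100B tokens
V = 100_000      # vocabulary size

# Assumed storage costs (our assumption, not from the source):
# ~1 byte per quantized logit for dense caching, ~3 bytes per
# (class index, probability) entry when only sampled classes are stored.
dense_tb = TOKENS * V * 1 / 1e12   # full KD cache
rskd_tb = TOKENS * 64 * 3 / 1e12   # RS-KD with U = 64
sekd_tb = rskd_tb * 0.20           # SE-KD keeps the top 20% of samples

print(dense_tb, rskd_tb, sekd_tb)  # ≈ 10000.0, 19.2, 3.84 (TB)
```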

These resource savings, particularly when combined across axes in SE-KD³ˣ, make previously prohibitive large-scale distillation manageable.

6. Empirical Performance and Task Generalization

SE-KD achieves higher or comparable performance to full KD across multiple benchmarks:

Method    Avg. Acc. (%)   IFEval (%)   PPL    ECE
No KD     61.9            19.4         12.2   30.5
Full KD   64.4            20.5         7.3    27.3
SE-KD     64.8            21.4         6.9    27.6

(Lower is better for PPL and ECE.)

Task-specific distillation with SE-KD demonstrates robust transferability (GSM8K: off-policy 69.5%, on-policy 70.0%), with multi-axis extensions yielding further improvements when combining position and sample sparsification (Tavor et al., 1 Feb 2026).

7. Comparative Context and Extensions

The concept of entropy-guided position selection aligns closely with analogous approaches in image/video compression, e.g., the dependency-weighted spatial context and entropy maps in the Context Guided Transformer (CGT) model (Tong et al., 3 Aug 2025). In CGT, the teacher’s spatial attention and entropy scores are combined to select decoding positions, while SE-KD in the LLM setting leverages the student’s output entropy as the selection principle. Both frameworks aim to optimize the allocation of high-cost teacher supervision to positions where it is most likely to improve model performance per unit cost.

In summary, student-entropy-guided position selection, and its multi-axis SE-KD³ˣ variant, provides a principled and highly efficient mechanism for selective distillation in large models, resulting in significant reductions in computation, memory use, and storage requirements without sacrificing performance or downstream generality (Tavor et al., 1 Feb 2026, Tong et al., 3 Aug 2025).
