SE-KD: Entropy-Guided Distillation
- SE-KD is a knowledge distillation paradigm that uses the student model's entropy to selectively guide teacher supervision in sequence modeling tasks.
- It reduces computational and memory overhead by focusing on high-uncertainty positions, achieving significant efficiency gains such as a 28% drop in GPU memory usage and a 5.76× speedup.
- The approach extends to multi-axis sparsification across samples, positions, and classes, ensuring robust performance across diverse benchmarks.
Student-Entropy-Guided Position Selection (SE-KD) is a knowledge distillation paradigm that optimizes supervision in sequence modeling tasks by selecting a subset of training positions based on the entropy of the student model’s predictions. It has emerged as a high-efficiency alternative to dense knowledge distillation for both LLMs and, in analogous formulations, spatial context models for image and video compression. By leveraging the student’s own uncertainty as a selection criterion, SE-KD delivers substantial reductions in computational and memory overhead while retaining or improving downstream task performance (Tavor et al., 1 Feb 2026, Tong et al., 3 Aug 2025).
1. Formal Foundations and Student Entropy as Importance Metric
Student-entropy-guided position selection is rooted in the measurement of prediction uncertainty. Given a training sequence $x_{1:T}$, at each step $t$ the student model outputs a next-token distribution $p^{S}_t(v) = p_\theta(v \mid x_{<t})$ over the vocabulary $\mathcal{V}$. The token-level Shannon entropy is

$$H_t = -\sum_{v \in \mathcal{V}} p^{S}_t(v) \log p^{S}_t(v).$$

This entropy quantifies the uncertainty of the student's prediction at position $t$. SE-KD uses these entropy scores—often normalized—to select a fraction $\ell$ of positions per sequence where the student is least certain, on the hypothesis that teacher supervision at high-entropy positions is most valuable for maximizing student model improvement (Tavor et al., 1 Feb 2026).
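The entropy computation above can be sketched directly from raw logits (a minimal NumPy illustration; the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def token_entropies(logits):
    """Shannon entropy H_t of the student's next-token distribution
    at every position, computed from raw logits of shape (T, V)."""
    z = logits - logits.max(axis=-1, keepdims=True)   # stabilize softmax
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)      # shape (T,)

# Uniform logits give the maximum entropy log(V); a sharply peaked
# distribution gives entropy near zero.
H = token_entropies(np.array([[0.0, 0.0, 0.0, 0.0],
                              [10.0, 0.0, 0.0, 0.0]]))
```

Positions like the first row (high entropy) would be prioritized for teacher supervision; positions like the second would be skipped.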
2. Algorithmic Procedure for Position Selection
The SE-KD algorithm operates in the following steps within each training batch:
- Run a student model forward pass to obtain logits and compute per-position entropies $H_t$.
- For each sequence $i$, sort positions by entropy and mark the top $k_i = \lceil \ell T \rceil$ positions as supervised ($m_{i,t} = 1$).
- Optionally, apply sample and class sparsification.
- Compute teacher outputs selectively for the marked positions and restricted class set where applicable.
- Backpropagate the knowledge distillation (KD) loss only on selected tokens and classes.
Pseudocode:
```python
for i in batch:
    H = entropy(student_logits[i])        # per-position student entropy
    idx = argsort(H, descending=True)     # highest-entropy positions first
    m[i, idx[:k_i]] = 1                   # supervise the top-k_i positions
# Compute teacher outputs/predictions only on positions with m == 1
# Compute the KD loss and update on the selected positions (and classes)
```
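The pseudocode above can be fleshed out into a runnable position-axis selection routine (a NumPy sketch under the notation above; `select_positions` and `ell` are illustrative names):

```python
import numpy as np

def select_positions(student_logits, ell=0.2):
    """Return a boolean mask of shape (B, T): True where the student's
    next-token entropy is in the top-ell fraction of its sequence."""
    z = student_logits - student_logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    H = -(p * np.log(p + 1e-12)).sum(axis=-1)     # entropies, shape (B, T)
    B, T = H.shape
    k = max(1, int(np.ceil(ell * T)))             # per-sequence budget k_i
    mask = np.zeros((B, T), dtype=bool)
    top = np.argsort(-H, axis=1)[:, :k]           # highest-entropy positions
    np.put_along_axis(mask, top, True, axis=1)
    return mask

rng = np.random.default_rng(0)
mask = select_positions(rng.normal(size=(2, 10, 5)), ell=0.2)
```

The teacher forward pass and KD loss would then be restricted to positions where `mask` is `True`.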
3. Multi-Axis Extensions: Sample, Position, and Class Sparsification
SE-KD can be extended to three orthogonal axes, a variant referred to as SE-KD³ˣ:
- Sample-axis: Sequences are filtered by their average entropy, retaining only a top fraction (the sample budget).
- Position-axis: Within each sequence, the top-$\ell$ highest-entropy positions are selected.
- Class-axis: For each chosen (sequence, position) pair, $U$ classes are sampled from the teacher's distribution, forming a truncated supervision set.
The combined selective KD loss restricts the distillation objective to the retained samples, positions, and classes:

$$\mathcal{L}_{\text{KD}} = \frac{1}{Z} \sum_{i \in \mathcal{S}} \sum_{t:\, m_{i,t}=1} \sum_{c \in \mathcal{C}_{i,t}} p^{T}_{i,t}(c) \log \frac{p^{T}_{i,t}(c)}{p^{S}_{i,t}(c)},$$

where $\mathcal{S}$ is the retained sample set, $m_{i,t}$ the position mask, $\mathcal{C}_{i,t}$ the sampled class set, and $Z$ a normalizer over the selected tokens. This extension allows fine-grained control over computational efficiency across the dataset and the model's output space (Tavor et al., 1 Feb 2026).
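The three-axis selection can be sketched as a masked, class-truncated KL term (an illustrative NumPy sketch, not the paper's exact implementation; the shapes, normalizer, and function name are assumptions):

```python
import numpy as np

def selective_kd_loss(teacher_p, student_p, sample_mask, pos_mask, classes):
    """Masked KL(teacher || student) over selected samples, positions,
    and a sampled class subset per position. Shapes:
      teacher_p, student_p: (B, T, V) probabilities
      sample_mask: (B,) bool      pos_mask: (B, T) bool
      classes: (B, T, U) int      sampled class indices per position
    """
    pt = np.take_along_axis(teacher_p, classes, axis=-1)      # (B, T, U)
    ps = np.take_along_axis(student_p, classes, axis=-1)
    kl = (pt * np.log((pt + 1e-12) / (ps + 1e-12))).sum(-1)   # (B, T)
    m = sample_mask[:, None] & pos_mask                       # combined mask
    return (kl * m).sum() / max(m.sum(), 1)                   # mean over kept tokens
```

Note that the class-truncated sum is an estimator of the full KL, not the exact divergence; the class-axis budget $U$ controls its variance.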
4. Hyperparameters, Practical Tuning, and Ablation Results
Key hyperparameters for SE-KD tuning include:
- Position budget $\ell$: Optimal values cluster around $\ell = 20\%$, achieving peak accuracy with most of the efficiency gains. Performance remains robust for $\ell$ as low as 1%.
- Sample budget: Typically set at 20%, with minimal effect on accuracy and linear runtime improvements.
- Class-sample count $U$: Governs the bias-variance tradeoff for class-axis sampling.
- KD loss weight: A pure KL objective is favored for efficient gradient backpropagation.
- Curriculum, window, and queue parameters for baseline comparisons: Explicit curriculum shifting and global-percentile baselines do not outperform per-sequence top-$\ell$ entropy selection.
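The advantage of per-sequence top-$\ell$ selection over a global entropy percentile can be illustrated in a few lines: a global threshold can starve low-entropy sequences of supervision entirely, while a per-sequence budget cannot (illustrative NumPy sketch with made-up entropy values):

```python
import numpy as np

H = np.array([[5.0, 4.0, 3.0, 2.0],    # high-entropy sequence
              [0.4, 0.3, 0.2, 0.1]])   # low-entropy sequence

# Global baseline: keep positions above the global 75th entropy percentile.
# Every kept position comes from the first sequence; the second gets none.
global_mask = H >= np.quantile(H, 0.75)

# Per-sequence top-25% (k = 1 of 4): each sequence keeps its most
# uncertain position, so no sequence is left unsupervised.
per_seq = np.zeros_like(H, dtype=bool)
per_seq[np.arange(2), np.argmax(H, axis=1)] = True
```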
Empirical ablations show that student entropy is a strong selection signal for position importance, outperforming teacher entropy and other metrics. Top-20% SE-KD exceeds dense supervision (full KD) in both accuracy and compute/memory efficiency (Tavor et al., 1 Feb 2026).
5. Implementation Complexity and Resource Utilization
SE-KD provides substantial reductions in computational and memory requirements versus dense KD. For a representative configuration of batch size $B$, sequence length $T$, and vocabulary size $V$ with a 20% position budget:
- Student peak memory drops from 15.88 GB to 11.42 GB (−28.1%)
- Teacher peak memory drops by 9.4%
- Total GPU memory reduction of 18.3%
- Wall time for 80M tokens: full KD = 22 h 52 m; SE-KD (end-to-end) = 8 h 46 m; with sample selection and offline caching = 3 h 58 m (5.76× speedup)
- Offline cache storage for 100B tokens: full KD = 10,000 TB; RS-KD (U=64) = 19.2 TB; SE-KD (ℓ=20%) = 3.84 TB (an 80% reduction over RS-KD, nearly 4 orders of magnitude over dense KD) (Tavor et al., 1 Feb 2026)
These resource savings, particularly when combined across axes in SE-KD³ˣ, make previously prohibitive large-scale distillation manageable.
6. Empirical Performance and Task Generalization
SE-KD achieves higher or comparable performance to full KD across multiple benchmarks:
| Method | Avg. Acc. (%) | IFEval (%) | PPL ↓ | ECE ↓ |
|---|---|---|---|---|
| No KD | 61.9 | 19.4 | 12.2 | 30.5 |
| Full KD | 64.4 | 20.5 | 7.3 | 27.3 |
| SE-KD | 64.8 | 21.4 | 6.9 | 27.6 |
Task-specific distillation with SE-KD demonstrates robust transferability (GSM8K: off-policy 69.5%, on-policy 70.0%), with multi-axis extensions yielding further improvements when combining position and sample sparsification (Tavor et al., 1 Feb 2026).
7. Comparative Context and Extensions
The concept of entropy-guided position selection aligns closely with analogous approaches in image/video compression, e.g., the dependency-weighted spatial context and entropy maps in the Context Guided Transformer (CGT) model (Tong et al., 3 Aug 2025). In CGT, the teacher’s spatial attention and entropy scores are combined to select decoding positions, while SE-KD in the LLM setting leverages the student’s output entropy as the selection principle. Both frameworks aim to optimize the allocation of high-cost teacher supervision to positions where it is most likely to improve model performance per unit cost.
In summary, student-entropy-guided position selection and its multi-axis SE-KD³ˣ variant provide a principled and highly efficient mechanism for selective distillation in large models, yielding significant reductions in computation, memory use, and storage requirements without sacrificing performance or downstream generality (Tavor et al., 1 Feb 2026, Tong et al., 3 Aug 2025).