SE-KD 3X: Entropy-Guided Distillation
- SE-KD 3X is a selective knowledge distillation approach that uses the student model’s entropy to target specific token positions, vocabulary classes, and training samples.
- It decomposes the distillation process into three axes—position, class, and sample—thereby reducing computational, memory, and storage overheads.
- Empirical results demonstrate that SE-KD 3X achieves comparable accuracy to full KD while significantly improving training speed and resource efficiency.
SE-KD 3X (“Student-Entropy-guided Knowledge Distillation across three axes”) is a technique for selective knowledge distillation (KD) in autoregressive LLMs, designed to improve efficiency by applying distillation over a subset of token positions, vocabulary classes, and training samples. Guided entirely by the entropy of the student model's predictions, SE-KD 3X achieves substantial reductions in computational, memory, and storage overheads while maintaining the accuracy and downstream task adherence of conventional dense KD (Tavor et al., 1 Feb 2026).
1. Framework and Three-Axis Decomposition
Traditional (Full) KD for LLMs supervises the student distribution to match the teacher distribution at every token position, over the full vocabulary and the entire dataset. SE-KD 3X decomposes the selection of distillation targets along three orthogonal axes, each controlled by a binary indicator:
- Position axis: a binary indicator $m_{i,t} \in \{0, 1\}$ selects whether to supervise position $t$ in sample $i$.
- Class axis: a set $\mathcal{C}_{i,t} \subseteq \mathcal{V}$ designates the subset of vocabulary classes included in the KL-divergence term at position $t$.
- Sample axis: a binary indicator $s_i \in \{0, 1\}$ selects whether sample $i$ is included in KD.
The selective KD loss for a given sample $i$ and position $t$ can be written as

$$\mathcal{L}_{i,t} = s_i \, m_{i,t} \sum_{v \in \mathcal{C}_{i,t}} \tilde{p}_T(v \mid x_{i,<t}) \log \frac{\tilde{p}_T(v \mid x_{i,<t})}{p_S(v \mid x_{i,<t})},$$

where $\tilde{p}_T$ is the teacher distribution restricted (and renormalized) to $\mathcal{C}_{i,t}$, only the selected positions, classes, and samples contribute to the loss, and the overall objective averages over the selections.
2. Student-Entropy-Guided Selection Criteria
SE-KD 3X exclusively uses entropy-based criteria derived from the student model to drive selection along all three axes, removing dependence on teacher-side importance ranking.
- Position selection: The Shannon entropy of the student’s predicted distribution at each position,

  $$H_{i,t} = -\sum_{v \in \mathcal{V}} p_S(v \mid x_{i,<t}) \log p_S(v \mid x_{i,<t}),$$

  is used to score tokens. The top fraction of positions with the highest entropy is selected per sequence via an entropy threshold.
- Sample selection: The average token entropy across each sample is computed:

  $$\bar{H}_i = \frac{1}{T_i} \sum_{t=1}^{T_i} H_{i,t}.$$

  The top fraction of samples by $\bar{H}_i$ is chosen for distillation.
- Class selection: At each selected position $(i, t)$, Random-Sampling KD (RS-KD) samples $K$ classes from the teacher distribution (with $K \ll |\mathcal{V}|$), with indices $v_1, \dots, v_K$, yielding $\mathcal{C}_{i,t} = \{v_1, \dots, v_K\}$. This set forms the support of a sparse target distribution $\tilde{p}_T$ used for supervision; common settings use a $K$ that is orders of magnitude smaller than the vocabulary size.
This purely student-guided mechanism focuses distillation on positions and samples where the student model exhibits uncertainty and limits computation to a sparse subset of classes.
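As an illustration, the position- and sample-selection criteria above can be sketched in a few lines of NumPy. The budget fractions (`frac`) and function names are illustrative placeholders, not the paper's settings:

```python
import numpy as np

def token_entropy(probs):
    """Shannon entropy of each row of a (T, V) matrix of student
    next-token distributions; a small epsilon guards against log(0)."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=-1)

def select_positions(probs, frac=0.25):
    """Keep the top `frac` fraction of positions by student entropy.
    Returns a boolean mask over the T positions."""
    h = token_entropy(probs)
    k = max(1, int(round(frac * len(h))))
    thresh = np.sort(h)[-k]          # entropy threshold for the top-k set
    return h >= thresh

def select_samples(batch_probs, frac=0.5):
    """Rank samples by mean token entropy and keep the top `frac`.
    `batch_probs` is a list of (T_i, V) arrays, one per sample."""
    avg_h = np.array([token_entropy(p).mean() for p in batch_probs])
    k = max(1, int(round(frac * len(avg_h))))
    keep = np.argsort(avg_h)[-k:]    # indices of highest-entropy samples
    return sorted(keep.tolist())
```

A uniform student distribution (maximal uncertainty) is always ranked above a sharply peaked one, so supervision concentrates where the student is least confident.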
3. Multi-Axis Loss Formulation
Combining all selection mechanisms, and specializing to the pure-KL objective, the SE-KD 3X loss is

$$\mathcal{L}_{\text{SE-KD 3X}} = \frac{1}{\sum_i s_i} \sum_i \frac{s_i}{\sum_t m_{i,t}} \sum_t m_{i,t} \, \mathrm{KL}\!\left( \tilde{p}_T(\cdot \mid x_{i,<t}) \,\big\|\, p_S(\cdot \mid x_{i,<t}) \right).$$

This enforces fixed per-sequence and per-batch supervision budgets by normalizing over the selected samples and positions.
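A minimal NumPy sketch of this objective, assuming the sparse teacher target is renormalized over the sampled classes (the function names and data layout are illustrative, not the paper's implementation):

```python
import numpy as np

def sparse_kl(teacher_vals, class_idx, student_probs):
    """KL(p_T~ || p_S) on a sparse class support.
    teacher_vals: teacher probabilities for the K sampled classes,
    class_idx: their vocabulary indices,
    student_probs: the student's full (V,) distribution."""
    p_t = teacher_vals / teacher_vals.sum()      # renormalized sparse target
    p_s = student_probs[class_idx]
    return float(np.sum(p_t * np.log(p_t / (p_s + 1e-12))))

def sekd3x_loss(batch):
    """Average the sparse KL over selected samples and positions only,
    mirroring the per-sample / per-batch normalization of the objective.
    `batch` is a list of samples; each sample is a list of
    (teacher_vals, class_idx, student_probs) tuples for its selected
    positions; deselected samples are empty lists and are skipped."""
    selected = [s for s in batch if s]           # samples with s_i = 1
    per_sample = [np.mean([sparse_kl(*pos) for pos in s]) for s in selected]
    return float(np.mean(per_sample))
```

Because unselected samples and positions never enter the averages, the loss scale stays comparable across batches regardless of how many targets are pruned.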
4. Training Workflow and Practical Implementation
SE-KD 3X operates in two stages. First, candidate samples are scored and selected via a no-gradient student pass. Next, an offline teacher-class cache is constructed: for each selected sample and token, the teacher’s conditional distribution is sparsely sampled along the class axis. Online, the distillation loop consists of:
- Scoring token positions in each batch by the student’s entropy and selecting the highest-entropy fraction.
- Running the student forward pass at selected positions only (using, e.g., selective LM heads and chunked streaming).
- Loading precomputed sparse teacher targets per position from the class cache.
- Computing the KL loss only on the selected tokens, samples, and classes.
- Performing parameter updates.
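The online loop above can be condensed into a single-sequence sketch. The cache layout and `frac` budget are assumptions for illustration; a real implementation would compute the loss on logits and backpropagate through it:

```python
import numpy as np

def distill_step(student_probs, teacher_cache, frac=0.25):
    """One online SE-KD 3X step for a single sequence (sketch).
    student_probs: (T, V) student distributions; teacher_cache maps a
    position index to (class_idx, teacher_vals) precomputed offline.
    Returns the mean sparse KL over the selected positions."""
    # 1. Score positions by student entropy and keep the top fraction.
    h = -np.sum(student_probs * np.log(student_probs + 1e-12), axis=-1)
    k = max(1, int(round(frac * len(h))))
    top = np.argsort(h)[-k:]                     # highest-entropy positions
    losses = []
    for t in top:
        # 2. Load the precomputed sparse teacher target for this position.
        idx, vals = teacher_cache[int(t)]
        p_t = vals / vals.sum()
        p_s = student_probs[t, idx]
        # 3. KL only over the sampled classes.
        losses.append(np.sum(p_t * np.log(p_t / (p_s + 1e-12))))
    return float(np.mean(losses))                # 4. backprop through this in practice
```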
The following table summarizes the three axes and their selection rules:
| Axis | Selection Rule | Typical Budget |
|---|---|---|
| Position | Top token entropy per sequence | a fixed fraction of positions |
| Sample | Top average token entropy | a fixed fraction of the dataset |
| Class | RS-KD, $K$ classes sampled from the teacher distribution | $K$ much smaller than the vocabulary |
5. Empirical Results and Efficiency Gains
Experimental evaluation distilling Qwen3-1.7B from Qwen3-8B on 80 million FineWeb-Edu tokens demonstrates that SE-KD 3X matches the accuracy of Full KD on reasoning (64.4%), LAMBADA perplexity (PPL ≈ 7.3), and instruction-following tasks (Pass@1 ≈ 20.7%), with minimal loss compared to dense supervision.
Notable efficiency improvements include:
- Wall-clock time: Drops from 22 h 52 m (Full KD) to 3 h 58 m with SE-KD 3X, an ~83% reduction (≈5.8× speedup).
- Peak GPU memory: Decreases from 33.18 GB (Full KD) to 27.10 GB (–18.3%).
- Teacher logits storage: Reduces by 99.96% versus Full KD; for 100B tokens, from 10,000 TB to 3.84 TB.
A summary of key outcomes is given below:
| Metric | Full KD | SE-KD 3X | Improvement |
|---|---|---|---|
| Wall time (80M tokens) | 22h 52m | 3h 58m | ≈ –83% |
| Peak GPU memory | 33.18 GB | 27.10 GB | –18.3% |
| Storage (100B tokens) | 10,000 TB | 3.84 TB | –99.96% |
6. Mechanistic Synergies and Offline Caching
Each selection axis yields distinct efficiency benefits:
- Position selection targets high-entropy tokens, thereby guiding supervision to areas where it is expected to be most beneficial and also enabling optimizations such as chunked-streaming and selective LM head instantiation.
- Class sampling (RS-KD) sparsifies each teacher target distribution to a modest subset of $K$ classes, which greatly reduces both storage and teacher compute requirements per token.
- Sample selection (by average student entropy) prunes a majority of samples from KD, providing a linear decrease in wall-clock, cache size, and teacher load.
The complementarity of these axes makes it feasible to build a compact offline cache of teacher targets, storing only sampled classes for selected tokens and samples. This structure allows the main KD loop to avoid repeated teacher forward passes and dramatically improves training throughput and scalability.
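The cache layout can be sketched as one sparse record per selected token. The dtypes and byte counts below are illustrative assumptions, not the paper's exact accounting, but they show why sparse caching is orders of magnitude smaller than storing dense teacher logits:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SparseTeacherEntry:
    """One cached teacher target: K sampled class indices and their
    probabilities, instead of a full vocabulary-sized row."""
    class_idx: np.ndarray   # shape (K,), e.g. int32
    probs: np.ndarray       # shape (K,), e.g. float16

def cache_bytes(n_tokens, k, full_vocab):
    """Rough storage comparison for n_tokens supervised tokens:
    sparse cache (index + probability per sampled class) vs. dense
    fp16 logits over the full vocabulary. Byte sizes are assumptions."""
    sparse = n_tokens * k * (4 + 2)      # int32 index + fp16 probability
    dense = n_tokens * full_vocab * 2    # fp16 logit for every class
    return sparse, dense
```

Sample selection shrinks `n_tokens` itself, so the two axes compound: fewer tokens are cached, and each cached token stores only a handful of classes.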
7. Significance and Implications
SE-KD 3X establishes that selective KD guided solely by the student’s uncertainty can achieve practical reductions in the cost of distilling large LLMs, without relying on teacher-side ranking or dense teacher supervision. By ensuring that only a small subset of samples, positions, and classes receive supervision, it is possible to distill high-quality student models under strict resource budgets and to leverage offline teacher caching at scale. These findings provide a foundation for further research into sparse and student-driven distillation regimes in autoregressive LLMs (Tavor et al., 1 Feb 2026).