SE-KD 3X: Entropy-Guided Distillation

Updated 4 February 2026
  • SE-KD 3X is a selective knowledge distillation approach that uses the student model’s entropy to target specific token positions, vocabulary classes, and training samples.
  • It decomposes the distillation process into three axes—position, class, and sample—thereby reducing computational, memory, and storage overheads.
  • Empirical results demonstrate that SE-KD 3X achieves comparable accuracy to full KD while significantly improving training speed and resource efficiency.

SE-KD 3X (“Student-Entropy-guided Knowledge Distillation across three axes”) is a technique for selective knowledge distillation (KD) in autoregressive LLMs, designed to improve efficiency by applying distillation over a subset of token positions, vocabulary classes, and training samples. Guided entirely by the entropy of the student model's predictions, SE-KD 3X achieves substantial reductions in computational, memory, and storage overheads while maintaining the accuracy and downstream task adherence of conventional dense KD (Tavor et al., 1 Feb 2026).

1. Framework and Three-Axis Decomposition

Traditional (full) KD for LLMs supervises the student distribution $q_t$ to match the teacher distribution $p_t$ at every token position, over the full vocabulary $\mathcal{V}$ and the entire dataset. SE-KD 3X decomposes the selection of distillation targets along three orthogonal axes, each controlled by a binary indicator:

  • Position axis: $m_t^{(i)} \in \{0,1\}$ selects whether to supervise position $t$ in sample $i$.
  • Class axis: $\mathcal{C}_t^{(i)} \subseteq \mathcal{V}$ designates the subset of vocabulary classes included in the KL-divergence term at position $(i, t)$.
  • Sample axis: $s_i \in \{0,1\}$ selects whether sample $i$ is included in KD.

The selective KD loss for a given sample $i$ at position $t$ can be written as

$$\ell_{\mathrm{SKD}}^{(i)}(t) = m_t^{(i)}\bigl[\lambda\,\mathrm{KL}_{\mathcal{C}_t^{(i)}}(p_t \| q_t) + (1-\lambda)\,\mathrm{CE}(y_t, q_t)\bigr]$$

where only the selected positions, classes, and samples contribute to the loss, and the overall objective averages over the selections.
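As a concrete illustration, the per-position loss above can be written in a few lines of Python. This is a toy sketch with dictionary-based distributions; the function name and data representation are illustrative, not the paper's implementation:

```python
import math

def selective_kd_loss_at_t(p_t, q_t, y_t, mask_t, classes_t, lam=0.5):
    """Selective KD loss at one position (illustrative sketch).

    p_t, q_t  : dicts {token_id: probability} for teacher / student
    y_t       : gold next-token id
    mask_t    : position indicator m_t in {0, 1}
    classes_t : selected vocabulary subset C_t
    """
    if mask_t == 0:
        return 0.0  # unselected positions contribute nothing
    # KL restricted to the selected classes: sum_{v in C} p(v) log(p(v)/q(v))
    kl = sum(p_t[v] * math.log(p_t[v] / q_t[v]) for v in classes_t if p_t[v] > 0)
    ce = -math.log(q_t[y_t])  # standard cross-entropy on the gold token
    return lam * kl + (1 - lam) * ce
```

With $\lambda = 1$ the cross-entropy term drops out and only the class-restricted KL on $\mathcal{C}_t^{(i)}$ remains, which is the setting used in the full objective of Section 3.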

2. Student-Entropy-Guided Selection Criteria

SE-KD 3X exclusively uses entropy-based criteria derived from the student model to drive selection along all three axes, removing dependence on teacher-side importance ranking.

  • Position selection: The Shannon entropy of the student’s predicted distribution at each position,

$$H(q_t) = -\sum_{v\in\mathcal{V}} q_t(v) \log q_t(v),$$

is used to score tokens. The top $k\%$ of positions (e.g., $k = 20\%$) with the highest entropy are selected per sequence via a threshold $\tau$.

  • Sample selection: The average token entropy of each sample is computed:

$$U_i = \frac{1}{L_i-1} \sum_{t=1}^{L_i-1} H(q_t).$$

The top $\ell\%$ of samples (typically $\ell = 20\%$) by $U_i$ are chosen for distillation.

  • Class selection: At each selected $(i, t)$, Random-Sampling KD (RS-KD) samples $U$ classes (with $U \ll |\mathcal{V}|$), with indices $v_k \sim p_t$, yielding $\mathcal{C}_t^{(i)}$. This set forms the support of a sparse target distribution used for supervision; common settings use $U = 64$.

This purely student-guided mechanism focuses distillation on positions and samples where the student model exhibits uncertainty and limits computation to a sparse subset of classes.
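The three selection rules are simple to state in code. The sketch below (pure Python, list-based distributions, hypothetical function names) shows entropy scoring, top-$k\%$ position selection, and RS-KD class sampling:

```python
import math
import random

def entropy(q):
    """Shannon entropy H(q) = -sum q(v) log q(v), skipping zero-probability classes."""
    return -sum(p * math.log(p) for p in q if p > 0)

def select_positions(student_dists, k_frac=0.2):
    """Keep the top k% highest-entropy positions of one sequence (a sketch)."""
    scores = [entropy(q) for q in student_dists]
    n_keep = max(1, int(round(k_frac * len(scores))))
    keep = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)[:n_keep]
    return sorted(keep)

def sample_classes(p_t, U=4, rng=None):
    """RS-KD class axis: draw U class indices v_k ~ p_t (with replacement)."""
    rng = rng or random.Random(0)
    return rng.choices(range(len(p_t)), weights=p_t, k=U)
```

Because all three criteria read only the student's output distribution, no teacher forward pass is needed at selection time.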

3. Multi-Axis Loss Formulation

Combining all selection mechanisms, and specializing to $\lambda = 1$ (pure KL), the SE-KD 3X objective is

$$\mathcal{L}_{\mathrm{SE\text{-}KD3X}} = \frac{1}{\sum_i s_i}\sum_{i=1}^{|\mathcal{D}|} s_i \left[\frac{1}{\sum_t m_t^{(i)}}\sum_{t=1}^{L_i-1} m_t^{(i)}\, \mathrm{KL}_{\mathcal{C}_t^{(i)}}(p_t \| q_t) \right]$$

This enforces fixed per-sequence and per-batch supervision budgets by normalizing over selected samples and positions.
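Given precomputed per-position KL values, the double normalization is a short reduction. The sketch below (illustrative names, $\lambda = 1$) mirrors the objective term by term:

```python
def sekd3x_objective(kl_terms, pos_masks, sample_masks):
    """Multi-axis objective with lambda = 1 (pure KL); kl_terms[i][t] holds the
    class-restricted KL at position t of sample i (illustrative sketch)."""
    total = 0.0
    for i, s_i in enumerate(sample_masks):
        if not s_i:
            continue  # sample axis: skip unselected samples
        m = pos_masks[i]
        # position axis: average KL over the selected positions only
        total += sum(mt * kl for mt, kl in zip(m, kl_terms[i])) / sum(m)
    return total / sum(sample_masks)  # normalize by number of selected samples
```

Each selected sample contributes its mean KL over selected positions, so a sequence with many supervised tokens does not dominate the batch.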

4. Training Workflow and Practical Implementation

SE-KD 3X operates in two stages. First, candidate samples are scored and selected via a no-gradient student pass. Next, an offline teacher-class cache is constructed: for each selected sample and token, the teacher’s conditional distribution is sparsely sampled along the class axis. Online, the distillation loop consists of:

  1. Scoring token positions in each batch by the student’s entropy and selecting the top $k\%$.
  2. Running the student forward pass at selected positions only (using, e.g., selective LM heads and chunked streaming).
  3. Loading precomputed sparse teacher targets per position from the class cache.
  4. Computing the KL loss only on the selected tokens, samples, and classes.
  5. Performing parameter updates.
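Steps 1–4 of this loop can be condensed into a toy step function. The names and the cache format (position → list of sampled (class_id, prob) pairs) are assumptions for illustration, and the parameter update of step 5 is omitted:

```python
import math

def online_kd_step(batch_student_dists, teacher_cache, k_frac=0.5):
    """One online distillation step (toy sketch; cache format is assumed:
    teacher_cache[t] -> list of (class_id, prob) pairs for position t)."""
    # 1) score positions by student entropy and keep the top k%
    H = [-sum(p * math.log(p) for p in q if p > 0) for q in batch_student_dists]
    n_keep = max(1, int(round(k_frac * len(H))))
    selected = sorted(sorted(range(len(H)), key=lambda t: H[t], reverse=True)[:n_keep])
    # 2-4) compute KL only at selected positions, on cached sparse teacher targets
    loss = 0.0
    for t in selected:
        q = batch_student_dists[t]
        loss += sum(p * math.log(p / q[v]) for v, p in teacher_cache[t] if p > 0)
    return loss / len(selected), selected
```

Note that the teacher never runs in this loop: its sparse targets were precomputed offline, which is the source of the wall-clock savings reported below.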

The following table summarizes the three axes and their selection rules:

| Axis | Selection Rule | Typical Budget |
|------|----------------|----------------|
| Position | Top-$k\%$ token entropy | $k = 20\%$ |
| Sample | Top-$\ell\%$ average entropy | $\ell = 20\%$ |
| Class | RS-KD, $U$ samples from $p_t$ | $U = 64$ |

5. Empirical Results and Efficiency Gains

Experimental evaluation distilling Qwen3-1.7B from Qwen3-8B on 80 million FineWeb-Edu tokens shows that SE-KD 3X matches Full KD on reasoning accuracy (64.4%), LAMBADA perplexity (PPL ≈ 7.3), and instruction following (Pass@1 ≈ 20.7%), with minimal loss relative to dense supervision.

Notable efficiency improvements include:

  • Wall-clock time: Reduces from 22 h 52 m (Full KD) to 3 h 58 m with SE-KD 3X (an ≈83% reduction, i.e., roughly a 5.8× speedup).
  • Peak GPU memory: Decreases from 33.18 GB (Full KD) to 27.10 GB (–18.3%).
  • Teacher logits storage: Reduces by 99.96% versus Full KD; for 100B tokens, from 10,000 TB to 3.84 TB.

A summary of key outcomes is given below:

| Metric | Full KD | SE-KD 3X | Improvement |
|--------|---------|----------|-------------|
| Wall time (80M tokens) | 22 h 52 m | 3 h 58 m | −82.7% |
| Peak GPU memory | 33.18 GB | 27.10 GB | −18.3% |
| Storage (100B tokens) | 10,000 TB | 3.84 TB | −99.96% |
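The percentage improvements follow directly from the reported absolute figures; a quick arithmetic check:

```python
# Derive the percentage improvements from the reported absolute figures.
full_wall_min = 22 * 60 + 52        # Full KD wall time: 22 h 52 m
sekd_wall_min = 3 * 60 + 58         # SE-KD 3X wall time: 3 h 58 m
full_mem_gb, sekd_mem_gb = 33.18, 27.10
full_tb, sekd_tb = 10_000.0, 3.84   # teacher-logit storage at 100B tokens

wall_cut = 1 - sekd_wall_min / full_wall_min   # fraction of wall time saved
mem_cut = 1 - sekd_mem_gb / full_mem_gb
store_cut = 1 - sekd_tb / full_tb

print(f"wall -{wall_cut:.1%}, memory -{mem_cut:.1%}, storage -{store_cut:.2%}")
# → wall -82.7%, memory -18.3%, storage -99.96%
```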

6. Mechanistic Synergies and Offline Caching

Each selection axis yields distinct efficiency benefits:

  • Position selection targets high-entropy tokens, thereby guiding supervision to areas where it is expected to be most beneficial and also enabling optimizations such as chunked-streaming and selective LM head instantiation.
  • Class sampling (RS-KD) sparsifies each distribution to a modest subset ($U \ll |\mathcal{V}|$), which greatly reduces both storage and teacher compute requirements per token.
  • Sample selection (by average student entropy) prunes a majority of samples from KD, providing a linear decrease in wall-clock, cache size, and teacher load.

The complementarity of these axes makes it feasible to build a compact offline cache of teacher targets, storing only sampled classes for selected tokens and samples. This structure allows the main KD loop to avoid repeated teacher forward passes and dramatically improves training throughput and scalability.
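A minimal sketch of such a cache builder is shown below. The format and names are assumed for illustration; the paper's on-disk layout may differ:

```python
import random

def build_teacher_cache(teacher_dists, selected, U=4, seed=0):
    """Offline cache sketch: for each selected position, store only U sampled
    (class_id, teacher_prob) pairs instead of the dense distribution.
    (Structure is illustrative, not the paper's exact layout.)"""
    rng = random.Random(seed)
    cache = {}
    for t in selected:
        p = teacher_dists[t]
        # RS-KD: sample U class indices v_k ~ p_t, then store unique (id, prob) pairs
        ids = rng.choices(range(len(p)), weights=p, k=U)
        cache[t] = [(v, p[v]) for v in sorted(set(ids))]
    return cache
```

Because only selected positions of selected samples are cached, and each entry holds $U$ pairs rather than $|\mathcal{V}|$ logits, the cache shrinks by the product of the three axes' budgets.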

7. Significance and Implications

SE-KD 3X establishes that selective KD guided solely by the student’s uncertainty can achieve practical reductions in the cost of distilling large LLMs, without relying on teacher-side ranking or dense teacher supervision. By ensuring that only a small subset of samples, positions, and classes receive supervision, it is possible to distill high-quality student models under strict resource budgets and to leverage offline teacher caching at scale. These findings provide a foundation for further research into sparse and student-driven distillation regimes in autoregressive LLMs (Tavor et al., 1 Feb 2026).
