Identify effective token-importance signals and position-selection policies for selective KD in autoregressive LLMs

Determine which token-level importance signals most reliably identify positions that benefit from logit-based knowledge distillation in autoregressive large language models, and characterize how different position-selection policies interact with these signals to yield an effective distillation curriculum.

Background

Knowledge distillation for autoregressive LLMs typically matches the teacher’s next-token distribution at every position, but recent work suggests that selectively supervising a subset of positions can improve performance. A variety of token-importance signals (such as entropy, cross-entropy, and teacher–student divergence) and position-selection policies (such as top-k selection, global-level selection, curriculum scheduling, and stochastic sampling) have been proposed, yet their comparative effectiveness and interactions remain uncertain.
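The signals and policies named above can be made concrete with a small sketch. The code below is illustrative only, not the paper's method: the function names, the choice of KL(teacher‖student) as the divergence signal, and the top-k and score-proportional sampling policies are assumptions for exposition.

```python
import numpy as np

def softmax(logits, axis=-1):
    # Numerically stable softmax over the vocabulary dimension.
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def token_importance(teacher_logits, student_logits, signal="kl"):
    """Per-position importance scores for selective distillation.

    teacher_logits, student_logits: (seq_len, vocab_size) arrays of
    next-token logits at each position. Returns a (seq_len,) score vector.
    """
    eps = 1e-12
    p = softmax(teacher_logits)   # teacher next-token distribution
    q = softmax(student_logits)   # student next-token distribution
    if signal == "teacher_entropy":
        return -(p * np.log(p + eps)).sum(-1)
    if signal == "student_entropy":
        return -(q * np.log(q + eps)).sum(-1)
    if signal == "kl":
        # Teacher-student divergence KL(p || q), one scalar per position.
        return (p * (np.log(p + eps) - np.log(q + eps))).sum(-1)
    raise ValueError(f"unknown signal: {signal}")

def select_positions(scores, policy="topk", k=4, rng=None):
    """Choose which positions receive the distillation loss."""
    if policy == "topk":
        # Deterministic: keep the k highest-scoring positions.
        return np.sort(np.argsort(scores)[-k:])
    if policy == "stochastic":
        # Sample k positions without replacement, proportional to score.
        rng = rng if rng is not None else np.random.default_rng(0)
        probs = scores / scores.sum()
        return np.sort(rng.choice(len(scores), size=k, replace=False, p=probs))
    raise ValueError(f"unknown policy: {policy}")
```

A curriculum schedule, in this framing, would simply vary `k` (or the policy) over training steps, e.g. starting with many supervised positions and annealing toward a sparse top-k set.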

This paper frames selective knowledge distillation along multiple axes and focuses on the position axis, emphasizing the need to rigorously determine which uncertainty or discrepancy signals best identify high-value tokens for distillation, and how selection policies shape the resulting training curriculum. The authors note that this open question motivates their systematic analysis and their development of student-entropy-guided selection methods.

References

"Yet, it remains unclear which token-importance signals most reliably identify positions that benefit from logit-based distillation in LLMs, and how different position-selection policies interact with these signals to shape an effective distillation curriculum."

Rethinking Selective Knowledge Distillation  (2602.01395 - Tavor et al., 1 Feb 2026) in Section 1 (Introduction), second paragraph