- The paper introduces TIP, a framework that decomposes token importance by analyzing student entropy and teacher–student divergence in on-policy distillation.
- It reveals that overconfident errors (Q3 tokens) carry dense corrective signal that entropy-only methods overlook, limiting learning efficiency.
- Empirical results across multiple LLM architectures demonstrate that selecting high-impact tokens can reduce memory usage by up to 47% while nearly matching full-token supervision.
TIP: Token Importance in On-Policy Distillation
Introduction
This work provides a rigorous analysis of token-level supervision in on-policy distillation (OPD) for LLMs. The authors introduce TIP (Token Importance in on-Policy distillation), a framework that systematically decomposes token importance during student model training into two principal axes: student entropy and teacher–student divergence. By examining which tokens impart maximal learning signal in OPD, the paper establishes a taxonomy that organizes these positions and presents empirical and theoretical evidence for why commonly used entropy-based selection methods are incomplete. TIP motivates and formalizes a type-aware token selection strategy that is shown to improve the efficiency and effectiveness of distillation procedures, particularly under resource constraints.
TIP Taxonomy and Theoretical Analysis
The core insight is that not all token positions contribute equally to the learning signal during OPD. The paper observes that informative tokens are concentrated in:
- High-entropy positions: Here, the student is uncertain; these are typically easy to detect and are retained by entropy-based methods.
- Low-entropy, high-divergence positions: The student is confident but misaligned with the teacher (i.e., overconfident and wrong); these tokens encode dense corrective signal but are systematically missed by entropy-only selection.
The TIP taxonomy categorizes tokens along these two axes into four quadrants (a minimal classification sketch follows the list):
- Q1: High entropy, high divergence—student is uncertain and mistaken.
- Q2: High entropy, low divergence—student is uncertain but mostly aligned with the teacher.
- Q3: Low entropy, high divergence—student is confidently wrong (overconfident errors).
- Q4: Low entropy, low divergence—student is confidently correct; negligible learning signal.
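The sketch below illustrates how the quadrant labels could be assigned from per-token student entropy and a reverse-KL teacher–student divergence (the loss the paper assumes). The median thresholds are an illustrative choice, not necessarily the paper's, and the function names are ours.

```python
# Minimal sketch of TIP-style quadrant assignment (not the authors' code).
import torch
import torch.nn.functional as F

def token_scores(student_logits, teacher_logits):
    """logits: [T, V]. Returns per-token (entropy, reverse_kl), each of shape [T]."""
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    p_s = log_p_s.exp()
    entropy = -(p_s * log_p_s).sum(-1)                 # student uncertainty
    reverse_kl = (p_s * (log_p_s - log_p_t)).sum(-1)   # KL(student || teacher)
    return entropy, reverse_kl

def assign_quadrants(entropy, divergence):
    """Returns integer labels 1..4 for Q1..Q4 (thresholds: batch medians, assumed)."""
    high_ent = entropy >= entropy.median()
    high_div = divergence >= divergence.median()
    q = torch.empty_like(entropy, dtype=torch.long)
    q[ high_ent &  high_div] = 1  # Q1: uncertain and mistaken
    q[ high_ent & ~high_div] = 2  # Q2: uncertain but mostly aligned
    q[~high_ent &  high_div] = 3  # Q3: confidently wrong (overconfident errors)
    q[~high_ent & ~high_div] = 4  # Q4: confidently correct
    return q
```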
Theoretical results include:
- A token-weighting bound showing that Q1 > Q2 > Q3 ≫ Q4 in terms of expected effective learning signal.
- The provable limitation that any entropy-only criterion is structurally blind to Q3 (overconfident error) tokens.
- A parameter-free Soft-OR score that combines normalized entropy and divergence, capturing both high-entropy and overconfident-error tokens and tracking the oracle token ranking (one possible instantiation is sketched after this list).
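The exact combination rule is not reproduced here; a natural parameter-free instantiation of a "soft OR" over per-batch min-max-normalized scores is the probabilistic-OR form below. Treat it as a hedged sketch, not the paper's verbatim formula.

```python
# Sketch of a Soft-OR token-importance score over normalized entropy and divergence.
import torch

def minmax_normalize(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Per-batch min-max normalization to [0, 1] (as described in the paper)."""
    return (x - x.min()) / (x.max() - x.min() + eps)

def soft_or_score(entropy: torch.Tensor, divergence: torch.Tensor) -> torch.Tensor:
    """High whenever EITHER normalized entropy or normalized divergence is high."""
    e = minmax_normalize(entropy)
    d = minmax_normalize(divergence)
    return 1.0 - (1.0 - e) * (1.0 - d)  # probabilistic OR; assumed combination rule
```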
Empirical Validation
Mathematical Reasoning
Experiments span multiple architectures and teacher–student pairs (Qwen3, Llama, Qwen2.5), focusing on math benchmarks (MATH-500, AIME 2024/2025):
- Retaining only the 50% of tokens with the highest entropy matches or exceeds all-token supervision while reducing memory by up to 47%. More aggressive selection (down to 20% or 10% retention) degrades performance only marginally, demonstrating that token selection can substantially improve memory efficiency (see the selection sketch after this list).
- Training only on isolated Q3 tokens (overconfident errors, under 10% of all tokens) nearly matches full-token distillation, empirically confirming both their dense corrective value and the inadequacy of entropy-only routing, which drops these tokens by design.
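A retention-ratio ablation of this kind can be expressed as a simple top-k mask over any per-token score (entropy, Soft-OR, etc.); the masking mechanics below are our assumption about the implementation, while the ratios match those reported above.

```python
# Sketch of retention-ratio token selection: keep only the top-scoring fraction
# of tokens and train on those positions alone.
import torch

def retention_mask(scores: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """scores: [T] per-token scores. Boolean mask keeping the top keep_ratio fraction."""
    k = max(1, int(keep_ratio * scores.numel()))
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[torch.topk(scores, k).indices] = True
    return mask

# Example ablations (using the earlier sketches):
#   mask_entropy50 = retention_mask(entropy, 0.5)                # entropy-only, 50% retention
#   mask_q3_only   = assign_quadrants(entropy, reverse_kl) == 3  # train on Q3 tokens alone
```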
The Soft-OR score, which selects a token if either its entropy or its divergence is high, consistently outperforms entropy-only selection across all model pairs and datasets. For example, with Qwen3-8B→4B, Soft-OR at 50% retention achieves higher accuracy than both entropy-only 50% and the all-token baseline. Conversely, training on the lowest-scoring (Q4) tokens causes performance to collapse, supporting the theoretical ordering.
Teacher entropy was found to be uninformative: distributions are nearly deterministic for all teacher models, and adaptive weightings based on teacher entropy give no measurable improvement.
Agentic Planning
On DeepPlanning, an agentic planning benchmark, the taxonomy generalizes:
- Entropy-only selection matches or exceeds full-token OPD at 50% retention.
- Most notably, training only on Q3 tokens (20% retention) significantly outperforms full-token training, highlighting the special importance of overconfident corrections when a single decision can invalidate an entire output trajectory.
Practical and Theoretical Implications
The findings imply that:
- Token-efficient distillation is not only feasible but can be systematically improved by augmenting entropy-based selection with teacher–student divergence.
- Overconfidence detection (Q3) is essential for correcting systematic student failures, and its neglect by entropy-only methods can substantially limit the efficacy of student training.
- The TIP framework and the associated Soft-OR score add essentially no extra computation during standard autograd-based OPD: they require only in-batch normalization and top-k selection over quantities already computed for the loss, making them easy to integrate into practical pipelines (see the sketch after this list).
- The TIP analysis is general and applies naturally to RLHF, process reward fine-tuning, and speculative decoding settings, wherever token-level online supervision is available.
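The sketch below shows how such selection could drop into a standard OPD training step: the per-token reverse KL is already computed for the loss, so the only added work is score normalization and a top-k mask. The function name, reduction, and Soft-OR form are assumptions for illustration, not the authors' code.

```python
# Sketch of a masked on-policy distillation loss with Soft-OR token selection.
import torch
import torch.nn.functional as F

def masked_opd_loss(student_logits, teacher_logits, keep_ratio=0.5):
    """logits: [T, V] over student-sampled tokens; returns a scalar loss over
    only the selected (high Soft-OR) positions."""
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    p_s = log_p_s.exp()
    per_token_kl = (p_s * (log_p_s - log_p_t)).sum(-1)   # reverse KL, already needed for OPD
    entropy = -(p_s * log_p_s).sum(-1)

    # Extra work: in-batch min-max normalization + Soft-OR (detached so selection
    # itself carries no gradient), then a top-k retention mask.
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-8)
    score = 1.0 - (1.0 - norm(entropy.detach())) * (1.0 - norm(per_token_kl.detach()))
    k = max(1, int(keep_ratio * score.numel()))
    keep = torch.zeros_like(score, dtype=torch.bool)
    keep[torch.topk(score, k).indices] = True

    return per_token_kl[keep].mean()
```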
Limitations and Open Questions
- Q3 detection is contingent on access to teacher output distributions, but this overhead is marginal since divergence is computed during the standard OPD loss.
- The normalization strategy employed (min-max per batch) could potentially be improved, e.g., by smoothing or using alternative statistics to address outlier sensitivity.
- The entire analysis assumes a reverse KL loss; whether the taxonomy and ordering persist under alternative losses (e.g., forward KL, JSD; per-token forms sketched below) remains open.
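For reference, the standard per-token forms of these divergences are given below; this snippet only states the definitions and does not claim anything about how TIP's ordering behaves under them.

```python
# Per-token reverse KL, forward KL, and Jensen-Shannon divergence between
# student and teacher next-token distributions.
import torch
import torch.nn.functional as F

def per_token_divergences(student_logits, teacher_logits):
    """logits: [T, V]. Returns (reverse_kl, forward_kl, jsd), each of shape [T]."""
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    p_s, p_t = log_p_s.exp(), log_p_t.exp()
    reverse_kl = (p_s * (log_p_s - log_p_t)).sum(-1)   # KL(student || teacher)
    forward_kl = (p_t * (log_p_t - log_p_s)).sum(-1)   # KL(teacher || student)
    log_m = (0.5 * (p_s + p_t)).clamp_min(1e-12).log()  # mixture for JSD
    jsd = 0.5 * (p_s * (log_p_s - log_m)).sum(-1) + 0.5 * (p_t * (log_p_t - log_m)).sum(-1)
    return reverse_kl, forward_kl, jsd
```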
Conclusion
TIP offers an actionable framework for identifying and leveraging the most informative tokens for student model supervision in on-policy distillation. It demonstrates both theoretically and empirically that combining entropy and teacher–student divergence enables more effective and resource-efficient training, with robust improvements across reasoning and agentic planning tasks. The framework sets a foundation for further advances in targeted distillation and training signal allocation, with broad applicability to state-of-the-art LLM alignment and efficiency methods.
Reference: "TIP: Token Importance in On-Policy Distillation" (2604.14084)