- The paper introduces TIP, a framework that decomposes token importance by analyzing student entropy and teacher–student divergence in on-policy distillation.
- It reveals that overconfident errors (Q3 tokens) carry dense corrective signal that entropy-only methods overlook, limiting learning efficiency.
- Empirical results across multiple LLM architectures demonstrate that selecting high-impact tokens can reduce memory usage by up to 47% while nearly matching full-token supervision.
TIP: Token Importance in On-Policy Distillation
Introduction
This work provides a rigorous analysis of token-level supervision in on-policy distillation (OPD) for LLMs. The authors introduce TIP (Token Importance in on-Policy distillation), a framework that systematically decomposes token importance during student model training into two principal axes: student entropy and teacher–student divergence. By examining which tokens impart maximal learning signal in OPD, the paper establishes a taxonomy that organizes these positions and presents empirical and theoretical evidence for why commonly used entropy-based selection methods are incomplete. TIP motivates and formalizes a type-aware token selection strategy that is shown to improve the efficiency and effectiveness of distillation procedures, particularly under resource constraints.
TIP Taxonomy and Theoretical Analysis
The core insight is that not all token positions contribute equally to the learning signal during OPD. The paper observes that informative tokens are concentrated in:
- High-entropy positions: Here, the student is uncertain; these are typically easy to detect and are retained by entropy-based methods.
- Low-entropy, high-divergence positions: The student is confident but misaligned with the teacher (i.e., overconfident and wrong); these tokens encode dense corrective signal but are systematically missed by entropy-only selection.
The TIP taxonomy categorizes tokens along these two axes into four quadrants (a minimal classification sketch follows the list):
- Q1: High entropy, high divergence—student is uncertain and mistaken.
- Q2: High entropy, low divergence—student is uncertain but mostly aligned with the teacher.
- Q3: Low entropy, high divergence—student is confidently wrong (overconfident errors).
- Q4: Low entropy, low divergence—student is confidently correct; negligible learning signal.
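The sketch below illustrates how the quadrant labels could be assigned from per-token student entropy and a reverse-KL teacher–student divergence (the loss the paper assumes). The median thresholds are an illustrative choice, not necessarily the paper's, and the function names are ours.

```python
# Minimal sketch of TIP-style quadrant assignment (not the authors' code).
import torch
import torch.nn.functional as F

def token_scores(student_logits, teacher_logits):
    """logits: [T, V]. Returns per-token (entropy, reverse_kl), each of shape [T]."""
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    p_s = log_p_s.exp()
    entropy = -(p_s * log_p_s).sum(-1)                 # student uncertainty
    reverse_kl = (p_s * (log_p_s - log_p_t)).sum(-1)   # KL(student || teacher)
    return entropy, reverse_kl

def assign_quadrants(entropy, divergence):
    """Returns integer labels 1..4 for Q1..Q4 (thresholds: batch medians, assumed)."""
    high_ent = entropy >= entropy.median()
    high_div = divergence >= divergence.median()
    q = torch.empty_like(entropy, dtype=torch.long)
    q[ high_ent &  high_div] = 1  # Q1: uncertain and mistaken
    q[ high_ent & ~high_div] = 2  # Q2: uncertain but mostly aligned
    q[~high_ent &  high_div] = 3  # Q3: confidently wrong (overconfident errors)
    q[~high_ent & ~high_div] = 4  # Q4: confidently correct
    return q
```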
Theoretical results include:
- A token-weighting bound showing that Q1 > Q2 > Q3 ≫ Q4 in terms of expected effective learning signal.
- The provable limitation that any entropy-only criterion is structurally blind to Q3 (overconfident error) tokens.
- A parameter-free Soft-OR score that combines normalized entropy and divergence, capturing both high-entropy and overconfident-error tokens and tracking the oracle token ranking (one possible instantiation is sketched after this list).
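The exact combination rule is not reproduced here; a natural parameter-free instantiation of a "soft OR" over per-batch min-max-normalized scores is the probabilistic-OR form below. Treat it as a hedged sketch, not the paper's verbatim formula.

```python
# Sketch of a Soft-OR token-importance score over normalized entropy and divergence.
import torch

def minmax_normalize(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Per-batch min-max normalization to [0, 1] (as described in the paper)."""
    return (x - x.min()) / (x.max() - x.min() + eps)

def soft_or_score(entropy: torch.Tensor, divergence: torch.Tensor) -> torch.Tensor:
    """High whenever EITHER normalized entropy or normalized divergence is high."""
    e = minmax_normalize(entropy)
    d = minmax_normalize(divergence)
    return 1.0 - (1.0 - e) * (1.0 - d)  # probabilistic OR; assumed combination rule
```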
Empirical Validation
Mathematical Reasoning
Experiments span multiple architectures and teacher–student pairs (Qwen3, Llama, Qwen2.5), focusing on math benchmarks (MATH-500, AIME 2024/2025):
- Retaining only the 50% of tokens with the highest entropy matches or exceeds all-token supervision while reducing memory by up to 47%. More aggressive selection (down to 20% or 10% retention) degrades performance only marginally, demonstrating that token selection can substantially improve memory efficiency (see the selection sketch after this list).
- Training only on isolated Q3 tokens (overconfident errors, under 10% of all tokens) nearly matches full-token distillation, empirically confirming both their dense corrective value and the inadequacy of entropy-only routing, which drops these tokens by design.
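A retention-ratio ablation of this kind can be expressed as a simple top-k mask over any per-token score (entropy, Soft-OR, etc.); the masking mechanics below are our assumption about the implementation, while the ratios match those reported above.

```python
# Sketch of retention-ratio token selection: keep only the top-scoring fraction
# of tokens and train on those positions alone.
import torch

def retention_mask(scores: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """scores: [T] per-token scores. Boolean mask keeping the top keep_ratio fraction."""
    k = max(1, int(keep_ratio * scores.numel()))
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask[torch.topk(scores, k).indices] = True
    return mask

# Example ablations (using the earlier sketches):
#   mask_entropy50 = retention_mask(entropy, 0.5)                # entropy-only, 50% retention
#   mask_q3_only   = assign_quadrants(entropy, reverse_kl) == 3  # train on Q3 tokens alone
```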
The Soft-OR score, which selects a token if either its entropy or its divergence is high, consistently outperforms entropy-only selection across all model pairs and datasets. For example, with Qwen3-8B→4B, Soft-OR at 50% retention achieves higher accuracy than both entropy-only 50% and the all-token baseline. Conversely, training on the lowest-scoring (Q4) tokens causes performance to collapse, supporting the theoretical ordering.
Teacher entropy was found to be uninformative: distributions are nearly deterministic for all teacher models, and adaptive weightings based on teacher entropy give no measurable improvement.
Agentic Planning
On DeepPlanning, an agentic planning benchmark, the taxonomy generalizes:
- Entropy-only selection matches or exceeds full-token OPD at 50% retention.
- Most notably, training only on Q3 tokens (20% retention) significantly outperforms full-token training, highlighting the special importance of overconfident corrections when a single decision can invalidate an entire output trajectory.
Practical and Theoretical Implications
The findings imply that:
- Token-efficient distillation is not only feasible but can be systematically improved by augmenting entropy-based selection with teacher–student divergence.
- Overconfidence detection (Q3) is essential for correcting systematic student failures, and its neglect by entropy-only methods can substantially limit the efficacy of student training.
- The TIP framework and the associated Soft-OR score add essentially no extra computation during standard autograd-based OPD: they require only in-batch normalization and top-k selection over quantities already computed for the loss, making them easy to integrate into practical pipelines (see the sketch after this list).
- The TIP analysis is general and applies naturally to RLHF, process reward fine-tuning, and speculative decoding settings, wherever token-level online supervision is available.
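The sketch below shows how such selection could drop into a standard OPD training step: the per-token reverse KL is already computed for the loss, so the only added work is score normalization and a top-k mask. The function name, reduction, and Soft-OR form are assumptions for illustration, not the authors' code.

```python
# Sketch of a masked on-policy distillation loss with Soft-OR token selection.
import torch
import torch.nn.functional as F

def masked_opd_loss(student_logits, teacher_logits, keep_ratio=0.5):
    """logits: [T, V] over student-sampled tokens; returns a scalar loss over
    only the selected (high Soft-OR) positions."""
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    p_s = log_p_s.exp()
    per_token_kl = (p_s * (log_p_s - log_p_t)).sum(-1)   # reverse KL, already needed for OPD
    entropy = -(p_s * log_p_s).sum(-1)

    # Extra work: in-batch min-max normalization + Soft-OR (detached so selection
    # itself carries no gradient), then a top-k retention mask.
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-8)
    score = 1.0 - (1.0 - norm(entropy.detach())) * (1.0 - norm(per_token_kl.detach()))
    k = max(1, int(keep_ratio * score.numel()))
    keep = torch.zeros_like(score, dtype=torch.bool)
    keep[torch.topk(score, k).indices] = True

    return per_token_kl[keep].mean()
```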
Limitations and Open Questions
- Q3 detection is contingent on access to teacher output distributions, but this overhead is marginal since divergence is computed during the standard OPD loss.
- The normalization strategy employed (min-max per batch) could potentially be improved, e.g., by smoothing or using alternative statistics to address outlier sensitivity.
- The entire analysis assumes a reverse KL loss; whether the taxonomy and ordering persist under alternative losses (e.g., forward KL, JSD; per-token forms sketched below) remains open.
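For reference, the standard per-token forms of these divergences are given below; this snippet only states the definitions and does not claim anything about how TIP's ordering behaves under them.

```python
# Per-token reverse KL, forward KL, and Jensen-Shannon divergence between
# student and teacher next-token distributions.
import torch
import torch.nn.functional as F

def per_token_divergences(student_logits, teacher_logits):
    """logits: [T, V]. Returns (reverse_kl, forward_kl, jsd), each of shape [T]."""
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    p_s, p_t = log_p_s.exp(), log_p_t.exp()
    reverse_kl = (p_s * (log_p_s - log_p_t)).sum(-1)   # KL(student || teacher)
    forward_kl = (p_t * (log_p_t - log_p_s)).sum(-1)   # KL(teacher || student)
    log_m = (0.5 * (p_s + p_t)).clamp_min(1e-12).log()  # mixture for JSD
    jsd = 0.5 * (p_s * (log_p_s - log_m)).sum(-1) + 0.5 * (p_t * (log_p_t - log_m)).sum(-1)
    return reverse_kl, forward_kl, jsd
```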
Conclusion
TIP offers an actionable framework for identifying and leveraging the most informative tokens for student model supervision in on-policy distillation. It demonstrates both theoretically and empirically that combining entropy and teacher–student divergence enables more effective and resource-efficient training, with robust improvements across reasoning and agentic planning tasks. The framework sets a foundation for further advances in targeted distillation and training signal allocation, with broad applicability to state-of-the-art LLM alignment and efficiency methods.
Reference: "TIP: Token Importance in On-Policy Distillation" (2604.14084)