Token-Level Knowledge Distillation
- Token-Level Knowledge Distillation is a fine-grained model compression technique that transfers nuanced, per-token output distributions from a large teacher to a compact student model.
- The approach employs token-wise KL divergence and adaptive mechanisms like temperature scaling and divergence control to precisely match the teacher’s soft targets.
- It enhances performance in tasks such as language modeling, translation, and cross-lingual understanding while addressing challenges like tokenizer mismatches and token selection overload.
Token-level knowledge distillation is a fine-grained model compression strategy in which a compact student model learns to mimic the token-wise output distributions, rationales, relationships, or attribution structures produced by a larger teacher model. In contrast to sequence- or sentence-level approaches, token-level distillation captures the nuanced local information present at each token position, providing a robust mechanism for transferring both predictive distributions and the associated deep semantics. Recent research demonstrates that this approach enhances task performance, accelerates convergence, and improves the transferability of knowledge, particularly in settings with limited model capacity or in applications such as language modeling, translation, speech, and cross-lingual understanding.
1. Mathematical Formulations of Token-Level Distillation
The canonical objective in token-level knowledge distillation is to align the per-token prediction distributions between the teacher and the student, commonly using the Kullback–Leibler (KL) divergence. For a source sequence $x$ and its target $y = (y_1, \dots, y_T)$, with teacher logits $z^T_t$ and student logits $z^S_t$, the softened token-wise probability for token $v$ at position $t$ (using temperature $\tau$) is:

$$p^T_\tau(v \mid y_{<t}, x) = \frac{\exp\!\left(z^T_{t,v} / \tau\right)}{\sum_{v' \in \mathcal{V}} \exp\!\left(z^T_{t,v'} / \tau\right)}$$

The student probability $p^S_\tau$ is defined analogously. The token-level distillation loss, summed over sequence positions and vocabulary, is:

$$\mathcal{L}_{\mathrm{KD}} = \tau^2 \sum_{t=1}^{T} \sum_{v \in \mathcal{V}} p^T_\tau(v \mid y_{<t}, x)\, \log \frac{p^T_\tau(v \mid y_{<t}, x)}{p^S_\tau(v \mid y_{<t}, x)}$$
This loss is typically combined with conventional maximum likelihood or cross-entropy loss on ground truth tokens, or further augmented using auxiliary losses when token rationales or representations are considered (Wei et al., 2024, Li et al., 2022, Sun et al., 2019).
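As a concrete sketch, the combined objective can be written in a few lines of NumPy. This is a minimal illustration of the canonical loss above; the `alpha` mixing weight and the default temperature are assumptions, not values prescribed by the cited works.

```python
import numpy as np

def softmax(z, tau=1.0):
    """Temperature-softened softmax over the vocabulary axis."""
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_kd_loss(teacher_logits, student_logits, targets, tau=2.0, alpha=0.5):
    """Per-token forward KL between softened teacher/student distributions,
    mixed with hard-label cross-entropy. Logits are (T, V); `targets` is a
    length-T array of gold token ids."""
    p_t = softmax(teacher_logits, tau)  # teacher soft targets
    p_s = softmax(student_logits, tau)  # student soft predictions
    # KL summed over vocabulary, averaged over positions; the tau^2 factor
    # keeps gradient magnitudes comparable across temperatures.
    kl = (p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12))).sum(axis=-1)
    kd = (tau ** 2) * kl.mean()
    # Standard cross-entropy on ground-truth tokens (tau = 1).
    log_q = np.log(softmax(student_logits, 1.0) + 1e-12)
    ce = -log_q[np.arange(len(targets)), targets].mean()
    return alpha * kd + (1 - alpha) * ce
```

When teacher and student agree exactly, the KD term vanishes and only the cross-entropy term remains, which is one quick sanity check on an implementation.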
Several variants extend this framework:
- Token-wise divergence control employs adaptive mixing of forward and reverse KL at each token:

$$\mathcal{L}_{\mathrm{ToDi}} = \sum_{t=1}^{T} \left[ \alpha_t \, \mathrm{FKL}_t + (1 - \alpha_t) \, \mathrm{RKL}_t \right], \qquad \alpha_t = \sigma\!\left( \log \frac{p^T_t(y_t)}{p^S_t(y_t)} \right)$$

where $\alpha_t$ depends on the log-ratio between teacher and student probabilities for the token at time $t$ (Jung et al., 22 May 2025).
- Token-adaptive temperature scaling and selection focuses distillation on difficult tokens by dynamic metrics (e.g., Hellinger distance) and per-token temperature heuristics, sharpening or smoothing the teacher’s output distribution to accelerate correction and improve generalization (Xie et al., 13 Oct 2025).
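A minimal sketch of token-wise divergence blending in the spirit of ToDi, assuming a sigmoid gate on the gold-token log-ratio; the gating choice and names here are illustrative, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def todi_loss(p_teacher, p_student, target_ids):
    """Blend forward and reverse KL per position: the gate w_t leans toward
    forward KL where the student underestimates the gold token, and toward
    reverse KL where it overestimates. Inputs are (T, V) probability arrays."""
    idx = np.arange(len(target_ids))
    log_ratio = np.log(p_teacher[idx, target_ids]) - np.log(p_student[idx, target_ids])
    w = sigmoid(log_ratio)  # w -> 1 when teacher prob exceeds student prob
    fkl = (p_teacher * (np.log(p_teacher + 1e-12) - np.log(p_student + 1e-12))).sum(-1)
    rkl = (p_student * (np.log(p_student + 1e-12) - np.log(p_teacher + 1e-12))).sum(-1)
    return (w * fkl + (1 - w) * rkl).mean()
```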
2. Token Attribution, Rationale Extraction, and Representational Guidance
Beyond matching output distributions, token-level distillation is also employed to transfer teacher rationales, attribution signals, or deep token-level representations. For instance:
- Saliency-based rationale extraction: A teacher model identifies the most influential input tokens by gradient-based saliency (e.g., $s_i = \lVert \partial \mathcal{L} / \partial x_i \rVert$), and these are used as rationales in the student's input or output sequence. Training jointly optimizes for both rationale generation and label prediction, using a mixed cross-entropy loss:

$$\mathcal{L} = \lambda \, \mathcal{L}_{\mathrm{CE}}^{\mathrm{label}} + (1 - \lambda) \, \mathcal{L}_{\mathrm{CE}}^{\mathrm{rationale}}$$

with $\lambda$ balancing ground-truth supervision and attribution-guided distillation (Ballout et al., 2024).
- Contrastive representation alignment: In multilingual setups, student masked-token representations are pulled towards corresponding teacher features via a cross-lingual word-aware contrastive loss (XWCL):

$$\mathcal{L}_{\mathrm{XWCL}} = -\sum_{i} \log \frac{\exp\!\left(\mathrm{sim}(h^S_i, h^T_i)/\tau\right)}{\sum_{j} \exp\!\left(\mathrm{sim}(h^S_i, h^T_j)/\tau\right)}$$

promoting fine-grained cross-lingual semantic alignment (Li et al., 2022).
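The contrastive alignment can be illustrated with a generic InfoNCE-style loss over aligned token representations. This is a sketch under the assumption of one positive teacher feature per student token; it is not the paper's exact objective.

```python
import numpy as np

def xwcl_contrastive_loss(student_reps, teacher_reps, temperature=0.1):
    """Pull each student token representation toward its aligned teacher
    feature (diagonal positives) and push it away from the other teacher
    features in the batch. Both inputs have shape (N, D)."""
    s = student_reps / np.linalg.norm(student_reps, axis=1, keepdims=True)
    t = teacher_reps / np.linalg.norm(teacher_reps, axis=1, keepdims=True)
    sim = s @ t.T / temperature                  # (N, N) scaled cosine similarities
    sim = sim - sim.max(axis=1, keepdims=True)   # numerical stability
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    # Positive pairs lie on the diagonal (student token i vs. teacher token i).
    return -np.mean(np.diag(log_softmax))
```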
3. Token-Level Distillation Variants and Adaptive Techniques
Recent literature introduces several advanced mechanisms leveraging token-level predictions:
- Ensemble distillation: Per-token soft targets from multiple teachers (possibly of different architectures) are averaged, enhancing knowledge transfer and generalizability; pseudo-labels on unlabeled data further enrich supervision (Sun et al., 2019).
- Token-wise divergence adaptation: ToDi adaptively blends forward and reverse KL using a sigmoid function of the teacher–student log-ratio, targeting underestimated (FKL) or overestimated (RKL) tokens as needed (Jung et al., 22 May 2025).
- Token-adaptive temperature scaling: AdaKD selects hard tokens by Hellinger distance, concentrates the loss on top-ranked tokens dynamically (LATF), and applies inverse-difficulty temperature scaling to fast-track learning where most required (Xie et al., 13 Oct 2025).
- Delta-KD: Rather than matching the absolute teacher outputs, Delta-KD distills the distributional shift the teacher experienced during supervised fine-tuning (SFT), computed as

$$\Delta_t = z^{T_{\mathrm{SFT}}}_t - z^{T_{\mathrm{base}}}_t$$

enabling the student to internalize the adjustment direction rather than the teacher's final solution (Cao et al., 18 Sep 2025).
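The shift-transfer idea can be sketched as follows; the variable names and the softened-KL training loss are illustrative assumptions, not the paper's exact notation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def delta_kd_targets(base_teacher_logits, sft_teacher_logits, student_logits):
    """Apply the teacher's per-token SFT logit shift on top of the student's
    own logits, instead of copying the SFT teacher's absolute outputs."""
    delta = sft_teacher_logits - base_teacher_logits  # teacher's SFT adjustment
    return student_logits + delta                     # shifted per-token targets

def delta_kd_loss(base_teacher_logits, sft_teacher_logits, student_logits):
    """Token-level KL from the softened shifted targets to the student
    (in training, the target side would be treated as a constant)."""
    p_target = softmax(delta_kd_targets(base_teacher_logits,
                                        sft_teacher_logits, student_logits))
    p_student = softmax(student_logits)
    return (p_target * (np.log(p_target + 1e-12)
                        - np.log(p_student + 1e-12))).sum(-1).mean()
```

When the base and SFT teachers coincide, the shift is zero and the loss vanishes, matching the intuition that there is no adjustment left to transfer.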
4. Application Modalities: Language, Vision, Multimodal, and Beyond
Token-level distillation has been extensively adopted across domains:
- Sequence modeling and language tasks: G2P conversion, NMT, summarization, question answering, arithmetic word problems, and instruction-following LLMs have documented improvements through per-token distribution matching or rationale integration (Sun et al., 2019, Wei et al., 2024, Ballout et al., 2024, Zhang et al., 4 Mar 2025).
- Cross-lingual and multilingual alignment: MMKD leverages token-level contrastive objectives to transfer semantic alignment, improving zero-shot generalization and low-resource language performance (Li et al., 2022).
- Visual and multimodal networks: In speaker verification, an auxiliary “Distillation Token” is injected into the multihead self-attention student architecture, trained via token-level KL to mimic the teacher’s soft predictions, complementing a hard-supervised class token (Mingote et al., 2021). In visual classification, the Token-level Relationship Graph (TRG) preserves both intra-image contextuality and inter-token relations across images, producing SOTA gains on balanced and imbalanced datasets (Zhang et al., 2023).
- Alignment and RLHF-equivalent objectives: AlignDistil formulates token-level distributional reward optimization equivalent to RLHF with DPO-derived token-wise rewards and adaptively extrapolated logit targets, yielding superior LLM alignment quality and rapid convergence (Zhang et al., 4 Mar 2025).
5. Sequence- vs. Token-Level Distillation: Empirical and Practical Considerations
Empirical studies establish that token-level distillation provides a more expressive and robust learning signal in many—but not all—settings:
- When advantageous: With sufficient student capacity, clean data, or straightforward decoding (e.g., teacher forcing in NMT), token-level objectives yield up to +1.5 BLEU over sequence-level baselines (Wei et al., 2024). In G2P, token-level ensemble distillation gives lower WER than top-1 sequence distillation (19.88% vs. 20.32%) (Sun et al., 2019).
- Limitations: For small students, noisy data, or difficult decoders, sentence- or sequence-level distillation is often preferable; pure token-level schemes can degrade performance due to overfitting or error propagation (Wei et al., 2024).
- Hybrid mechanisms: Dynamic gating between sentence- and token-level losses—where the loss weight is learned as a function of the input—outperforms either alone, adapting the signal to task complexity and student competence (Wei et al., 2024).
- Token selection and overload: Empirical ablations indicate that over-selecting rationale tokens (e.g., excessively long rationales) leads to information overload, while random token selection undermines performance (Ballout et al., 2024).
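The dynamic gating mechanism above can be sketched as a learned logistic gate mixing the two losses; the gate's features and parameters here are illustrative stand-ins for whatever gating network is learned in practice.

```python
import numpy as np

def gated_kd_loss(token_loss, sequence_loss, features, w, b):
    """Mix token-level and sequence-level distillation losses with a
    per-example gate g(x) in (0, 1), modeled as a logistic function of
    simple input features. All losses are per-example arrays."""
    g = 1.0 / (1.0 + np.exp(-(features @ w + b)))  # per-example gate in (0, 1)
    return g * token_loss + (1.0 - g) * sequence_loss
```

Because the gate is strictly inside (0, 1), the mixed loss always lies between the two component losses, interpolating toward whichever signal the gate network deems more informative for that input.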
6. Token-Level Distillation for Cross-Tokenizer and Cross-Architecture Transfer
A persistent challenge for KD is the mismatch in tokenizer vocabularies or architectural differences between teacher and student. The Multi-Level Optimal Transport (MultiLevelOT) framework addresses this by aligning teacher and student logit distributions at both the token and sequence levels using entropy-regularized optimal transport (OT). By constructing global- and local-aware cost matrices across top-$k$ logits and optimizing with fast Sinkhorn iterations, this approach bypasses the need for explicit token alignment. The loss

$$\mathcal{L}_{\mathrm{OT}} = \min_{P \in \Pi(\mu, \nu)} \langle C, P \rangle - \varepsilon H(P)$$

effectively enables universal distillation across model families and vocabularies, with robust downstream gains (Cui et al., 2024).
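A minimal Sinkhorn sketch of the OT machinery involved is below; uniform marginals and an absolute-difference cost over sorted top-$k$ probabilities are simplifying assumptions, whereas MultiLevelOT's actual cost matrices are global- and local-aware.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.05, n_iters=200):
    """Entropy-regularized optimal transport between uniform marginals via
    Sinkhorn iterations; returns the transport plan."""
    n, m = cost.shape
    a, b = np.ones(n) / n, np.ones(m) / m  # uniform source/target marginals
    K = np.exp(-cost / eps)
    v = np.ones(m) / m
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def ot_token_loss(teacher_probs, student_probs, top_k=8):
    """Align one teacher and one student token distribution with (possibly)
    mismatched vocabularies by transporting mass between their sorted top-k
    probabilities; the |p_i - q_j| cost is an illustrative choice."""
    t = np.sort(teacher_probs)[::-1][:top_k]
    s = np.sort(student_probs)[::-1][:top_k]
    cost = np.abs(t[:, None] - s[None, :])
    plan = sinkhorn_plan(cost)
    return float((plan * cost).sum())
```

Sorting by probability rank sidesteps explicit token-id alignment, which is what makes this style of objective usable across different tokenizers.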
7. Impact, Limitations, and Outlook
Token-level knowledge distillation is empirically validated as the preferred strategy for high-fidelity, fine-grained knowledge transfer in many modern deep learning applications:
- Performance: It delivers up to 1–2% absolute accuracy improvement over standard student training or sequence-level distillation, and demonstrates robustness to architecture heterogeneity and long-tailed class distributions (Ballout et al., 2024, Zhang et al., 2023, Li et al., 2022).
- Interpretability and analysis: Attribution-guided rationale distillation not only boosts performance but also lends interpretive value, as top-attribution tokens are shown to overlap with ground-truth labels in over 68% of multiple-choice QA cases (Ballout et al., 2024).
- Challenges: Determining when token-level objectives are optimal, handling information overload, and managing vocabulary or feature misalignments are ongoing areas of research. Hybrid strategies, token-adaptive weighting, and representation-level objectives offer promising mitigation (Jung et al., 22 May 2025, Xie et al., 13 Oct 2025, Cui et al., 2024).
- Generality: The approach readily extends to vision, cross-lingual, multimodal, and RL-aligned networks, reflecting its versatility.
In sum, token-level distillation forms the backbone of contemporary model compression and alignment practice, providing expressive, efficient, and generalizable transference of knowledge from large models to smaller or more specialized students across domains (Ballout et al., 2024, Wei et al., 2024, Li et al., 2022, Sun et al., 2019, Jung et al., 22 May 2025, Xie et al., 13 Oct 2025, Cao et al., 18 Sep 2025, Cui et al., 2024, Zhang et al., 4 Mar 2025, Mingote et al., 2021, Zhang et al., 2023).