Token Relation Distillation (TRD)

Updated 19 March 2026
  • Token Relation Distillation (TRD) is a method that transfers fine-grained token-level relational structures from a teacher model to a smaller student, improving semantic alignment.
  • It constructs a Token-level Relationship Graph (TRG) to capture both intra-instance context and cross-instance token similarities, enabling robust and spatially-aware knowledge transfer.
  • Empirical results demonstrate that TRD boosts accuracy and robustness in both CNN and ViT architectures, especially under imbalanced classification scenarios.

Token Relation Distillation (TRD) introduces a methodology for enhancing knowledge distillation by explicitly transferring the token-level relational structure learned by a powerful teacher network to a typically smaller student model. Departing from conventional distillation approaches that emphasize either logits or instance-level relationships, TRD leverages a Token-level Relationship Graph (TRG) to encapsulate both intra-instance semantic context and cross-instance token-wise similarities. This graph-centric strategy enables the student to emulate fine-grained, higher-order semantic dependencies from the teacher, with demonstrated advantages on balanced and imbalanced classification tasks across CNN and ViT architectures (Zhang et al., 2023).

1. Motivation and Theoretical Foundations

Traditional knowledge distillation (KD) as introduced by Hinton et al. (Hinton et al., 2015) focuses on transferring class probability distributions through softened logits:

L_{KD} = \frac{1}{N} \sum_{i=1}^N \mathrm{KL}\left(p_i^S(\tau) \,\|\, p_i^T(\tau)\right),

yielding the standard loss formulation:

L_\text{logit} = L_\text{CE}(p^S, y) + \lambda L_{KD}.
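As a concrete illustration, a minimal NumPy sketch of this softened-logit objective (function names are illustrative, not from the paper; the KL direction follows the formula above):

```python
import numpy as np

def softmax(z, tau=1.0):
    # Temperature-softened softmax along the last axis, numerically stable.
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, tau=4.0):
    # L_KD = (1/N) * sum_i KL(p_i^S(tau) || p_i^T(tau)), as written above.
    # (Some formulations reverse the KL direction; this follows the text.)
    p_s = softmax(student_logits, tau)
    p_t = softmax(teacher_logits, tau)
    kl = (p_s * (np.log(p_s) - np.log(p_t))).sum(axis=-1)
    return kl.mean()
```

In the full objective this term would be weighted by λ and added to the cross-entropy on ground-truth labels.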

Extensions incorporating feature-based alignment or instance-level relational graphs (e.g., RKD [Park et al., 2019], IRG [Liu et al., 2019]) improved transfer, but failed to explicitly model intra-image structure and higher-order semantic patterns.

The central hypothesis motivating TRD is that transferring the rich, token-level relational information—particularly relevant for patch-based architectures (e.g., ViT [Dosovitskiy et al., 2021]) or feature-aggregating CNNs—enables more complete knowledge transfer. This is especially beneficial in long-tailed settings, where rare classes may share semantic micro-patterns best captured at the token level.

Token-level relationships encode:

  • Inner-instance semantic context: How patches or regions within an image are related.
  • Cross-instance patch-to-patch similarity: How a token in one image relates to tokens in others.

By explicitly distilling this graph-structured data, TRD aims to bridge the capability gap between teacher and student, surpassing instance- or feature-level approaches (Zhang et al., 2023).

2. Construction of the Token-level Relationship Graph (TRG)

Token Representation

  • ViT-like networks: Images x \in \mathbb{R}^{C \times H \times W} are partitioned into M = HW/P^2 patches x_p \in \mathbb{R}^{M \times D} with D = P^2 C. Teacher tokens T^T \in \mathbb{R}^{B \cdot M \times d_T}, student tokens T^S \in \mathbb{R}^{B \cdot M \times d_S}.
  • CNN-like networks: The penultimate feature map F \in \mathbb{R}^{C_\ell \times H_\ell \times W_\ell} is split into M patches, yielding tokens T \in \mathbb{R}^{M \times d}.

Every token, whether from teacher or student, is a d-dimensional vector h_i used as a node attribute in the graph.

Random Token Sampling

A full batch contains B \times M tokens, which is often computationally unwieldy. TRD therefore samples K tokens per image using a shared random mask for both models, yielding N = B \cdot K tokens each for teacher and student:

T^T = \{ h^T_i \}_{i=1}^N, \quad T^S = \{ h^S_i \}_{i=1}^N.
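The shared-mask sampling can be sketched as follows (a minimal NumPy illustration; the array layout and function names are assumptions, not the paper's implementation):

```python
import numpy as np

def sample_tokens(teacher_tokens, student_tokens, k, rng=None):
    """Sample K tokens per image with ONE shared random mask, so that
    teacher token i and student token i always refer to the same patch.
    teacher_tokens: (B, M, d_T); student_tokens: (B, M, d_S)."""
    rng = np.random.default_rng(rng)
    B, M, _ = teacher_tokens.shape
    # One index set per image, reused for both networks.
    idx = np.stack([rng.choice(M, size=k, replace=False) for _ in range(B)])
    rows = np.arange(B)[:, None]
    t = teacher_tokens[rows, idx].reshape(B * k, -1)  # (N, d_T), N = B*K
    s = student_tokens[rows, idx].reshape(B * k, -1)  # (N, d_S)
    return t, s
```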

Graph Construction

Two attributed graphs, G^T = (V, A^T) and G^S = (V, A^S), share the vertex set V = \{1, \dots, N\}, with adjacency matrices A^T, A^S \in \mathbb{R}^{N \times N}. For i \neq j:

A^{(\cdot)}_{ij} = \begin{cases} \exp\left(- \dfrac{\|h_i - h_j\|^2}{2\sigma^2} \right) & \text{if } i \in k\text{NN}(j) \text{ or } j \in k\text{NN}(i) \\ 0 & \text{otherwise} \end{cases}
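A minimal NumPy sketch of this kNN-thresholded Gaussian adjacency (dense pairwise distances are used for clarity; a practical implementation would use an approximate neighbour search):

```python
import numpy as np

def knn_adjacency(h, k=3, sigma=1.0):
    """Gaussian-kernel adjacency kept only on kNN edges.
    h: (N, d) token matrix.
    A_ij = exp(-||h_i - h_j||^2 / (2 sigma^2)) if i in kNN(j) or j in kNN(i),
    else 0; the diagonal is excluded."""
    N = h.shape[0]
    d2 = ((h[:, None, :] - h[None, :, :]) ** 2).sum(-1)  # pairwise squared dists
    np.fill_diagonal(d2, np.inf)                          # exclude self from kNN
    nn = np.argsort(d2, axis=1)[:, :k]                    # kNN(i) for each row i
    mask = np.zeros((N, N), dtype=bool)
    rows = np.repeat(np.arange(N), k)
    mask[rows, nn.ravel()] = True                         # j in kNN(i) ...
    mask |= mask.T                                        # ... or i in kNN(j)
    return np.where(mask, np.exp(-d2 / (2 * sigma ** 2)), 0.0)
```

The "or" in the neighbourhood condition is what makes the sparsity pattern (and hence the adjacency) symmetric.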

A dense token-wise relational similarity matrix R can also be formed:

R_{ij} = f(h_i, h_j) = \cos(h_i, h_j) = \frac{h_i \cdot h_j}{\|h_i\| \, \|h_j\|},

optionally normalized:

P_{ij} = \mathrm{Softmax}_j(R_{ij}/\tau_R).
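The dense relation matrix and its row-wise normalisation can be sketched as (illustrative NumPy, not the authors' code):

```python
import numpy as np

def relation_matrix(h, tau_r=0.1):
    """Dense token relation matrix R (pairwise cosine similarity) and its
    row-wise softmax normalisation P_ij = softmax_j(R_ij / tau_R)."""
    hn = h / np.linalg.norm(h, axis=1, keepdims=True)  # unit-normalise tokens
    R = hn @ hn.T                                      # R_ij = cos(h_i, h_j)
    z = R / tau_r
    z = z - z.max(axis=1, keepdims=True)               # numerical stability
    e = np.exp(z)
    P = e / e.sum(axis=1, keepdims=True)               # each row sums to 1
    return R, P
```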

3. TRG-based Distillation Objectives

The total loss combines logit-based KD with several graph and token-level objectives:

L_\text{total} = L_\text{logit} + \alpha L_\text{inner} + \beta L_\text{local} + \gamma L_\text{global}

with hyperparameters \lambda, \alpha, \beta, \gamma, \tau, \tau_g^{\mathrm{init}}, W_U.

3.1 Local Preserving Loss

This term matches local neighborhood structure between student and teacher graphs:

L_\text{local} = \sum_{i=1}^N \mathrm{KL}\left(\mathrm{Softmax}_j(A^S_{ij}) \,\|\, \mathrm{Softmax}_j(A^T_{ij})\right).
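A minimal NumPy sketch of this row-wise KL matching (function names are illustrative):

```python
import numpy as np

def local_preserving_loss(A_s, A_t):
    """L_local = sum_i KL(softmax_j(A^S_i) || softmax_j(A^T_i)).
    Row-wise softmax turns each adjacency row into a neighbourhood
    distribution; the KL term pulls student rows toward teacher rows."""
    def row_softmax(A):
        z = A - A.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)
    p_s, p_t = row_softmax(A_s), row_softmax(A_t)
    return (p_s * (np.log(p_s) - np.log(p_t))).sum()
```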

3.2 Global Contrastive Loss

Employing an InfoNCE objective, corresponding tokens across teacher and student are aligned while negatives are pushed apart. Student tokens are projected to the teacher dimension if d_S \neq d_T:

\mathrm{SIM}(h_i^S, h_j^T) = \frac{\mathrm{Proj}(h_i^S) \cdot h_j^T}{\|\mathrm{Proj}(h_i^S)\| \, \|h_j^T\|}

L_\text{global} = - \sum_{i=1}^N \log \frac{\exp(\mathrm{SIM}(h_i^S, h_i^T)/\tau_g)}{\sum_{j=1}^N \exp(\mathrm{SIM}(h_i^S, h_j^T)/\tau_g)}
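A minimal NumPy version of this InfoNCE alignment, with an optional linear map standing in for Proj (the linear form and function names are assumptions):

```python
import numpy as np

def global_contrastive_loss(h_s, h_t, W=None, tau_g=0.1):
    """InfoNCE over matched teacher/student tokens: token i of the student
    should be most similar to token i of the teacher (positives on the
    diagonal), with all other teacher tokens acting as negatives.
    W: optional (d_S, d_T) projection used when dimensions differ."""
    z_s = h_s @ W if W is not None else h_s
    z_s = z_s / np.linalg.norm(z_s, axis=1, keepdims=True)
    z_t = h_t / np.linalg.norm(h_t, axis=1, keepdims=True)
    sim = z_s @ z_t.T / tau_g                   # SIM(h_i^S, h_j^T) / tau_g
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.diag(log_p).sum()                # -sum_i log p(i matches i)
```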

A dynamic temperature \tau_g(e), where e is the epoch index, is adopted:

\tau_g(e) = \begin{cases} \tau_g^{\mathrm{init}}, & \text{if } e \leq W_U \\ \tau_g^{\mathrm{init}} / \log_{W_U}(e), & \text{if } e > W_U \end{cases}
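The schedule can be written directly (a sketch; the defaults mirror the CIFAR-100 settings reported in Section 4):

```python
import numpy as np

def tau_g(epoch, tau_init=0.1, warmup=15):
    """Dynamic contrastive temperature: constant during warm-up, then
    decayed as tau_init / log_{W_U}(e), sharpening the contrast in
    later epochs."""
    if epoch <= warmup:
        return tau_init
    # log base W_U via change of base: log_{W_U}(e) = ln(e) / ln(W_U)
    return tau_init / (np.log(epoch) / np.log(warmup))
```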

3.3 Token-wise Contextual Loss

To transfer inner-instance semantic context, the method matches the self-similarity matrices of patch tokens within each image. For the penultimate feature map F:

\mathrm{CS} = \mathrm{Softmax}\left(\frac{F F^\top}{\sqrt{d}} \right), \quad \mathrm{CS}^T, \mathrm{CS}^S \in \mathbb{R}^{M \times M}

L_\text{inner} = \|\mathrm{CS}^T - \mathrm{CS}^S\|^2_F

This constrains the student to preserve internal patch arrangements consistent with the teacher.
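A NumPy sketch of this self-similarity matching for one image's tokens (illustrative; the teacher and student may have different token dimensions, since CS is M×M either way):

```python
import numpy as np

def softmax_rows(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def self_sim(F):
    """CS = softmax(F F^T / sqrt(d)) for the (M, d) tokens of one image."""
    d = F.shape[1]
    return softmax_rows(F @ F.T / np.sqrt(d))

def inner_context_loss(F_s, F_t):
    """L_inner = || CS^T - CS^S ||_F^2 (squared Frobenius norm)."""
    return ((self_sim(F_t) - self_sim(F_s)) ** 2).sum()
```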

4. Empirical Evaluation

Datasets and Architectures

  • Datasets: CIFAR-100, CIFAR-100-LT (imbalance ratios 10, 50, 100), ImageNet-1K, ImageNet-LT (imbalance 10).
  • CNN-based models: ResNet-32×4 → ResNet-8×4, ResNet56 → ResNet20, VGG13 → VGG8, WRN-40-2 → WRN-40-1, ShuffleNet → MobileNet.
  • ViT-based models: DeiT-Tiny/Small students, ResNet-101 or CeiT-Base teachers.

Training Configurations

  • CIFAR-100: SGD with Nesterov momentum 0.9 and weight decay 5 \times 10^{-4}, 240 epochs, LR reduced at epochs 150, 180, and 210. \tau = 4, \tau_g^{\mathrm{init}} = 0.1, warm-up W_U = 15.
  • ImageNet & ViTs: CNNs – 100 epochs, cosine LR schedule, batch size 128×4. ViTs – AdamW, LR 5 \times 10^{-4}, 200 epochs, 10-epoch warm-up. All experiments on 4× RTX 3090 GPUs.

Performance Summary

TRD achieves superior accuracy and robustness, notably:

| Setting | TRD Top-1 (%) | Strongest baselines (%) |
| --- | --- | --- |
| ResNet-32×4 → ResNet-8×4 (CIFAR) | 76.42 | HKD: 76.21, KD: 74.12 |
| ShuffleV1 ← ResNet-32×4 | 76.42 | DKD: 76.42, HKD: 75.99 |
| ResNet34 → ResNet18 (ImageNet) | 71.31 | KD +1.56% |
| DeiT-Tiny ← ResNet101 | 75.5 | KD: 74.8, HKD: 75.2 |
| DeiT-Small ← CeiT-Base | 81.8 | HKD: 81.3 |

On long-tailed variants:

  • CIFAR-100-LT: TRD degrades less as imbalance increases and can surpass teacher accuracy.
  • ImageNet-LT: TRD Top-1 50.32% (a 20.99% drop) vs. KD 46.70% (a 24.00% drop).

Ablation studies confirm that each loss (L_\text{inner}, L_\text{local}, L_\text{global}) contributes 0.3%–1.0% accuracy, with token-level graphs outperforming instance-level graphs. The dynamic \tau_g provides smoother optimization and lower embedding divergence.

5. Analysis: Contextualization and Visualizations

Several key insights arise from empirical analysis:

  • Larger token samples (e.g., 512 tokens) strengthen the representational capacity of the graph structure, yielding modest additional accuracy gains.
  • t-SNE projections show TRD features are more class-separable than those from KD, IRG, or HKD, indicating successful transfer of fine-grained relational information.
  • The addition of each loss component leads to measurable, compositional gains, confirming efficacy of the multi-term objective (Zhang et al., 2023).

6. Limitations and Prospective Directions

Identified challenges and future research include:

  • Computational demands: Full construction of k-NN graphs over O(BK) tokens is resource-intensive. Practical deployments may require efficient or approximate graph-building techniques such as locality-sensitive hashing.
  • Hyperparameter sensitivity: Optimal settings for \alpha, \beta, \gamma, k, \sigma, \tau_g^{\mathrm{init}}, and W_U must be tuned per dataset and architecture.
  • Beyond classification: Extending TRD to object detection, semantic segmentation, or temporal token graphs for video.
  • Graph topology learning: Instead of fixed kk-NN, learning the adjacency matrix during training.
  • Self-distillation and multiscale transfer: Applying token-level relations for intra-network feature alignment.
  • Cross-modal distillation: Adapting the framework to transfer relational information across modalities (e.g., image-text pairs in vision-LLMs).

TRD advances the progression from basic logit-based KD [Hinton et al., 2015] to feature and relation-based approaches, such as RKD [Park et al., 2019], IRG [Liu et al., 2019], and graph-based distillation [Zhou et al., 2021]. Distinct from prior work that either matches holistic features or instance relationships, TRD’s explicit modeling of token-level graphs enables new forms of fine-grained and spatially-aware knowledge transfer, especially applicable to architectures with patch- or region-based representations (e.g., ViT [Dosovitskiy et al., 2021]). Use of InfoNCE and contrastive paradigms is aligned with the research trajectory outlined in [Tian et al., 2019; Wang & Isola, 2020].

A plausible implication is that the token-centric relational framework may generalize to settings with structured semantic dependencies beyond image classification, suggesting future exploration into relational and multi-modal knowledge transfer regimes (Zhang et al., 2023).
