Token Relation Distillation (TRD)

Updated 19 March 2026
  • Token Relation Distillation (TRD) is a method that transfers fine-grained token-level relational structures from a teacher model to a smaller student, improving semantic alignment.
  • It constructs a Token-level Relationship Graph (TRG) to capture both intra-instance context and cross-instance token similarities, enabling robust and spatially-aware knowledge transfer.
  • Empirical results demonstrate that TRD boosts accuracy and robustness in both CNN and ViT architectures, especially under imbalanced classification scenarios.

Token Relation Distillation (TRD) introduces a methodology for enhancing knowledge distillation by explicitly transferring the token-level relational structure learned by a powerful teacher network to a typically smaller student model. Departing from conventional distillation approaches that emphasize either logits or instance-level relationships, TRD leverages a Token-level Relationship Graph (TRG) to encapsulate both intra-instance semantic context and cross-instance token-wise similarities. This graph-centric strategy enables the student to emulate fine-grained, higher-order semantic dependencies from the teacher, with demonstrated advantages on balanced and imbalanced classification tasks across CNN and ViT architectures (Zhang et al., 2023).

1. Motivation and Theoretical Foundations

Traditional knowledge distillation (KD) as introduced by Hinton et al. (Hinton et al., 2015) focuses on transferring class probability distributions through softened logits:

L_{KD} = \frac{1}{N} \sum_{i=1}^N \mathrm{KL}\left(p_i^S(\tau) \,\|\, p_i^T(\tau)\right),

yielding the standard loss formulation:

L_\text{logit} = L_\text{CE}(p^S, y) + \lambda L_{KD}.
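As a concrete illustration, a minimal NumPy sketch of this softened-logit objective (function names are illustrative, not from the paper; the KL direction follows the formula above):

```python
import numpy as np

def softmax(z, tau=1.0):
    # Temperature-softened softmax along the last axis, numerically stable.
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, tau=4.0):
    # L_KD = (1/N) * sum_i KL(p_i^S(tau) || p_i^T(tau)), as written above.
    # (Some formulations reverse the KL direction; this follows the text.)
    p_s = softmax(student_logits, tau)
    p_t = softmax(teacher_logits, tau)
    kl = (p_s * (np.log(p_s) - np.log(p_t))).sum(axis=-1)
    return kl.mean()
```

In the full objective this term would be weighted by λ and added to the cross-entropy on ground-truth labels.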

Extensions incorporating feature-based alignment or instance-level relational graphs (e.g., RKD [Park et al., 2019], IRG [Liu et al., 2019]) improved transfer, but failed to explicitly model intra-image structure and higher-order semantic patterns.

The central hypothesis motivating TRD is that transferring the rich, token-level relational information—particularly relevant for patch-based architectures (e.g., ViT [Dosovitskiy et al., 2021]) or feature-aggregating CNNs—enables more complete knowledge transfer. This is especially beneficial in long-tailed settings, where rare classes may share semantic micro-patterns best captured at the token level.

Token-level relationships encode:

  • Inner-instance semantic context: How patches or regions within an image are related.
  • Cross-instance patch-to-patch similarity: How a token in one image relates to tokens in others.

By explicitly distilling this graph-structured data, TRD aims to bridge the capability gap between teacher and student, surpassing instance- or feature-level approaches (Zhang et al., 2023).

2. Construction of the Token-level Relationship Graph (TRG)

Token Representation

  • ViT-like networks: Images x \in \mathbb{R}^{C \times H \times W} are partitioned into M = HW/P^2 patches x_p \in \mathbb{R}^{M \times D} with D = P^2 C. Teacher tokens T^T \in \mathbb{R}^{B \cdot M \times d_T}, student tokens T^S \in \mathbb{R}^{B \cdot M \times d_S}.
  • CNN-like networks: The penultimate feature map F \in \mathbb{R}^{C_\ell \times H_\ell \times W_\ell} is split into M patches, yielding tokens T \in \mathbb{R}^{M \times d}.

Every token, whether from teacher or student, is a d-dimensional vector h_i used as a node attribute in the graph.

Random Token Sampling

A full batch contains B \times M tokens, which is often computationally unwieldy. TRD therefore samples K tokens per image using a shared random mask for both models, yielding N = B \cdot K tokens each for teacher and student:

T^T = \{ h^T_i \}_{i=1}^N, \quad T^S = \{ h^S_i \}_{i=1}^N.
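The shared-mask sampling can be sketched as follows (a minimal NumPy illustration; the array layout and function names are assumptions, not the paper's implementation):

```python
import numpy as np

def sample_tokens(teacher_tokens, student_tokens, k, rng=None):
    """Sample K tokens per image with ONE shared random mask, so that
    teacher token i and student token i always refer to the same patch.
    teacher_tokens: (B, M, d_T); student_tokens: (B, M, d_S)."""
    rng = np.random.default_rng(rng)
    B, M, _ = teacher_tokens.shape
    # One index set per image, reused for both networks.
    idx = np.stack([rng.choice(M, size=k, replace=False) for _ in range(B)])
    rows = np.arange(B)[:, None]
    t = teacher_tokens[rows, idx].reshape(B * k, -1)  # (N, d_T), N = B*K
    s = student_tokens[rows, idx].reshape(B * k, -1)  # (N, d_S)
    return t, s
```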

Graph Construction

Two attributed graphs, G^T = (V, A^T) and G^S = (V, A^S), share the vertex set V = \{1, \dots, N\}, with adjacency matrices A^T, A^S \in \mathbb{R}^{N \times N}. For i \neq j:

A^{(\cdot)}_{ij} = \begin{cases} \exp\left(- \dfrac{\|h_i - h_j\|^2}{2\sigma^2} \right) & \text{if } i \in k\text{NN}(j) \text{ or } j \in k\text{NN}(i) \\ 0 & \text{otherwise} \end{cases}
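A minimal NumPy sketch of this kNN-thresholded Gaussian adjacency (dense pairwise distances are used for clarity; a practical implementation would use an approximate neighbour search):

```python
import numpy as np

def knn_adjacency(h, k=3, sigma=1.0):
    """Gaussian-kernel adjacency kept only on kNN edges.
    h: (N, d) token matrix.
    A_ij = exp(-||h_i - h_j||^2 / (2 sigma^2)) if i in kNN(j) or j in kNN(i),
    else 0; the diagonal is excluded."""
    N = h.shape[0]
    d2 = ((h[:, None, :] - h[None, :, :]) ** 2).sum(-1)  # pairwise squared dists
    np.fill_diagonal(d2, np.inf)                          # exclude self from kNN
    nn = np.argsort(d2, axis=1)[:, :k]                    # kNN(i) for each row i
    mask = np.zeros((N, N), dtype=bool)
    rows = np.repeat(np.arange(N), k)
    mask[rows, nn.ravel()] = True                         # j in kNN(i) ...
    mask |= mask.T                                        # ... or i in kNN(j)
    return np.where(mask, np.exp(-d2 / (2 * sigma ** 2)), 0.0)
```

The "or" in the neighbourhood condition is what makes the sparsity pattern (and hence the adjacency) symmetric.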

A dense token-wise relational similarity matrix R can also be formed:

R_{ij} = f(h_i, h_j) = \cos(h_i, h_j) = \frac{h_i \cdot h_j}{\|h_i\| \, \|h_j\|},

optionally normalized:

P_{ij} = \mathrm{Softmax}_j(R_{ij}/\tau_R).
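The dense relation matrix and its row-wise normalisation can be sketched as (illustrative NumPy, not the authors' code):

```python
import numpy as np

def relation_matrix(h, tau_r=0.1):
    """Dense token relation matrix R (pairwise cosine similarity) and its
    row-wise softmax normalisation P_ij = softmax_j(R_ij / tau_R)."""
    hn = h / np.linalg.norm(h, axis=1, keepdims=True)  # unit-normalise tokens
    R = hn @ hn.T                                      # R_ij = cos(h_i, h_j)
    z = R / tau_r
    z = z - z.max(axis=1, keepdims=True)               # numerical stability
    e = np.exp(z)
    P = e / e.sum(axis=1, keepdims=True)               # each row sums to 1
    return R, P
```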

3. TRG-based Distillation Objectives

The total loss combines logit-based KD with several graph and token-level objectives:

L_\text{total} = L_\text{logit} + \alpha L_\text{inner} + \beta L_\text{local} + \gamma L_\text{global}

with hyperparameters \lambda, \alpha, \beta, \gamma, \tau, \tau_g^{\mathrm{init}}, W_U.

3.1 Local Preserving Loss

This term matches local neighborhood structure between student and teacher graphs:

L_\text{local} = \sum_{i=1}^N \mathrm{KL}\left(\mathrm{Softmax}_j(A^S_{ij}) \,\|\, \mathrm{Softmax}_j(A^T_{ij})\right).
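A minimal NumPy sketch of this row-wise KL matching (function names are illustrative):

```python
import numpy as np

def local_preserving_loss(A_s, A_t):
    """L_local = sum_i KL(softmax_j(A^S_i) || softmax_j(A^T_i)).
    Row-wise softmax turns each adjacency row into a neighbourhood
    distribution; the KL term pulls student rows toward teacher rows."""
    def row_softmax(A):
        z = A - A.max(axis=1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=1, keepdims=True)
    p_s, p_t = row_softmax(A_s), row_softmax(A_t)
    return (p_s * (np.log(p_s) - np.log(p_t))).sum()
```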

3.2 Global Contrastive Loss

Employing an InfoNCE objective, corresponding tokens across teacher and student are aligned while negatives are pushed apart. Student tokens are projected to the teacher dimension if d_S \neq d_T:

\mathrm{SIM}(h_i^S, h_j^T) = \frac{\mathrm{Proj}(h_i^S) \cdot h_j^T}{\|\mathrm{Proj}(h_i^S)\| \, \|h_j^T\|}

L_\text{global} = - \sum_{i=1}^N \log \frac{\exp(\mathrm{SIM}(h_i^S, h_i^T)/\tau_g)}{\sum_{j=1}^N \exp(\mathrm{SIM}(h_i^S, h_j^T)/\tau_g)}
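A minimal NumPy version of this InfoNCE alignment, with an optional linear map standing in for Proj (the linear form and function names are assumptions):

```python
import numpy as np

def global_contrastive_loss(h_s, h_t, W=None, tau_g=0.1):
    """InfoNCE over matched teacher/student tokens: token i of the student
    should be most similar to token i of the teacher (positives on the
    diagonal), with all other teacher tokens acting as negatives.
    W: optional (d_S, d_T) projection used when dimensions differ."""
    z_s = h_s @ W if W is not None else h_s
    z_s = z_s / np.linalg.norm(z_s, axis=1, keepdims=True)
    z_t = h_t / np.linalg.norm(h_t, axis=1, keepdims=True)
    sim = z_s @ z_t.T / tau_g                   # SIM(h_i^S, h_j^T) / tau_g
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.diag(log_p).sum()                # -sum_i log p(i matches i)
```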

A dynamic temperature \tau_g(e), where e is the epoch index, is adopted:

\tau_g(e) = \begin{cases} \tau_g^{\mathrm{init}}, & \text{if } e \leq W_U \\ \tau_g^{\mathrm{init}} / \log_{W_U}(e), & \text{if } e > W_U \end{cases}
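The schedule can be written directly (a sketch; the defaults mirror the CIFAR-100 settings reported in Section 4):

```python
import numpy as np

def tau_g(epoch, tau_init=0.1, warmup=15):
    """Dynamic contrastive temperature: constant during warm-up, then
    decayed as tau_init / log_{W_U}(e), sharpening the contrast in
    later epochs."""
    if epoch <= warmup:
        return tau_init
    # log base W_U via change of base: log_{W_U}(e) = ln(e) / ln(W_U)
    return tau_init / (np.log(epoch) / np.log(warmup))
```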

3.3 Token-wise Contextual Loss

To transfer inner-instance semantic context, the method matches the self-similarity matrices of patch tokens within each image. For the penultimate feature map F:

\mathrm{CS} = \mathrm{Softmax}\left(\frac{F F^\top}{\sqrt{d}} \right), \quad \mathrm{CS}^T, \mathrm{CS}^S \in \mathbb{R}^{M \times M}

L_\text{inner} = \|\mathrm{CS}^T - \mathrm{CS}^S\|^2_F

This constrains the student to preserve internal patch arrangements consistent with the teacher.
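A NumPy sketch of this self-similarity matching for one image's tokens (illustrative; the teacher and student may have different token dimensions, since CS is M×M either way):

```python
import numpy as np

def softmax_rows(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def self_sim(F):
    """CS = softmax(F F^T / sqrt(d)) for the (M, d) tokens of one image."""
    d = F.shape[1]
    return softmax_rows(F @ F.T / np.sqrt(d))

def inner_context_loss(F_s, F_t):
    """L_inner = || CS^T - CS^S ||_F^2 (squared Frobenius norm)."""
    return ((self_sim(F_t) - self_sim(F_s)) ** 2).sum()
```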

4. Empirical Evaluation

Datasets and Architectures

  • Datasets: CIFAR-100, CIFAR-100-LT (imbalance ratios 10, 50, 100), ImageNet-1K, ImageNet-LT (imbalance 10).
  • CNN-based models: ResNet-32×4 → ResNet-8×4, ResNet56 → ResNet20, VGG13 → VGG8, WRN-40-2 → WRN-40-1, ShuffleNet → MobileNet.
  • ViT-based models: DeiT-Tiny/Small students, ResNet-101 or CeiT-Base teachers.

Training Configurations

  • CIFAR-100: SGD with Nesterov momentum 0.9 and weight decay 5 \times 10^{-4}, 240 epochs, LR reduced at epochs 150, 180, and 210. \tau = 4, \tau_g^{\mathrm{init}} = 0.1, warm-up W_U = 15.
  • ImageNet & ViTs: CNNs – 100 epochs, cosine LR schedule, batch size 128×4. ViTs – AdamW, LR 5 \times 10^{-4}, 200 epochs, 10-epoch warm-up. All experiments on 4× RTX 3090 GPUs.

Performance Summary

TRD achieves superior accuracy and robustness, notably:

| Setting | TRD Top-1 (%) | Strongest baselines (%) |
| --- | --- | --- |
| ResNet-32×4 → ResNet-8×4 (CIFAR) | 76.42 | HKD: 76.21, KD: 74.12 |
| ShuffleV1 ← ResNet-32×4 | 76.42 | DKD: 76.42, HKD: 75.99 |
| ResNet34 → ResNet18 (ImageNet) | 71.31 | KD +1.56% |
| DeiT-Tiny ← ResNet101 | 75.5 | KD: 74.8, HKD: 75.2 |
| DeiT-Small ← CeiT-Base | 81.8 | HKD: 81.3 |

On long-tailed variants:

  • CIFAR-100-LT: TRD degrades less as imbalance increases and can surpass teacher accuracy.
  • ImageNet-LT: TRD Top-1 50.32% (a 20.99% drop) vs. KD 46.70% (a 24.00% drop).

Ablation studies confirm that each loss (L_\text{inner}, L_\text{local}, L_\text{global}) contributes 0.3%–1.0% accuracy, with token-level graphs outperforming instance-level graphs. The dynamic \tau_g provides smoother optimization and lower embedding divergence.

5. Analysis: Contextualization and Visualizations

Several key insights arise from empirical analysis:

  • Larger token samples (e.g., 512 tokens) strengthen the representational capacity of the graph structure, yielding modest additional accuracy gains.
  • t-SNE projections show TRD features are more class-separable than those from KD, IRG, or HKD, indicating successful transfer of fine-grained relational information.
  • The addition of each loss component leads to measurable, compositional gains, confirming efficacy of the multi-term objective (Zhang et al., 2023).

6. Limitations and Prospective Directions

Identified challenges and future research include:

  • Computational demands: Full construction of k-NN graphs over O(BK) tokens is resource-intensive. Practical deployments may require efficient or approximate graph-building techniques such as locality-sensitive hashing.
  • Hyperparameter sensitivity: Optimal settings for \alpha, \beta, \gamma, k, \sigma, \tau_g^{\mathrm{init}}, and W_U must be tuned per dataset and architecture.
  • Beyond classification: Extending TRD to object detection, semantic segmentation, or temporal token graphs for video.
  • Graph topology learning: Instead of fixed kk-NN, learning the adjacency matrix during training.
  • Self-distillation and multiscale transfer: Applying token-level relations for intra-network feature alignment.
  • Cross-modal distillation: Adapting the framework to transfer relational information across modalities (e.g., image-text pairs in vision-LLMs).

TRD advances the progression from basic logit-based KD [Hinton et al., 2015] to feature and relation-based approaches, such as RKD [Park et al., 2019], IRG [Liu et al., 2019], and graph-based distillation [Zhou et al., 2021]. Distinct from prior work that either matches holistic features or instance relationships, TRD’s explicit modeling of token-level graphs enables new forms of fine-grained and spatially-aware knowledge transfer, especially applicable to architectures with patch- or region-based representations (e.g., ViT [Dosovitskiy et al., 2021]). Use of InfoNCE and contrastive paradigms is aligned with the research trajectory outlined in [Tian et al., 2019; Wang & Isola, 2020].

A plausible implication is that the token-centric relational framework may generalize to settings with structured semantic dependencies beyond image classification, suggesting future exploration into relational and multi-modal knowledge transfer regimes (Zhang et al., 2023).
