Asymmetric Relational KD

Updated 8 May 2026

The paper demonstrates that ARKD improves distillation by aligning higher-order relational distributions using asymmetric, one-sided penalties.
It introduces relation-aware objectives that leverage temperature scaling and median-based geometric splits to regulate shrink and expand forces.
Empirical studies on CIFAR-100, STL-10, and TinyImageNet confirm ARKD's benefits in clustering, transfer performance, and robust model training.

Asymmetric Relational Knowledge Distillation (ARKD) refers to a class of knowledge distillation techniques designed to transfer structural or relational knowledge from a teacher model to a student model using asymmetric, relation-aware objectives. Unlike standard distillation relying exclusively on class-level distributions or per-sample feature alignment, ARKD matches higher-order similarities—e.g., inter-sample affinities or pairwise distances—while preserving critical geometric and affinity relationships encoded by the teacher. ARKD is particularly motivated by scenarios where fine-grained relational structure governs downstream utility, including both single- and multi-teacher distillation frameworks for compact or Mixture-of-Experts vision models (Giakoumoglou et al., 2024, Chaybouti et al., 23 Dec 2025).

1. Mathematical Formulation and Variants

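Relational Distribution Matching (Single-Teacher, Giakoumoglou et al., 2024)

Let $f^T$ and $f^S$ denote the frozen teacher and trainable student encoders, respectively. Given an anchor $x_i$, the embeddings are $z_i^T = f^T(x_i)$ and $z_i^S = f^S(x_i)$. A queue $Q$ holds $M$ teacher embeddings $\{z_k^T\}_{k=1}^{M}$. With a pairwise similarity $\phi$ (e.g., cosine similarity on $\ell_2$-normalized features), teacher temperature $\tau_t$, and student temperature $\tau_s$, one computes:

Teacher-side relational distribution:

$$p_i^T(k) = \frac{\exp\big(\phi(z_i^T, z_k^T)/\tau_t\big)}{\sum_{j=1}^{M} \exp\big(\phi(z_i^T, z_j^T)/\tau_t\big)}$$

Student-side relational distribution:

$$p_i^S(k) = \frac{\exp\big(\phi(z_i^S, z_k^T)/\tau_s\big)}{\sum_{j=1}^{M} \exp\big(\phi(z_i^S, z_j^T)/\tau_s\big)}$$

The ARKD loss aligns these via Kullback-Leibler divergence over the $N$ samples in a batch:

$$\mathcal{L}_{\text{ARKD}} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{KL}\big(p_i^T \,\|\, p_i^S\big) = \frac{1}{N} \sum_{i=1}^{N} \Big[ H\big(p_i^T, p_i^S\big) - H\big(p_i^T\big) \Big]$$

In practice, only the cross-entropy term $H(p_i^T, p_i^S) = -\sum_{k=1}^{M} p_i^T(k) \log p_i^S(k)$ is back-propagated, since the teacher entropy $H(p_i^T)$ does not depend on the student.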
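The relational distribution matching described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, the choice of cosine similarity for $\phi$, and the default temperatures are assumptions made here (the defaults are placeholders, not values from the paper).

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean norm."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def relational_distillation_loss(z_s, z_t, queue, tau_s=0.1, tau_t=0.2):
    """Return (cross-entropy, KL) between teacher and student relational
    distributions over a queue of teacher embeddings.

    z_s, z_t: (N, d) student / teacher embeddings of the same anchors.
    queue:    (M, d) teacher embeddings held in memory.
    tau_s, tau_t: hypothetical placeholder temperatures.
    """
    z_s, z_t, queue = l2_normalize(z_s), l2_normalize(z_t), l2_normalize(queue)
    log_p_t = log_softmax((z_t @ queue.T) / tau_t)  # teacher side, (N, M)
    log_p_s = log_softmax((z_s @ queue.T) / tau_s)  # student side, (N, M)
    p_t = np.exp(log_p_t)
    cross_entropy = -(p_t * log_p_s).sum(axis=1)    # term that carries gradient
    kl = (p_t * (log_p_t - log_p_s)).sum(axis=1)    # CE minus teacher entropy
    return cross_entropy.mean(), kl.mean()

rng = np.random.default_rng(0)
anchors_t = rng.normal(size=(4, 16))
anchors_s = rng.normal(size=(4, 16))
memory = rng.normal(size=(32, 16))
ce, kl = relational_distillation_loss(anchors_s, anchors_t, memory)
print(f"cross-entropy={ce:.4f}, KL={kl:.4f}")
```

The cross-entropy and the KL divergence differ only by the teacher entropy, which is constant with respect to the student, so minimizing either gives the same student gradients.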

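Asymmetric Pairwise Distance Alignment (Multi-Teacher, Chaybouti et al., 23 Dec 2025)

For each teacher $t$ and a batch of size $B$:

Teacher summaries: $\{s_i^t\}_{i=1}^{B}$
Student summaries: $\{\hat{s}_i^t\}_{i=1}^{B}$

Pairwise Euclidean distances:

$$D^T_{ij} = \|s_i^t - s_j^t\|_2, \qquad D^S_{ij} = \|\hat{s}_i^t - \hat{s}_j^t\|_2$$

Normalized by teacher average:

$$\mu = \frac{1}{B(B-1)} \sum_{i \neq j} D^T_{ij}$$

$$\tilde{D}^T_{ij} = D^T_{ij} / \mu, \qquad \tilde{D}^S_{ij} = D^S_{ij} / \mu$$

Median split $m = \operatorname{median}_{i \neq j} \tilde{D}^T_{ij}$; then define

$$r^{\text{shrink}}_{ij} = \mathbb{1}\big[\tilde{D}^T_{ij} \le m\big] \, \max\big(0, \tilde{D}^S_{ij} - \tilde{D}^T_{ij}\big)$$
$$r^{\text{expand}}_{ij} = \mathbb{1}\big[\tilde{D}^T_{ij} > m\big] \, \max\big(0, \tilde{D}^T_{ij} - \tilde{D}^S_{ij}\big)$$

Weights: $\lambda_{\text{shrink}}$, $\lambda_{\text{expand}}$.

Final ARKD loss (using smooth-L1 $\rho$):

$$\mathcal{L}^{t}_{\text{ARKD}} = \frac{1}{B(B-1)} \sum_{i \neq j} \Big[ \lambda_{\text{shrink}} \, \rho\big(r^{\text{shrink}}_{ij}\big) + \lambda_{\text{expand}} \, \rho\big(r^{\text{expand}}_{ij}\big) \Big]$$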
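The multi-teacher steps above (pairwise distances, normalization by the teacher average, median split, one-sided smooth-L1 penalties) can be sketched as follows for a single teacher. This is an illustrative reading, not the authors' code: the function names, the unit default weights `w_shrink` and `w_expand`, the smooth-L1 threshold `beta`, and the averaging over off-diagonal pairs are all assumptions made here.

```python
import numpy as np

def pairwise_distances(x):
    """Euclidean distance between every pair of rows of x, shape (B, B)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def smooth_l1(r, beta=1.0):
    """Smooth-L1 (Huber-style) penalty: quadratic near zero, linear beyond beta."""
    a = np.abs(r)
    return np.where(a < beta, 0.5 * a ** 2 / beta, a - 0.5 * beta)

def asymmetric_relational_loss(s_teacher, s_student, w_shrink=1.0, w_expand=1.0):
    """One-sided pairwise distance alignment for one teacher.

    s_teacher, s_student: (B, d) global summaries of the same batch.
    w_shrink, w_expand: hypothetical weights for the two one-sided terms.
    """
    batch = s_teacher.shape[0]
    off_diag = ~np.eye(batch, dtype=bool)            # ignore self-pairs
    d_t = pairwise_distances(s_teacher)
    d_s = pairwise_distances(s_student)
    mu = max(d_t[off_diag].mean(), 1e-12)            # teacher average distance
    d_t, d_s = d_t / mu, d_s / mu                    # both scaled by teacher mean
    m = np.median(d_t[off_diag])                     # median split on teacher side
    close = (d_t <= m) & off_diag
    far = (d_t > m) & off_diag
    shrink = np.maximum(d_s - d_t, 0.0) * close      # close pairs placed too far
    expand = np.maximum(d_t - d_s, 0.0) * far        # far pairs placed too close
    per_pair = w_shrink * smooth_l1(shrink) + w_expand * smooth_l1(expand)
    return per_pair.sum() / off_diag.sum()

rng = np.random.default_rng(0)
teacher = rng.normal(size=(8, 32))
print(asymmetric_relational_loss(teacher, teacher))        # identical geometry
print(asymmetric_relational_loss(teacher, 2.0 * teacher))  # student too spread out
```

In this sketch a student that is uniformly too spread out is penalized only through the shrink term on teacher-close pairs, and a student that is uniformly too compressed only through the expand term on teacher-far pairs, which is the one-sided behavior the text describes.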

2. Theoretical Foundations and Connections

ARKD is situated at the intersection of classical knowledge distillation (KD by Hinton et al.), contrastive (InfoNCE) learning, and relational knowledge distillation (RKD):

KD: Standard KL-based KD matches class-level soft distributions. ARKD generalizes this to non-class relational affinity, replacing class softmax with relational softmax over samples (Giakoumoglou et al., 2024).
InfoNCE: Contrastive learning attracts positives and pushes apart negatives. ARKD generalizes InfoNCE by matching full distributions of similarities rather than a single positive/negative split, and it reduces to InfoNCE when the memory holds a single positive and the teacher and student temperatures are equal.
Relational KD: Symmetric penalties may distort cluster geometry by encouraging both contraction and expansion for every pair. ARKD introduces asymmetry: pairs the teacher places close are penalized only when the student puts them too far apart, and pairs the teacher places far are penalized only when the student puts them too close, which preserves the teacher-imposed geometric structure (Chaybouti et al., 23 Dec 2025).

This unification and relaxation yield more robust relational transfer than rigid instance discrimination or class-level alignment alone.
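One way to read the InfoNCE connection stated above is sketched below. It assumes that the teacher distribution collapses onto a single positive index $k^+$ and that both sides share one temperature $\tau$. Here $p_i^T(k)$ and $p_i^S(k)$ denote the teacher and student softmax distributions over similarities to the $M$ queue entries; $k^+$ and $\tau$ are notation introduced only for this illustration.

```latex
\begin{align*}
p_i^T(k) &= \mathbb{1}[k = k^+], \\
H\big(p_i^T, p_i^S\big)
  &= -\sum_{k=1}^{M} p_i^T(k) \log p_i^S(k)
   = -\log p_i^S(k^+) \\
  &= -\log \frac{\exp\big(\phi(z_i^S, z_{k^+}^T)/\tau\big)}
                {\sum_{j=1}^{M} \exp\big(\phi(z_i^S, z_j^T)/\tau\big)} .
\end{align*}
```

The last line has the form of the InfoNCE loss, with $z_{k^+}^T$ as the positive and the remaining queue entries as negatives.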

3. Asymmetry and Temperature Effects

ARKD introduces explicit asymmetry in its objectives, either via temperature (single-teacher) or geometric median splits (multi-teacher):

Temperature asymmetry: In Giakoumoglou et al., 2024, the student temperature is lower than the teacher temperature, $\tau_s < \tau_t$. The student's relational distribution is sharpened, emphasizing the most salient relationships, while the teacher's is smoothed, retaining secondary affinities. This encourages the student to capture strong teacher invariants without discarding finer structure.
One-sided geometric regularization: In (Chaybouti et al., 23 Dec 2025), penalties are conditioned on whether the teacher considered samples close or far. There is no force for the student to bring samples closer if the teacher deemed them distant (and vice versa), avoiding overcompression or overspreading, with the decision boundary set by the median inter-sample distance.

A plausible implication is that this balance leads to better clustering and transfer characteristics than symmetric relational objectives, as supported by empirical findings.
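The effect of temperature asymmetry can be illustrated with a small numerical example. The similarity values and the two temperatures below are made up for illustration and are not taken from either paper; the point is only that a lower temperature yields a sharper (lower-entropy) distribution over the same similarities.

```python
import numpy as np

def softmax_with_temperature(similarities, tau):
    """Softmax over similarities scaled by 1 / tau."""
    scaled = similarities / tau
    scaled = scaled - scaled.max()  # numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum()

def entropy(p):
    """Shannon entropy in nats."""
    return float(-(p * np.log(p + 1e-12)).sum())

# Hypothetical cosine similarities between one anchor and four queue entries.
sims = np.array([0.9, 0.7, 0.2, -0.1])

p_sharp = softmax_with_temperature(sims, tau=0.1)   # lower temperature
p_smooth = softmax_with_temperature(sims, tau=0.5)  # higher temperature

print("sharp :", np.round(p_sharp, 3), "entropy:", round(entropy(p_sharp), 3))
print("smooth:", np.round(p_smooth, 3), "entropy:", round(entropy(p_smooth), 3))
```

The sharper distribution concentrates mass on the top-ranked neighbor, while the smoother one keeps non-negligible mass on secondary neighbors.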

4. Empirical Evidence and Experimental Protocol

Setup (Giakoumoglou et al., 2024, single-teacher)

Datasets: CIFAR-100, STL-10, TinyImageNet.
Architectures (teacher $\rightarrow$ student): ResNet-110 $\rightarrow$ ResNet-20, WRN-40-2 $\rightarrow$ WRN-16-2, as well as cross-family (ResNet-50 $\rightarrow$ MobileNet-V2).
Prototype configuration: memory bank $M = 16{,}384$; batch size 64; projection head: linear, 128-dim, $\ell_2$-normalized.
Training regime: 240 epochs, SGD with momentum 0.9 and weight decay, cosine LR schedule, initial LR 0.05.
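As an illustration of the training regime, a cosine learning-rate schedule over 240 epochs starting at 0.05 could look like the sketch below. The function name, the absence of warmup, and the final learning rate of zero are assumptions made here; the source specifies only the schedule type, the epoch count, and the initial LR.

```python
import math

def cosine_lr(epoch, total_epochs=240, base_lr=0.05, min_lr=0.0):
    """Cosine-annealed learning rate for a given epoch (no warmup assumed)."""
    progress = min(max(epoch, 0), total_epochs) / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for epoch in (0, 60, 120, 180, 240):
    print(epoch, cosine_lr(epoch))
```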

Results

CIFAR-100, Top-1 (%)

Method	WRN-40-2 $\rightarrow$ WRN-16-2	ResNet-110 $\rightarrow$ ResNet-32
Vanilla	71.98	71.14
KD	73.54	73.08
CRD+KD	74.38	73.75
ARKD+KD	73.77	73.48

Cross-family, Top-1 (%)

Method	ResNet-50 $\rightarrow$ MobileNet-V2	WRN-40-2 $\rightarrow$ ShuffleNet-V1
KD	67.35	74.83
CRD+KD	69.54	76.27
ARKD+KD	69.13	76.31

Transfer (WRN-40-2 $\rightarrow$ WRN-16-2 features)

Transfer	KD	CRD+KD	ARKD+KD
C100 $\rightarrow$ STL-10	70.9	72.2	72.0
C100 $\rightarrow$ TIN-200	33.9	35.5	35.0

Multi-teacher (AMoE, Chaybouti et al., 23 Dec 2025)

Setup: token-balanced batching, PHI-S scaling, two-stage distillation.
Empirical ablation on image-text classification (DINOv3 head):

Method	Img-Text Avg	kNN Avg
Vanilla MT	63.71	81.57
RKD (sym)	77.48	81.36
ARKD	77.68	81.99

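Symmetric RKD improves image-text alignment over the vanilla multi-teacher baseline (63.71 to 77.48) but slightly lowers kNN cluster quality (81.57 to 81.36); ARKD improves both (77.68 and 81.99).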

A similar trend is observed for retrieval benchmarks (MSCOCO5k, Flickr30k): ARKD adds consistent gains to both alignment and clustering over vanilla and symmetric alternatives.

5. Implementation and Integration Details

Computational Aspects (Giakoumoglou et al., 2024)

Batch size: 64; memory bank: 16,384 samples (8 MB for 128-dim features).
Loss weights: the ARKD term is weighted over a tuned range, the KD term has its own weight, and the cross-entropy weight is 1.
Optimizer: SGD (momentum 0.9) with weight decay, initial LR 0.05.
Projection head: 128-dim linear + $\ell_2$ normalization.
Overhead: about 5.24 MFLOPs on ResNet-50 (about 0.26%) and about 8 MB of GPU memory.
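The 8 MB memory-bank figure can be checked with simple arithmetic, assuming each feature value is stored as a 32-bit float (the storage precision is an assumption here, not stated in the source):

```python
def memory_bank_bytes(num_embeddings=16_384, dim=128, bytes_per_value=4):
    """Storage needed for a dense memory bank of embeddings."""
    return num_embeddings * dim * bytes_per_value

size_bytes = memory_bank_bytes()
print(size_bytes)          # 8388608
print(size_bytes / 2**20)  # 8.0 (MiB)
```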

Multi-teacher Integration (Chaybouti et al., 23 Dec 2025)

ARKD is computed on each batch over all global teacher and student CLS summaries.
The paper's pseudocode condenses the objective into four steps: pairwise distance calculation, normalization, a median-based mask for expand/shrink, and application of the smooth-L1 loss.
The loss is added unweighted to the per-teacher global losses (cosine loss, MSE on patch/register tokens), with normalization by token count.
Token-balanced batching ensures fair image-token usage per batch.
Native-resolution images are fully mixed within token budgets; FlexAttention prevents cross-image attention.

6. Geometric and Learning Dynamics

ARKD acts as a regularizer on student geometry, stabilizing relational loss (notably for DINOv3) and enabling faster convergence in zero-shot alignment as evidenced in empirical studies (Chaybouti et al., 23 Dec 2025). By enforcing only one-sided penalties, ARKD respects the teacher's local and global structure—shrinking only for pairs that should be close, expanding only for pairs that should remain far. The asymmetric formulation prevents overcompression or over-scattering typically observed with naive symmetric relational KD. Empirical ablations indicate ARKD not only improves instance-level alignment (image-text, retrieval) but also preserves or enhances cluster quality (kNN), supporting its efficacy as a direct regularizer of geometry in both single- and multi-teacher protocols.

7. Significance and Applications

ARKD provides a principled method to enhance compact model training by transferring teacher relational structure, leading to more robust, transferable, and high-performing student models, especially in vision. It integrates seamlessly with existing KD objectives and workflows, incurs minor computational overhead, and is modular for use in both single- and multi-teacher (e.g., Mixture-of-Experts) regimes (Giakoumoglou et al., 2024, Chaybouti et al., 23 Dec 2025). Empirical results demonstrate consistent improvement over both vanilla and competitive relational distillation methods, validating the geometric and affinity-aware asymmetry at the heart of ARKD design.

Key references: "Relational Representation Distillation" (Giakoumoglou et al., 2024), "AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model" (Chaybouti et al., 23 Dec 2025).

References (2)

Relational Representation Distillation (2024)

AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model (2025)
