Asymmetric Relational KD

Updated 8 May 2026
  • The paper demonstrates that ARKD improves distillation by aligning higher-order relational distributions using asymmetric, one-sided penalties.
  • It introduces relation-aware objectives that leverage temperature scaling and median-based geometric splits to regulate shrink and expand forces.
  • Empirical studies on CIFAR-100, STL-10, and TinyImageNet confirm ARKD's benefits in clustering, transfer performance, and robust model training.

Asymmetric Relational Knowledge Distillation (ARKD) refers to a class of knowledge distillation techniques designed to transfer structural or relational knowledge from a teacher model to a student model using asymmetric, relation-aware objectives. Unlike standard distillation relying exclusively on class-level distributions or per-sample feature alignment, ARKD matches higher-order similarities—e.g., inter-sample affinities or pairwise distances—while preserving critical geometric and affinity relationships encoded by the teacher. ARKD is particularly motivated by scenarios where fine-grained relational structure governs downstream utility, including both single- and multi-teacher distillation frameworks for compact or Mixture-of-Experts vision models (Giakoumoglou et al., 2024, Chaybouti et al., 23 Dec 2025).

1. Mathematical Formulation and Variants

In the single-teacher setting (Giakoumoglou et al., 2024), let $f^T$ and $f^S$ denote frozen teacher and trainable student encoders, respectively. Given an anchor $x_i$, the embeddings are $z_i^T = f^T(x_i)$ and $z_i^S = f^S(x_i)$. A queue $Q$ holds $M$ teacher embeddings $\{z_k^T\}$. Defining a pairwise similarity $\phi$ (e.g., cosine on $\ell_2$-normalized features), one computes:

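  • Teacher-side relational distribution, with teacher temperature $\tau_t$:

$$p_i^T(k) = \frac{\exp\left(\phi(z_i^T, z_k^T)/\tau_t\right)}{\sum_{j=1}^{M} \exp\left(\phi(z_i^T, z_j^T)/\tau_t\right)}$$

  • Student-side relational distribution, with student temperature $\tau_s$:

$$p_i^S(k) = \frac{\exp\left(\phi(z_i^S, z_k^T)/\tau_s\right)}{\sum_{j=1}^{M} \exp\left(\phi(z_i^S, z_j^T)/\tau_s\right)}$$

The ARKD loss aligns these via Kullback-Leibler divergence over samples:

$$\mathcal{L}_{\mathrm{ARKD}} = \sum_i \mathrm{KL}\left(p_i^T \,\|\, p_i^S\right) = \sum_i \left[ H(p_i^T, p_i^S) - H(p_i^T) \right]$$

In practice, only the cross-entropy term $H(p_i^T, p_i^S) = -\sum_{k=1}^{M} p_i^T(k) \log p_i^S(k)$ is back-propagated, since the teacher entropy $H(p_i^T)$ does not depend on the student parameters.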
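The single-teacher objective can be sketched in a few lines. The following is a minimal illustration rather than the authors' implementation: it assumes cosine similarity, a tiny hand-written queue, and hypothetical temperatures (`tau_t=0.1`, `tau_s=0.05`), and it uses plain Python lists where a real training loop would use a tensor library. The function names are invented for this example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def log_softmax(scores, tau):
    """Log of the temperature-scaled softmax, computed stably."""
    scaled = [s / tau for s in scores]
    peak = max(scaled)
    log_total = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return [s - log_total for s in scaled]

def relational_cross_entropy(z_teacher, z_student, queue, tau_t, tau_s):
    """Cross-entropy between teacher and student relational distributions.

    Both distributions are softmaxes over similarities to the same queue
    of stored teacher embeddings; only the temperatures differ.
    """
    log_p_t = log_softmax([cosine(z_teacher, z) for z in queue], tau_t)
    log_p_s = log_softmax([cosine(z_student, z) for z in queue], tau_s)
    return -sum(math.exp(lt) * ls for lt, ls in zip(log_p_t, log_p_s))

# Toy queue of three stored teacher embeddings (illustrative values only).
queue = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
z_t = [0.9, 0.1]

# A student that agrees with the teacher versus one that points elsewhere.
aligned = relational_cross_entropy(z_t, [0.9, 0.1], queue, tau_t=0.1, tau_s=0.05)
misaligned = relational_cross_entropy(z_t, [0.1, 0.9], queue, tau_t=0.1, tau_s=0.05)
```

In this toy setup the student whose embedding relates to the queue the same way the teacher's does receives the smaller loss, which is the behavior the objective is meant to reward.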

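In the multi-teacher setting (Chaybouti et al., 23 Dec 2025), the loss is computed separately for each teacher over each batch, from that teacher's summaries and the corresponding student summaries:

  • Pairwise Euclidean distances are computed among the teacher summaries and among the student summaries.
  • The distances are normalized by the teacher's average pairwise distance.
  • The pairs are split at the median distance into a shrink set (pairs the teacher places close together) and an expand set (pairs the teacher places far apart).
  • Each of the two sets receives its own weight.
  • The final ARKD loss applies a smooth-L1 penalty to each set in one direction only: shrink pairs are penalized when the student places them too far apart, and expand pairs when the student places them too close.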
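The multi-teacher steps (pairwise distances, teacher-average normalization, median split, one-sided smooth-L1) can be sketched as follows. This is an illustrative reading, not the published formulation, and several details are assumptions made here: both distance sets are divided by the teacher's mean distance, the median is taken over teacher distances with ties assigned to the shrink set, each set is averaged over its own size in place of the original weights, and smooth-L1 uses `beta=1.0`. All function names are hypothetical.

```python
import math
import statistics

def pairwise_distances(points):
    """Euclidean distance for every unordered pair (i, j) with i < j."""
    n = len(points)
    return {(i, j): math.dist(points[i], points[j])
            for i in range(n) for j in range(i + 1, n)}

def smooth_l1(x, beta=1.0):
    """Smooth-L1 penalty: quadratic below beta, linear above (beta assumed)."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta

def asymmetric_relational_loss(teacher, student):
    """One-sided relational loss over a batch of summary vectors."""
    d_t = pairwise_distances(teacher)
    d_s = pairwise_distances(student)

    # Normalize both sets of distances by the teacher's average distance.
    scale = statistics.fmean(d_t.values())
    d_t = {pair: d / scale for pair, d in d_t.items()}
    d_s = {pair: d / scale for pair, d in d_s.items()}

    # Median split on teacher distances: close pairs shrink, far pairs expand.
    median = statistics.median(d_t.values())
    shrink = [pair for pair in d_t if d_t[pair] <= median]
    expand = [pair for pair in d_t if d_t[pair] > median]

    # Shrink pairs are penalized only when the student is too far apart;
    # expand pairs only when the student is too close.
    shrink_term = sum(smooth_l1(max(0.0, d_s[p] - d_t[p])) for p in shrink)
    expand_term = sum(smooth_l1(max(0.0, d_t[p] - d_s[p])) for p in expand)

    # Assumed weighting: average each term over the size of its own set.
    return shrink_term / max(len(shrink), 1) + expand_term / max(len(expand), 1)

teacher = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
# Tighter on a close pair, slightly farther on the far pair: not penalized.
relaxed_student = [[0.3, -0.3], [1.0, 0.0], [5.0, 5.0]]
# Every point collapsed together: the far pair is now too close.
collapsed_student = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

relaxed_loss = asymmetric_relational_loss(teacher, relaxed_student)
collapsed_loss = asymmetric_relational_loss(teacher, collapsed_student)
```

The `relaxed_student` case is where the asymmetry shows: a symmetric distance-matching loss would penalize any deviation from the teacher's distances, whereas this sketch leaves a student alone when it errs in the direction the teacher already prefers.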

2. Theoretical Foundations and Connections

ARKD is situated at the intersection of classical knowledge distillation (KD by Hinton et al.), contrastive (InfoNCE) learning, and relational knowledge distillation (RKD):

  • KD: Standard KL-based KD matches class-level soft distributions. ARKD generalizes this to non-class relational affinity, replacing class softmax with relational softmax over samples (Giakoumoglou et al., 2024).
  • InfoNCE: Contrastive learning pushes apart negatives and attracts positives. ARKD generalizes InfoNCE by matching distributions of similarities, not just single positive/negative splits. ARKD reduces to InfoNCE when the memory holds a single positive and the teacher and student temperatures are equal.
  • Relational KD: Symmetric penalties may distort cluster geometry by encouraging both contraction and expansion. ARKD introduces asymmetry: pairs that should be close are penalized only for being too far apart, and pairs that should be far are penalized only for being too close, which preserves the teacher-imposed geometric structure (Chaybouti et al., 23 Dec 2025).
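One way to read the InfoNCE connection, as a sketch under an added assumption: write $p_i^S(k)$ for the student's softmax over similarities $\phi$ to the $M$ stored teacher embeddings $z_k^T$ at temperature $\tau$, and suppose the teacher distribution $p_i^T$ puts all of its mass on a single positive index $k^+$. This one-hot idealization is introduced here for illustration and is not taken from either paper. The cross-entropy between the two distributions then collapses to the familiar InfoNCE form:

```latex
\begin{aligned}
H(p_i^T, p_i^S)
  &= -\sum_{k=1}^{M} p_i^T(k)\,\log p_i^S(k) \\
  &= -\log p_i^S(k^+)
     \qquad \text{(since } p_i^T(k) = \mathbb{1}[k = k^+]\text{)} \\
  &= -\log \frac{\exp\left(\phi(z_i^S, z_{k^+}^T)/\tau\right)}
                 {\sum_{j=1}^{M} \exp\left(\phi(z_i^S, z_j^T)/\tau\right)} .
\end{aligned}
```

With a softer teacher distribution, the same expression becomes a weighted sum of such terms, one per stored embedding, which is the sense in which distribution matching relaxes a single positive/negative split.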

This unification and relaxation yield more robust relational transfer than rigid instance discrimination or class-level alignment alone.

3. Asymmetry and Temperature Effects

ARKD introduces explicit asymmetry in its objectives, either via temperature (single-teacher) or geometric median splits (multi-teacher):

  • Temperature asymmetry: In (Giakoumoglou et al., 2024), the student and teacher relational distributions use different temperatures. The student's relational distribution is sharpened, emphasizing the most salient relationships, while the teacher's is smoothed, retaining secondary affinities. This encourages the student to capture strong teacher invariants without discarding finer structure.
  • One-sided geometric regularization: In (Chaybouti et al., 23 Dec 2025), penalties are conditioned on whether the teacher considered samples close or far. There is no force for the student to bring samples closer if the teacher deemed them distant (and vice versa), avoiding overcompression or overspreading, with the decision boundary set by the median inter-sample distance.
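The effect of temperature on a relational distribution is easy to see numerically. The sketch below applies a softmax to one fixed set of similarity scores at two temperatures; the scores and the temperature values (0.05 and 0.5) are hypothetical and chosen only to make the contrast visible.

```python
import math

def softmax(scores, tau):
    """Temperature-scaled softmax with max-subtraction for stability."""
    peak = max(scores)
    exps = [math.exp((s - peak) / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

# Similarities of one anchor to four stored embeddings (illustrative values).
similarities = [0.9, 0.7, 0.2, -0.1]

sharp = softmax(similarities, tau=0.05)   # low temperature: mass on the top match
smooth = softmax(similarities, tau=0.5)   # high temperature: secondary matches kept
```

The low-temperature distribution concentrates almost all of its mass on the strongest relationship, while the high-temperature one keeps noticeable weight on the weaker ones, matching the sharpened-versus-smoothed description above.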

A plausible implication is that this balance leads to better clustering and transfer characteristics than symmetric relational objectives, as supported by empirical findings.

4. Empirical Evidence and Experimental Protocol

  • Datasets: CIFAR-100, STL-10, TinyImageNet.
  • Architectures (teacher → student): ResNet-110 → ResNet-20, WRN-40-2 → WRN-16-2, as well as cross-family (ResNet-50 → MobileNet-V2).
  • Prototype configuration: memory bank of 16,384 entries; batch size 64; projection head: linear, 128-dim, $\ell_2$-normalized.
  • Training regime: 240 epochs, SGD with momentum 0.9 and weight decay, cosine LR schedule, initial LR 0.05.
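For reference, the cosine learning-rate schedule named in the training regime can be written as a small function. This is a generic sketch using the stated 240 epochs and initial LR of 0.05; stepping once per epoch, the absence of warmup, and annealing all the way to zero (`min_lr=0.0`) are assumptions, since the source does not specify them.

```python
import math

def cosine_lr(epoch, total_epochs=240, base_lr=0.05, min_lr=0.0):
    """Cosine-annealed learning rate at a given epoch (0-indexed)."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Learning rate at the start, midpoint, and end of training.
schedule = [cosine_lr(e) for e in (0, 120, 240)]
```

The rate starts at the initial value, passes through half of it at the midpoint, and reaches the floor at the final epoch.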

Results

CIFAR-100, Top-1 (%)

Method      WRN-40-2 → WRN-16-2    ResNet-110 → ResNet-32
Vanilla     71.98                  71.14
KD          73.54                  73.08
CRD+KD      74.38                  73.75
ARKD+KD     73.77                  73.48

Cross-family

Method      ResNet-50 → MobileNet-V2    WRN-40-2 → ShuffleNet-V1
KD          67.35                       74.83
CRD+KD      69.54                       76.27
ARKD+KD     69.13                       76.31

Transfer (WRN-40-2 → WRN-16-2 features)

Transfer          KD      CRD+KD    ARKD+KD
C100 → STL-10     70.9    72.2      72.0
C100 → TIN-200    33.9    35.5      35.0

Multi-teacher setting

Method        Img-Text Avg    kNN Avg
Vanilla MT    63.71           81.57
RKD (sym)     77.48           81.36
ARKD          77.68           81.99

Symmetric RKD increases alignment but decreases kNN cluster quality; ARKD enhances both.

A similar trend is observed for retrieval benchmarks (MSCOCO5k, Flickr30k): ARKD adds consistent gains to both alignment and clustering over vanilla and symmetric alternatives.

5. Implementation and Integration Details

  • Batch size: 64; memory bank: 16,384 samples (8 MB for 128-dim features).
  • Loss weights: separate weights for the ARKD and KD terms; cross-entropy weight 1.
  • Optimizer: SGD (momentum 0.9) with weight decay, initial LR 0.05.
  • Projection head: 128-dim linear + $\ell_2$ normalization.
  • Overhead: 5.24 MFLOPs on ResNet-50 (0.26%), 8 MB GPU memory.
  • ARKD computed each batch over all global teacher/student CLS summaries.
  • Pseudocode condenses the objective into steps: pairwise distance calculation, normalization, median-based mask for expand/shrink, applying smooth-L1 loss.
  • Loss is added unweighted to per-teacher global losses (cosine loss, MSE on patch/register tokens), with normalization by token count.
  • Token-balanced batching ensures fair image-token usage per batch.
  • Complete mixing of native-resolution images within token budgets; FlexAttention prevents cross-image attention.
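The 8 MB figure quoted for the memory bank can be checked with a quick calculation. The sketch assumes features are stored as 32-bit floats, which the source does not state explicitly, and reads "MB" as mebibytes.

```python
BANK_SIZE = 16_384        # stored embeddings, from the configuration above
FEATURE_DIM = 128         # projection head output dimension
BYTES_PER_VALUE = 4       # assumption: float32 storage

def bank_footprint_mib(entries, dim, bytes_per_value):
    """Memory needed to hold the bank, in MiB (2**20 bytes)."""
    return entries * dim * bytes_per_value / 2**20

footprint = bank_footprint_mib(BANK_SIZE, FEATURE_DIM, BYTES_PER_VALUE)  # 8.0
```

Under the same assumption, storing the bank in half precision would halve the footprint.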

6. Geometric and Learning Dynamics

ARKD acts as a regularizer on student geometry, stabilizing relational loss (notably for DINOv3) and enabling faster convergence in zero-shot alignment as evidenced in empirical studies (Chaybouti et al., 23 Dec 2025). By enforcing only one-sided penalties, ARKD respects the teacher's local and global structure—shrinking only for pairs that should be close, expanding only for pairs that should remain far. The asymmetric formulation prevents overcompression or over-scattering typically observed with naive symmetric relational KD. Empirical ablations indicate ARKD not only improves instance-level alignment (image-text, retrieval) but also preserves or enhances cluster quality (kNN), supporting its efficacy as a direct regularizer of geometry in both single- and multi-teacher protocols.

7. Significance and Applications

ARKD provides a principled method to enhance compact model training by transferring teacher relational structure, leading to more robust, transferable, and high-performing student models, especially in vision. It integrates with existing KD objectives and workflows, incurs minor computational overhead, and is modular for use in both single- and multi-teacher (e.g., Mixture-of-Experts) regimes (Giakoumoglou et al., 2024, Chaybouti et al., 23 Dec 2025). In the reported results, ARKD+KD improves on vanilla training and plain KD and stays within about 0.6 points of CRD+KD, while in the multi-teacher setting ARKD improves on both the vanilla and symmetric RKD baselines, supporting the geometric and affinity-aware asymmetry at the heart of the ARKD design.

Key references: "Relational Representation Distillation" (Giakoumoglou et al., 2024), "AMoE: Agglomerative Mixture-of-Experts Vision Foundation Model" (Chaybouti et al., 23 Dec 2025).
