Cross-Architecture Knowledge Distillation
- Cross-Architecture Knowledge Distillation (CAKD) transfers knowledge between teacher and student models with fundamentally different architectures, bridging representational and inductive biases.
- Innovative techniques like Unified Receptive Field Mapping and Frequency-Domain Transfer align heterogeneous features to overcome spatial and semantic mismatches.
- Empirical results show CAKD improves performance in vision, language, and multimodal tasks, enabling efficient model deployment in resource-constrained environments.
Cross-Architecture Knowledge Distillation (CAKD) refers to a suite of methodologies designed to transfer knowledge from a high-capacity teacher model to a student model with a fundamentally different architecture (e.g., Transformer→CNN, CNN→ViT, MLP→CNN, or even across tokenizers and modalities). Unlike classical knowledge distillation, which assumes a shared architectural or representational paradigm between teacher and student, CAKD seeks to bridge inductive, spatial, and statistical mismatches—enabling deployment of performant yet efficient models suitable for resource-constrained environments and heterogeneous inference scenarios.
1. Foundations and Challenges in Cross-Architecture Distillation
CAKD faces obstacles not present in homogeneous-architecture distillation due to representational and operational diversity:
- Feature Space Misalignment: Teacher features (e.g., ViT tokens, CNN feature grids) differ in spatial/semantic granularity, channel structure, and aggregation mechanisms (Zhao et al., 2023, Yu et al., 28 Oct 2025). Naively aligning them, for example with a direct KL objective, can degrade student performance (Yu et al., 28 Oct 2025, Zhang et al., 29 Jul 2025).
- Inductive Bias Disparity: Transformers encode global interactions via attention, while CNNs emphasize local convolutional context; MLPs and SSMs introduce yet other bias patterns (Liu et al., 2022, Moudgil et al., 1 Apr 2026).
- Mismatch in Output Statistics: Different model families produce outputs at varying scales and distributional spreads, requiring specialized objective construction (Hofstätter et al., 2020, Zhang et al., 29 Jul 2025).
- Teacher Inadaptability to Distillation: Pre-trained teachers lack mechanisms to focus on student-relevant knowledge without loss of discriminative power (Zhao et al., 2023, Lin et al., 15 Jan 2025).
- Unification for Non-Vision Domains: Especially in LLMs, sequence misalignment and vocabulary mismatch compound representational differences, needing dynamic alignment across tokenizers or modalities (Chen et al., 16 Feb 2025, Zhang et al., 29 Apr 2026).
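The feature-space misalignment described in the list above can be made concrete with tensor shapes. The following is a minimal sketch, assuming a ViT-B/16 teacher at 224×224 input (196 patch tokens plus a class token, width 768) and a ResNet-18-style student whose last stage is 512×7×7. The helper names and the random `projector` matrix are hypothetical, and the naive MSE at the end is the kind of direct matching the text warns about, not a recommended objective.

```python
import numpy as np

def vit_tokens_to_grid(tokens, grid_size):
    """Drop the class token and reshape (B, 1 + N, D) tokens to (B, D, g, g)."""
    batch, _, dim = tokens.shape
    patches = tokens[:, 1:, :]  # remove the class token
    grid = patches.reshape(batch, grid_size, grid_size, dim)
    return grid.transpose(0, 3, 1, 2)

def average_pool(grid, factor):
    """Non-overlapping average pooling over the two spatial axes."""
    b, c, h, w = grid.shape
    blocks = grid.reshape(b, c, h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(3, 5))

rng = np.random.default_rng(0)
vit_tokens = rng.normal(size=(2, 1 + 14 * 14, 768))  # teacher: token sequence
cnn_features = rng.normal(size=(2, 512, 7, 7))       # student: feature grid

# Step 1: bring the teacher onto the student's 7x7 spatial layout.
teacher_grid = average_pool(vit_tokens_to_grid(vit_tokens, 14), 2)

# Step 2: channel widths still differ (768 vs 512), so a projector is needed.
# Here it is a random 1x1-convolution-like matrix, purely for illustration.
projector = rng.normal(scale=0.02, size=(512, 768))
teacher_projected = np.einsum("oc,bchw->bohw", projector, teacher_grid)

# Naive pointwise matching, shown only to illustrate the baseline approach.
naive_mse = float(np.mean((cnn_features - teacher_projected) ** 2))
```

Even after the shapes agree, each teacher cell aggregates global attention context while each student cell reflects a local receptive field, which is why the methods below go beyond pointwise matching.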
2. Architectures and Distillation Paradigms
Recent CAKD frameworks support a broad range of heterogeneous teacher-student configurations:
- Computer Vision: Vision Transformers (e.g., ViT, Swin) distilled into efficient CNNs (e.g., MobileNet, ResNet) for face recognition, medical imaging, leaf disease classification, and real-time video models (Zhao et al., 2023, Le et al., 2 May 2026, Yilmaz et al., 23 Jun 2025, Peng et al., 12 Nov 2025).
- NLP and LLMs: Transformer LLMs into small Transformer, Mamba (SSM), or diffusion-LLM architectures, handling independent tokenizers or parallel decoding pathways (Moudgil et al., 1 Apr 2026, Zhang et al., 29 Apr 2026, Singh et al., 29 Sep 2025, Chen et al., 16 Feb 2025).
- Multimodal and Specialized Domains: Artificial Neural Networks (ANNs) to Spiking Neural Networks (SNNs) (Ye et al., 12 Jul 2025); CNNs to Transformers, or vice versa, for segmentation and medical image analysis (Huang et al., 10 Apr 2025, Zheng et al., 2024).
3. Core Methodological Innovations
a. Feature and Representation Alignment
- Unified Receptive Field Mapping (URFM): Projects heterogeneous features onto a shared set of learnable local centers, harmonizing receptive fields via task-aware positional encoding (Facial Positional Encoding, FPE) (Zhao et al., 2023).
- Frequency-Domain Transfer (UHKD): Applies 2D Fourier transforms to abstract away spatial bias, reducing architecture-specific semantics to a common spectral representation, aligned via lightweight adapters (Yu et al., 28 Oct 2025).
- Redundancy Suppression (RSD): Maximizes invariance and decorrelates features across architectures using a batch-normalized Pearson correlation matrix objective, distilling only architecture-agnostic knowledge (Zhang et al., 29 Jul 2025).
- Attention-Map & Groupwise Linear Projectors (PCA/GL): Partial cross-attention and groupwise projections allow spatial and token-based correspondence (student→teacher) without full representational matching (Liu et al., 2022, Le et al., 2 May 2026, Yilmaz et al., 23 Jun 2025).
- Region-Aware Attention (RAA): Self-attention over patchified, multi-stage student features enables spatial and semantic “view” alignment to Transformer/MLP perspectives (Lin et al., 15 Jan 2025).
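The redundancy-suppression idea in the list above (maximize invariance, decorrelate the rest) can be illustrated with a cross-correlation objective in the style of Barlow Twins. This is a minimal sketch and not necessarily the exact RSD loss: the function names, the `off_diag_weight` value, and the assumption that student features were already projected to the teacher's dimension are all illustrative.

```python
import numpy as np

def standardize(x, eps=1e-8):
    """Zero mean and unit variance per feature dimension, across the batch."""
    return (x - x.mean(axis=0)) / (x.std(axis=0) + eps)

def cross_correlation_loss(student_feats, teacher_feats, off_diag_weight=0.01):
    """Invariance plus decorrelation on the (D, D) cross-correlation matrix.

    Both inputs have shape (batch, D); a hypothetical projector is assumed
    to have mapped the student features to the teacher's dimension D.
    """
    n = student_feats.shape[0]
    zs = standardize(student_feats)
    zt = standardize(teacher_feats)
    corr = zs.T @ zt / n                      # Pearson correlations, (D, D)
    diag = np.diag(corr)
    invariance = np.sum((1.0 - diag) ** 2)    # pull matching dims toward 1
    redundancy = np.sum(corr ** 2) - np.sum(diag ** 2)  # push others toward 0
    return float(invariance + off_diag_weight * redundancy)

rng = np.random.default_rng(0)
teacher = rng.normal(size=(256, 8))
aligned_student = 2.0 * teacher + 1.0         # same information, other scale
random_student = rng.normal(size=(256, 8))    # unrelated features

loss_aligned = cross_correlation_loss(aligned_student, teacher)
loss_random = cross_correlation_loss(random_student, teacher)
```

Because the features are standardized over the batch, the aligned student scores a near-zero loss despite its different scale and offset, while the unrelated student is penalized on every diagonal entry.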
b. Teacher Adaptation and Specialization
- Prompt Tuning for Teachers (APT/AFP): Frozen Transformer teachers receive a bank of learnable prompt tokens (or prompt blocks) optimized during distillation, thereby specializing for student-relevant features and preventing collapse into trivial self-distillation (Zhao et al., 2023, Lin et al., 15 Jan 2025).
- Dual-Teacher and Knowledge Mixing: Systems use both a heterogeneous (e.g., ViT) and homogeneous (e.g., CNN) teacher, with student supervision adaptively fused via discrepancy/confidence-aware weighting; residual features between teachers highlight transferable inductive biases (Peng et al., 12 Nov 2025).
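The prompt-tuning mechanism in the list above comes down to concatenating a small bank of learnable tokens with the teacher's input sequence while the teacher's own weights stay frozen. The sketch below shows only that data flow. The name `prepend_prompts`, the bank size of 8, and the token shapes (ViT-B/16 style) are assumptions, and no gradient update is performed here.

```python
import numpy as np

def prepend_prompts(tokens, prompt_bank):
    """Place the same prompt tokens in front of every sample's token sequence.

    tokens:      (batch, num_tokens, dim) input to the frozen teacher
    prompt_bank: (num_prompts, dim), the only tensor that would be trained
    """
    batch = tokens.shape[0]
    prompts = np.broadcast_to(prompt_bank, (batch,) + prompt_bank.shape)
    return np.concatenate([prompts, tokens], axis=1)

rng = np.random.default_rng(0)
patch_tokens = rng.normal(size=(4, 197, 768))        # class token + 196 patches
prompt_bank = rng.normal(scale=0.02, size=(8, 768))  # hypothetical bank size

teacher_input = prepend_prompts(patch_tokens, prompt_bank)
trainable_values = prompt_bank.size                  # 8 * 768 prompt parameters
```

In a real pipeline the prompt bank would receive gradients from the distillation loss, so the teacher adapts through a few thousand parameters rather than through its full weight set.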
c. Relational, Logit-, and Frequency-Level Objectives
- Decoupled Relational Alignment (DFRA): Simultaneously aligns inter-class and inter-sample similarity structures in both logits and projected feature-level spaces, balancing dark knowledge with classification confidence (Yang et al., 10 Feb 2025).
- Margin-MSE for Ranking: In dense/sparse retrieval, margin-matching of teacher and student score differences ensures distortion-robust learning across output scales (Hofstätter et al., 2020).
- Dynamic Loss Scheduling and Cross-Tokenizer Alignment: In LLMs/dLLMs, TIDAL adaptively modulates distillation strength by diffusion time and training progress, while sequence and vocabulary misalignments are mitigated by dynamic mapping or chunk-level Reverse CALM objectives (Chen et al., 16 Feb 2025, Zhang et al., 29 Apr 2026).
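The Margin-MSE objective in the list above can be sketched in a few lines: the student is trained to reproduce the teacher's score difference between a relevant and a non-relevant passage rather than the raw scores. The function and variable names are illustrative, and the scores are made-up numbers chosen to show the behavior under a constant offset.

```python
import numpy as np

def margin_mse(student_pos, student_neg, teacher_pos, teacher_neg):
    """MSE between student and teacher margins (positive minus negative score)."""
    student_margin = student_pos - student_neg
    teacher_margin = teacher_pos - teacher_neg
    return float(np.mean((student_margin - teacher_margin) ** 2))

# Teacher scores for three (query, positive, negative) triples.
teacher_pos = np.array([12.0, 9.5, 11.0])
teacher_neg = np.array([4.0, 6.5, 2.0])

# A student whose scores sit on a shifted range but keep the same margins.
student_pos = teacher_pos - 10.0
student_neg = teacher_neg - 10.0

loss_margin = margin_mse(student_pos, student_neg, teacher_pos, teacher_neg)
loss_pointwise = float(np.mean((student_pos - teacher_pos) ** 2))
```

The margin loss is zero for this student, whereas pointwise score regression would penalize it heavily. Note that the invariance shown here is to an additive offset; a student with a different multiplicative scale would still incur a margin loss.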
4. Training Pipelines, Loss Functions, and Implementation
Common frameworks optimize a multi-term loss over teacher-frozen, student-learned, or prompt-learned parameter sets:
- Vision (URFM + APT): the total loss combines an ArcFace loss (for face or general identity classification), a term that matches attention maps, and a term that aligns URFM features (Zhao et al., 2023).
- Frequency-domain: the objective combines a frequency-domain MSE, a KL divergence on softmax logits, and a standard cross-entropy (CE) loss (Yu et al., 28 Oct 2025).
- Dual-teacher video: the objective combines a discrepancy-weighted KL term, a masked residual MSE term, and a relational KD term (Peng et al., 12 Nov 2025).
- Text/LLM/dLLM:
  - Contextual Dynamic Mapping with entropy-weighted DTW and online vocabulary mapping aligns logit tensors before KL (Chen et al., 16 Feb 2025).
  - Preference optimization fuses supervised and odds-ratio contrastive loss over diverse reasoning traces (Singh et al., 29 Sep 2025).
  - TIDE applies temporally modulated KL, with CompDemo and Reverse CALM chunk-level matching for tokenizer-heterogeneous pipelines (Zhang et al., 29 Apr 2026).
- Retrieval: a margin-focused MSE for ranking, robust to output-scale mismatches (Hofstätter et al., 2020).
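As an illustration of the multi-term objectives listed above, the following sketch assembles a frequency-domain variant from three parts: cross-entropy on labels, a temperature-softened KL on logits, and an MSE between 2D Fourier amplitude spectra of feature maps. It is written under stated assumptions rather than taken from any cited paper: comparing amplitude spectra, the temperature of 4, the weights `alpha` and `beta`, and the premise that an adapter has already matched feature shapes are all illustrative choices.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def frequency_mse(student_feat, teacher_feat):
    """MSE between amplitude spectra of (batch, C, H, W) feature maps."""
    s_amp = np.abs(np.fft.fft2(student_feat, axes=(-2, -1)))
    t_amp = np.abs(np.fft.fft2(teacher_feat, axes=(-2, -1)))
    return float(np.mean((s_amp - t_amp) ** 2))

def kd_kl(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_p_t = log_softmax(teacher_logits / temperature)
    log_p_s = log_softmax(student_logits / temperature)
    kl = np.sum(np.exp(log_p_t) * (log_p_t - log_p_s), axis=-1)
    return float(np.mean(kl) * temperature ** 2)

def cross_entropy(logits, labels):
    """Mean negative log-likelihood of integer class labels."""
    log_p = log_softmax(logits)
    return float(-np.mean(log_p[np.arange(len(labels)), labels]))

def total_loss(s_feat, t_feat, s_logits, t_logits, labels, alpha=1.0, beta=1.0):
    """Weighted sum of the three terms; alpha and beta are hypothetical."""
    return (cross_entropy(s_logits, labels)
            + alpha * kd_kl(s_logits, t_logits)
            + beta * frequency_mse(s_feat, t_feat))

rng = np.random.default_rng(0)
t_feat = rng.normal(size=(2, 4, 8, 8))
s_feat = np.roll(t_feat, shift=3, axis=-1)   # circularly shifted copy
t_logits = rng.normal(size=(2, 10))
s_logits = t_logits.copy()
labels = np.array([1, 7])

shift_penalty = frequency_mse(s_feat, t_feat)
loss = total_loss(s_feat, t_feat, s_logits, t_logits, labels)
```

The shifted student incurs essentially no frequency penalty because a circular shift changes only the phase of the spectrum, not its amplitude. This is one simple sense in which a spectral representation sets aside spatial layout, though it also discards positional information that some tasks need.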
5. Empirical Performance and Practical Outcomes
CAKD methods consistently outperform classical KD and early cross-architecture baselines (e.g., OFA-KD, logit-only methods) across domains:
| Setting / Metric | Prior SOTA | CAKD Result | Gain | Reference |
|---|---|---|---|---|
| CIFAR-100 Top-1 (%) (UHKD) | 79.84 | 82.09 | +2.25 | (Yu et al., 28 Oct 2025) |
| ImageNet Top-1 (%) (RSD) | 74.7 | 77.0 | +2.3 | (Zhang et al., 29 Jul 2025) |
| LFW Face Verification (URFM+APT) | 99.52 | 99.61 | +0.09 | (Zhao et al., 2023) |
| HMDB51 (Video Dual-Teacher) | 73.66 | 77.06 | +3.4 | (Peng et al., 12 Nov 2025) |
| HumanEval Code (TIDE dLLM) | 32.3 (AR base) | 48.78 | +16.5 | (Zhang et al., 29 Apr 2026) |
Practical deployment is demonstrated on platforms such as NVIDIA Jetson Nano for medical imaging (Yilmaz et al., 23 Jun 2025), TFLite/ONNX/TensorRT for leaf disease detection on edge devices (Le et al., 2 May 2026), and low-memory SNNs for neuromorphic vision (Ye et al., 12 Jul 2025). Reported efficiency gains include 172× parameter reductions, 50–75% training time savings (CrossAdapt), and several-fold reductions in energy and latency (Le et al., 2 May 2026, Wu et al., 2 Feb 2026).
6. Extensions Beyond Vision: Language and Multimodal CAKD
- Tokenizer and Sequence-Agnostic KD: CDM leverages entropy-weighted dynamic mapping for cross-tokenizer distillation in LLMs, supporting instruction following, code, and math tasks in teacher-student pairs with minimal sequence and vocabulary overlap (Chen et al., 16 Feb 2025).
- Diffusion LLMs: TIDE enables the first cross-architecture distillation of dLLMs by temporally modulating the KL, enriching masked context, and aligning across non-overlapping vocabularies (Zhang et al., 29 Apr 2026).
- LLM Preference Optimization: ORPO preferentially transfers teacher reasoning traces, contrasting against student-generated negatives, and combining off- and on-policy negatives for improved generalization across architectures (Singh et al., 29 Sep 2025).
- ANN→SNN Distillation: Multi-stream, domain-aligned, and phased KD with semantic replacement yields large SNN performance gains on event-based vision (Ye et al., 12 Jul 2025).
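For the cross-tokenizer case in the list above, the core difficulty is that teacher and student split the same text into different token sequences, so their per-position logits cannot be compared directly. The sketch below uses plain dynamic time warping (DTW) to align two tokenizations. It is not the entropy-weighted variant attributed to CDM: the `token_cost` function (0 for identical tokens, 0.5 when one token contains the other, 1 otherwise) and the example tokens are assumptions made for illustration.

```python
import numpy as np

def token_cost(a, b):
    """Hypothetical string-based cost between two tokens."""
    if a == b:
        return 0.0
    if a in b or b in a:
        return 0.5
    return 1.0

def dtw_align(seq_a, seq_b, cost_fn):
    """Return (total cost, monotonic alignment path) between two sequences."""
    n, m = len(seq_a), len(seq_b)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost_fn(seq_a[i - 1], seq_b[j - 1])
            acc[i, j] = step + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [(acc[i - 1, j - 1], i - 1, j - 1),
                 (acc[i - 1, j], i - 1, j),
                 (acc[i, j - 1], i, j - 1)]
        _, i, j = min(moves)
    return float(acc[n, m]), path[::-1]

student_tokens = ["dist", "ill", "ation"]   # finer-grained tokenizer
teacher_tokens = ["distill", "ation"]       # coarser tokenizer

cost, path = dtw_align(student_tokens, teacher_tokens, token_cost)
```

In this example both `dist` and `ill` are mapped onto the teacher's `distill`, and `ation` maps onto `ation`. Given such a path, teacher distributions could be gathered per student position before applying a KL term, with a separate vocabulary mapping still required to compare the distributions themselves.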
7. Best Practices, Limitations, and Open Directions
- Loss Aggregation: Multi-term objectives (feature/intermediate, output/logit, relation) consistently yield stronger, more robust transfer than output-only KL (Yu et al., 28 Oct 2025, Yang et al., 10 Feb 2025).
- Prompt and Adapter Efficiency: Teacher-side prompt tuning and lightweight student-side adapters balance adaptability with parameter efficiency (Zhao et al., 2023, Lin et al., 15 Jan 2025, Peng et al., 12 Nov 2025).
- Alignment Granularity: Spatial, spectral, or token-wise mapping should be architecturally sensitive (e.g., Fourier transforms for global content, partial cross-attention (PCA) or FPE for spatial alignment) (Yu et al., 28 Oct 2025, Zhao et al., 2023, Le et al., 2 May 2026).
- Deployment and Generalization: Empirical best practices include matching spatial layouts, validating on realistic hardware, using small-scale hyperparameter search for cross-domain transfer, and verifying with interpretability tools (e.g., Grad-CAM) (Le et al., 2 May 2026, Yilmaz et al., 23 Jun 2025).
- Limitations: 2D/3D spatial layout is typically ignored in KD for non-vision tasks (Zhang et al., 29 Jul 2025); continual adaptation and extension to extremely heterogeneous pairs (e.g., 70B→0.5B LLMs, open-vocab scenarios) require further innovation (Singh et al., 29 Sep 2025, Zhang et al., 29 Apr 2026).
In summary, cross-architecture knowledge distillation represents a mature, theoretically grounded, and empirically validated paradigm for bridging the capabilities of heterogeneous model families, extending state-of-the-art performance to memory- or resource-constrained student models in both vision and language domains. Continued advances are expected in scaling to larger models, improved alignment of multi-modal or non-Euclidean representations, and integration with online, continual, and streaming-lifecycle settings.