MoE Distillation: Aggregating Expert Knowledge

Updated 6 April 2026

Mixture-of-Experts Distillation is a technique that transfers the diverse, specialized knowledge from sparse expert subnetworks to a student model while addressing routing challenges.
It employs strategies like Knowledge Augmentation and Student-Aware Router to integrate both activated and non-activated expert signals for robust knowledge transfer.
Applications across language, vision, and federated learning demonstrate its effectiveness in improving model accuracy, efficiency, and overall robustness.

Mixture-of-Experts Distillation is a paradigm that transfers or aggregates knowledge from neural network models architected as mixtures of experts (MoE), aiming to compress, adapt, or generalize large expert-based systems into more efficient or more robust student models. The approach addresses three challenges: the sparse expert routing in MoE models, the loss of knowledge held by non-activated experts under standard knowledge distillation, and the need to synthesize diverse, specialized knowledge from multiple experts or even heterogeneous models. Advanced forms of MoE distillation span domains from language modeling and vision to federated and cross-modal learning, incorporating tailored aggregation, routing, and loss design to maximize the transfer of the MoE's distinct inductive benefits.

1. Fundamentals of Mixture-of-Experts Architectures

Mixture-of-Experts (MoE) architectures combine multiple expert subnetworks with a gating or router network that assigns each input to one or more experts. The general MoE layer formalism is:

For input representation $x \in \mathbb{R}^d$:
- The router computes logits $h(x) \in \mathbb{R}^M$ for $M$ experts, normalized to gating weights $g_i(x) = \mathrm{softmax}(h(x))_i$.
- In sparse MoE (the most common variant), Top-$k$ routing keeps only the $k$ highest logits $h(x)_i$ per input; the remaining experts are masked out before the softmax, so their gating weights are zero.
- Each expert $E_i$ produces $E_i(x) \in \mathbb{R}^d$.
- The MoE output is $y(x) = \sum_{i=1}^M g_i(x) \cdot E_i(x)$ (Kim et al., 18 Feb 2025).
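The layer formalism above can be illustrated with a minimal pure-Python sketch. It is not taken from the cited work: the function names, the toy scaling "experts", and passing the router logits in directly (rather than computing them from $x$) are all simplifying assumptions, and $k \ge 1$ is assumed.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gates(logits, k):
    """Sparse gating: softmax over the k largest logits, zero weight elsewhere."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Masking with -inf removes an expert from the softmax (exp(-inf) == 0.0).
    masked = [logits[i] if i in top else -math.inf for i in range(len(logits))]
    return softmax(masked)

def moe_forward(x, router_logits, experts, k):
    """y(x) = sum_i g_i(x) * E_i(x); experts with zero gate are never evaluated."""
    gates = top_k_gates(router_logits, k)
    y = [0.0] * len(x)
    for g, expert in zip(gates, experts):
        if g == 0.0:
            continue  # skipped expert: this is the sparse-compute saving
        out = expert(x)
        y = [acc + g * o for acc, o in zip(y, out)]
    return y, gates

# Hypothetical toy experts: expert i simply scales its input by (i + 1).
experts = [lambda v, s=s: [s * t for t in v] for s in (1.0, 2.0, 3.0, 4.0)]
y, gates = moe_forward([1.0, -1.0], [2.0, 0.5, 1.0, -1.0], experts, k=2)
# Only experts 0 and 2 (the two largest logits) receive nonzero gate weight.
```

Masking with $-\infty$ rather than zero is what makes the unselected gating weights exactly zero after the softmax.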

MoE enables parameter-efficient scaling, high expressivity, and subnetwork specialization. However, such architectures inflate serving costs and induce loss of knowledge in conventional knowledge distillation (KD) since standard KD only distills the activated expert ensemble per input, ignoring the rich—and empirically useful—knowledge encapsulated in non-activated experts (Kim et al., 18 Feb 2025).

2. MoE Distillation: Objectives, Failures, and Challenges

Standard knowledge distillation minimizes the Kullback–Leibler divergence between the teacher's predictive distribution, $p_T(\cdot \mid x)$, and the student's, $p_S(\cdot \mid x)$. In MoE settings, conventional KD only considers the sparse ensemble output per input, discarding any contribution from non-activated experts:

In Top-$k$ routing, only the $k$ selected experts' parameters participate in $y(x)$. Non-activated ($g_i(x) = 0$) experts have zero gradient and no distillation signal.
Empirical measurements in Llama-MoE models show that the sum of gate probabilities for the $k$ activated experts often falls below 50%, implying that over half the "useful" expert mass is excluded from the targets available to the student (Kim et al., 18 Feb 2025).
This leads to the loss of complementary knowledge, lower robustness, and impaired student performance, especially when distilling from highly specialized, heterogeneous, or multi-domain MoE teachers.
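To make the "excluded expert mass" observation concrete, the sketch below computes the share of the full-softmax gate probability that falls on the $k$ largest experts. The logits are made-up illustrative values, not measurements from Llama-MoE, and the function name is hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate_mass(router_logits, k):
    """Fraction of the full softmax gate probability held by the k largest experts."""
    probs = softmax(router_logits)
    return sum(sorted(probs, reverse=True)[:k])

# Hypothetical router logits for 8 experts with a fairly flat distribution.
logits = [0.9, 0.7, 0.4, 0.3, 0.2, 0.1, 0.0, -0.1]
mass = top_k_gate_mass(logits, k=2)
# Here the two activated experts hold well under half of the gate mass,
# so a student distilled from y(x) alone never sees the remaining experts.
```

The flatter the router distribution, the smaller this fraction becomes, which is the regime in which conventional KD discards the most.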

Key challenges in MoE distillation include:

Aggregating and exposing all experts’ knowledge for transfer.
Retaining or emulating the diversity and specialization intrinsic to MoE.
Enabling computationally tractable and effective KD for both dense and sparse student architectures, across domains from language to vision and federated learning.

3. MoE-Specific Knowledge Distillation Methods

Recent studies introduce strategies that explicitly extract or aggregate knowledge from all experts—or ensembles of expert models—using MoE-specific distillation losses and routing mechanisms.

A. Knowledge Augmentation (KA) (Kim et al., 18 Feb 2025):

Augments teacher signals by sampling different expert subsets at each forward pass.
With a given probability, sample $k$ experts from the gate distribution; with the complementary probability, select the top-$k$ experts.
Average the softmax outputs across several such augmentations to form the student target.
The final loss interpolates between cross-entropy (CE) and the KL divergence to the augmented MoE output, i.e. a weighted combination of the form $(1-\alpha)\,\mathcal{L}_{\mathrm{CE}} + \alpha\,\mathcal{L}_{\mathrm{KL}}$ for an interpolation weight $\alpha \in [0, 1]$.
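The Knowledge Augmentation steps listed above can be sketched as follows. This is an illustrative reading, not the authors' implementation: the teacher is reduced to a single MoE layer whose experts emit class logits directly, sampling is done without replacement, and the names `aug_prob`, `rounds`, and `alpha` are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_experts(gate_probs, k, aug_prob, rng):
    """With probability aug_prob, sample k distinct experts from the gate
    distribution; otherwise fall back to the usual top-k choice."""
    idx = list(range(len(gate_probs)))
    if rng.random() >= aug_prob:
        return sorted(idx, key=lambda i: gate_probs[i], reverse=True)[:k]
    chosen = []
    for _ in range(k):
        remaining = [i for i in idx if i not in chosen]
        weights = [gate_probs[i] for i in remaining]
        chosen.append(rng.choices(remaining, weights=weights, k=1)[0])
    return chosen

def teacher_probs(expert_logits, gate_probs, chosen):
    """Class distribution of the MoE when only `chosen` experts are active."""
    z = sum(gate_probs[i] for i in chosen)  # renormalize gates over the subset
    n_classes = len(expert_logits[0])
    mixed = [sum(gate_probs[i] / z * expert_logits[i][c] for i in chosen)
             for c in range(n_classes)]
    return softmax(mixed)

def augmented_target(expert_logits, gate_probs, k, aug_prob, rounds, rng):
    """Average the teacher's softmax output over several expert subsets."""
    avg = [0.0] * len(expert_logits[0])
    for _ in range(rounds):
        chosen = pick_experts(gate_probs, k, aug_prob, rng)
        p = teacher_probs(expert_logits, gate_probs, chosen)
        avg = [a + q / rounds for a, q in zip(avg, p)]
    return avg

def ka_loss(student_logits, label, target, alpha):
    """(1 - alpha) * CE(label) + alpha * KL(target || student)."""
    s = softmax(student_logits)
    ce = -math.log(s[label])
    kl = sum(t * math.log(t / q) for t, q in zip(target, s) if t > 0.0)
    return (1.0 - alpha) * ce + alpha * kl
```

Setting `aug_prob=0.0` recovers conventional KD against the top-$k$ output, so the sketch makes the difference between the two targets easy to inspect.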

B. Student-Aware Router (SAR) (Kim et al., 18 Feb 2025):

Updates the router network to align routing probabilities with the student's output distribution by optimizing a router-alignment loss plus an auxiliary load-balance loss.
Student KD then proceeds using the router-adjusted, all-experts-active MoE output as the target.
Each training step optimizes the router and student objectives jointly.
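The text does not specify the form of the auxiliary load-balance loss. One widely used choice, assumed here purely for illustration, is the Switch Transformer-style penalty $M \sum_i f_i P_i$, where $f_i$ is the fraction of inputs whose top-1 expert is $i$ and $P_i$ is the mean router probability of expert $i$. Whether SAR uses this exact term is not confirmed by the source.

```python
def load_balance_loss(batch_gate_probs):
    """Switch-style balance penalty: n_experts * sum_i f_i * P_i.

    batch_gate_probs: one full-softmax gate distribution per input.
    The value is close to 1.0 for balanced routing and grows as routing collapses.
    """
    n_tokens = len(batch_gate_probs)
    n_experts = len(batch_gate_probs[0])
    dispatch = [0.0] * n_experts   # f_i: share of inputs sent to expert i
    mean_prob = [0.0] * n_experts  # P_i: average gate probability of expert i
    for probs in batch_gate_probs:
        top1 = max(range(n_experts), key=lambda i: probs[i])
        dispatch[top1] += 1.0 / n_tokens
        for i, p in enumerate(probs):
            mean_prob[i] += p / n_tokens
    return n_experts * sum(f * p for f, p in zip(dispatch, mean_prob))

balanced = load_balance_loss([[0.6, 0.4], [0.4, 0.6]])   # each expert gets one input
collapsed = load_balance_loss([[0.9, 0.1], [0.8, 0.2]])  # expert 0 gets everything
```

In a router update of the kind described above, a term like this would be added, with a small weight, to the loss that aligns the router with the student.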

C. Ensemble and Aggregation Methods:

Some frameworks aggregate distinct domain experts via a transformer-based aggregator (Meta-DMoE) (Zhong et al., 2022), instance-level gating networks (MST-Distill) (Li et al., 9 Jul 2025), or class-based aggregation with meta-model integration (Mosaic) (Liu et al., 26 May 2025).
MoEKD (Awal et al., 13 Mar 2026) combines top-$k$ expert selection from specialized teachers with router-learned aggregation for robustness and accuracy in code models.
AMoE (Chaybouti et al., 23 Dec 2025) applies multi-teacher vision distillation, aligning the geometric structure of each teacher space via relation-based losses and hierarchical data sampling.

4. Aggregation, Routing, and Mutual Distillation Mechanisms

MoE distillation literature systematically develops various routing and aggregation strategies to maximize the knowledge transfer:

Method	Aggregation/Selection Mechanism	Distillation Signal
Knowledge Augmentation	Repeated sampling/masking to cover all experts	Averaged soft targets over multiple sampling rounds
Student-Aware Router	Mutable router trained to match student needs	Router-aligned all-expert output
Meta-DMoE	Transformer-based fusion over per-domain experts	Fused feature-wise student targets
MST-Distill	Instance-level GateNet over cross-/multi-modal teachers	Weighted KL to top-$k$ selected teachers
MoEKD	Softmax-router and top-$k$ aggregation	Aggregated logits from selected experts
MoDE	Mutual pairwise (or to average) distillation among experts	Auxiliary MSE loss on expert outputs
Mosaic	Per-class aggregation and meta-model integration	KL and CE to synthetic MoE teacher logits
RbM (Graph KD)	Cosine-similarity gating, learned expert centers	CE, KL to graph teacher, with gating regularizers

These methods generalize beyond vanilla dense distillation—often yielding hybrid or "meta-distillation" objectives that blend soft targets constructed from ensemble, all-expert, or aggregated router-driven outputs, rather than static teacher predictions.
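As a generic illustration of a router-driven soft target of the kind summarized in the table, the sketch below computes a gate-weighted KL loss against the top-$k$ selected teachers. It is a simplified reading of the "weighted KL to top-$k$ selected teachers" entry, not the MST-Distill or MoEKD implementation; all names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions with q > 0 wherever p > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def routed_multi_teacher_kl(student_logits, teacher_logits, gate_logits, k):
    """sum_i w_i * KL(teacher_i || student) over the k teachers with the
    largest gate logits, with weights w_i renormalized over that subset."""
    order = sorted(range(len(gate_logits)),
                   key=lambda i: gate_logits[i], reverse=True)[:k]
    weights = softmax([gate_logits[i] for i in order])
    s = softmax(student_logits)
    return sum(w * kl(softmax(teacher_logits[i]), s)
               for w, i in zip(weights, order))

# Hypothetical per-instance gate scores over three teachers.
teachers = [[3.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 1.0, 1.0]]
loss = routed_multi_teacher_kl([1.0, 0.5, 0.0], teachers, [0.2, 1.5, 0.9], k=2)
```

Because the gate scores are supplied per instance, different inputs can be distilled from different teacher subsets, which is the property the table attributes to instance-level gating.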

5. Applications, Empirical Gains, and Domain-Specific Findings

MoE distillation has demonstrated empirical benefit across natural language processing, computer vision, code intelligence, graph learning, and behavior modeling:

LLMs: MoE-distilled dense students match or even exceed vanilla dense-to-dense KD performance when KA or SAR is applied, securing ROUGE-L improvements of up to +0.84 over the best conventional baselines for Llama-MoE teachers (Kim et al., 18 Feb 2025). MoEKD yields up to 13% gain in vulnerability detection accuracy and up to 35.8% robustness gain over single-teacher KD (Awal et al., 13 Mar 2026). MoEBERT accelerates inference and increases accuracy on GLUE/SQuAD by distilling BERT-base into a MoE (Zuo et al., 2022).
Vision Models: AMoE leverages ARKD and multi-teacher distillation to surpass previous vision foundation models on image–text classification and retrieval, with strong performance at a compute fraction of previous baselines and improved k-NN clustering (Chaybouti et al., 23 Dec 2025).
Cross-Modal and Federated Learning: MST-Distill establishes instance-wise adaptive routing and plug-in masking, achieving superior transfer in challenging cross-modal distillation tasks (Li et al., 9 Jul 2025). Mosaic demonstrates strong performance in federated, heterogeneous settings, outperforming FL and data-free KD baselines by up to 15% (Liu et al., 26 May 2025).
Graph Learning: RbM (Routing-by-Memory) consistently outperforms dense MLP distillation in node classification by enforcing expert territory and explicit specialization (Rumiantsev et al., 2024).
Diffusion Policies: Variational Diffusion Distillation (VDD) leverages a variational inference upper bound to distill behaviorally diverse diffusion models into MoEs, achieving rapid, tractable inference and matching teacher task entropy across nine robot control domains (Zhou et al., 2024).

6. Theoretical Analyses, Ablations, and Best Practices

A salient theoretical finding is that non-activated experts in sparse MoE contain complementary information; when their knowledge is made available through expert-augmented losses or aggregation, specialized and robust students are achievable (Kim et al., 18 Feb 2025, Awal et al., 13 Mar 2026).

Moderate-strength mutual distillation among experts ("MoDE") improves each expert's domain-specific test performance and overall MoE generalization, but excessive strength collapses specialization (Xie et al., 2024).
Instance-level routing (e.g., GateNet, SAR) is consistently superior to uniform or static selection, yielding better transfer and dynamic adaptation to student needs (Li et al., 9 Jul 2025, Kim et al., 18 Feb 2025).
Data curation, as in hierarchical clustering for OpenLVD, enhances coverage and efficiency in large-scale MoE vision distillation (Chaybouti et al., 23 Dec 2025).
Ensemble MoE-based teachers in federated and data-free settings (Mosaic) outperform parameter averaging or single-teacher approaches in both predictive mean and robustness, with meta-models further boosting gain in highly heterogeneous regimes (Liu et al., 26 May 2025).
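The mutual distillation trade-off noted above can be made concrete with a small sketch of the "distill to the average" variant: an auxiliary MSE term pulls each expert's output toward the mean of all experts and is added to the task loss with a weight. The function names and the weight parameter are hypothetical, and the exact formulation in Xie et al. may differ.

```python
def mutual_distillation_loss(expert_outputs):
    """Mean squared distance between each expert's output and the experts' average."""
    m = len(expert_outputs)
    d = len(expert_outputs[0])
    mean = [sum(out[j] for out in expert_outputs) / m for j in range(d)]
    sq = sum((out[j] - mean[j]) ** 2 for out in expert_outputs for j in range(d))
    return sq / (m * d)

def mode_objective(task_loss, expert_outputs, weight):
    """Task loss plus the weighted mutual-distillation term."""
    return task_loss + weight * mutual_distillation_loss(expert_outputs)

# Two hypothetical experts that disagree strongly on the same input.
outputs = [[0.0, 0.0], [2.0, 2.0]]
aux = mutual_distillation_loss(outputs)
total = mode_objective(task_loss=0.7, expert_outputs=outputs, weight=0.1)
```

The auxiliary term is zero only when all experts produce identical outputs, which is why a large weight drives the experts toward one another and erodes specialization, while a moderate weight merely shares knowledge between them.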

Hyperparameters matter: a small augmentation ratio, a small number of augmentation rounds (up to 4), and a moderate mutual distillation weight avoid collapse and maximize performance (Kim et al., 18 Feb 2025, Xie et al., 2024).

7. Limitations, Open Problems, and Directions

Mixture-of-Experts distillation remains an active area of research, with several open issues:

Task-level or static routing, while efficient, may under-exploit cross-task or fine-grained knowledge sharing (Kudugunta et al., 2021).
Routing fairness and expert load normalization remain unresolved: overspecialized or underused experts may degrade achievable performance in highly imbalanced regimes.
Scalability of mutual distillation to large, hierarchical, or dynamically growing expert sets bears further exploration (Xie et al., 2024).
There are trade-offs between inference speed, parameter efficiency, and knowledge preservation: some approaches retain full MoE model size but only reduce per-sample computation (e.g., hashing in MoEBERT) (Zuo et al., 2022).
Integrating richer intermediate targets (e.g., attention maps, layerwise features), augmenting with adversarial or entropy-based regularization (Awal et al., 13 Mar 2026), and unifying diffusion-based and classical MoE distillation are promising directions.

A plausible implication is that further advances in MoE-specific aggregation, dynamic routing, and mutual distillation will render MoE distillation a standard approach for scaling, compressing, and adapting multi-expert models in both research and deployment.