SMoE: Sparsely-Activated Mixture-of-Experts
- SMoE models are conditional neural networks that activate only a few experts per token, enabling scalable architectures with low per-token compute.
- Default MoE introduces an EMA-based default vector to densify router gradients, addressing training instability and expert load imbalance.
- Empirical results show improved convergence, stability, and performance on benchmarks like MMLU and LAMBADA with minimal computational overhead.
Sparsely-Activated Mixture-of-Experts (SMoE) models are neural network architectures exploiting conditional computation, where only a small, dynamically selected subset of network components (“experts”) is activated for each input or token. This approach facilitates scaling model capacity to hundreds of billions of parameters while maintaining constant or low per-token compute, enabling efficient deployment in large-scale language, vision, or multimodal systems.
1. Fundamentals and Architectural Principles
SMoE models interleave standard dense layers (e.g., attention, normalization) with Mixture-of-Experts (MoE) layers, in which a routing mechanism (router) selects experts from a pool of candidates for each input token. Each expert is typically a parameterized feedforward block (e.g., MLP, convolutional module, or attention head), with only the selected experts evaluated and used in computing the token’s representation. The router outputs selection and/or weighting probabilities per expert, and the layer output is a sparsely weighted sum of expert outputs.
Canonical SMoE layer operation:

$$y(x) = \sum_{i \in \mathcal{T}(x)} g_i(x)\, E_i(x),$$

where $\mathcal{T}(x)$ is the set of activated experts for input $x$, $g_i(x)$ is the router's gating weight for expert $i$, and $E_i(x)$ is that expert's output.
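For concreteness, the following is a minimal sketch of such a layer in PyTorch; the module structure, gating choice (softmax over the selected logits), and all names are illustrative assumptions, not taken from any specific system.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Minimal sparse MoE layer: each token is routed to K of N expert MLPs."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model); logits: (tokens, num_experts)
        logits = self.router(x)
        topk_logits, topk_idx = logits.topk(self.k, dim=-1)
        # Gate weights: softmax over the selected logits only (one common choice).
        gates = F.softmax(topk_logits, dim=-1)                      # (tokens, K)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in topk_idx[:, slot].unique():
                mask = topk_idx[:, slot] == int(e)                  # tokens whose slot-th choice is expert e
                # Only the selected experts are evaluated (conditional compute).
                out[mask] += gates[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out


# Usage: 16 tokens of width 512, each routed to 2 of 8 experts.
layer = TopKMoELayer(d_model=512, d_hidden=2048, num_experts=8, k=2)
y = layer(torch.randn(16, 512))
```

Per-token compute scales with $K$ rather than with the total number of experts, which is what makes the architecture scalable.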
Sparse Backward Updates and Training Instability
A core challenge of SMoE training is that the sparse activation in the forward pass is mirrored by sparse gradient flow in the backward pass. Specifically, only the parameters of the selected experts—and their selection probabilities in the router—receive nonzero gradients for each token. This gradient sparsity results in:
- Slow and noisy router training,
- Poor expert specialization,
- Load imbalance (over- and under-utilized experts),
- Reduced training stability, especially at higher learning rates.
These behaviors have been formally and empirically substantiated by analyses of SMoE variants relying on TopK routing (Panda et al., 16 Apr 2025).
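The sparsity of the router's learning signal can be observed directly with a toy autograd check (a sketch under the same illustrative gating formulation as above, where gates are a softmax over the selected logits only): the router rows of every non-selected expert receive exactly zero gradient for a given token.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_experts, k, d = 8, 2, 16
x = torch.randn(d)
router_w = torch.randn(num_experts, d, requires_grad=True)
experts = [torch.nn.Linear(d, d) for _ in range(num_experts)]

logits = router_w @ x                          # (num_experts,)
topk_vals, topk_idx = logits.topk(k)
gates = F.softmax(topk_vals, dim=-1)           # softmax over the selected logits only
y = sum(g * experts[int(i)](x) for g, i in zip(gates, topk_idx))
y.sum().backward()

# Gradient flows to the router only through the K selected logits:
# rows of router_w for non-selected experts receive exactly zero gradient.
print(router_w.grad.abs().sum(dim=-1))         # nonzero only at topk_idx
```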
2. Solutions for Stable and Efficient Training
Default MoE: Dense Gradients for Routers
To address sparse backward updates, Default MoE (Panda et al., 16 Apr 2025) introduces a lightweight, gradient-densifying mechanism. The key concept is to provide a surrogate output—a "default vector"—for each expert not activated in the current forward pass, ensuring dense router gradient coverage in the backward pass without computing unnecessary expert outputs. For each expert $i$, the default output $\bar{e}_i$ is maintained as an exponential moving average (EMA) over that expert's historical outputs:

$$\bar{e}_i \leftarrow \gamma\, \bar{e}_i + (1 - \gamma)\, \mathbb{E}_{x\,:\, i \in \mathcal{T}(x)}\big[E_i(x)\big].$$

During backpropagation, missing activations are filled with their corresponding $\bar{e}_i$, so that every expert provides a signal to the router for every token, while preserving the sparse forward computation:

$$\tilde{y}(x) = \sum_{i \in \mathcal{T}(x)} g_i(x)\, E_i(x) + \sum_{i \notin \mathcal{T}(x)} g_i(x)\, \bar{e}_i.$$
This dense surrogate gradient mechanism allows the router to receive training signals from all experts at every token, overcoming the classic sparsity-induced learning bottleneck.
Algorithmic Implementation
Per batch, the operational workflow comprises:
- Run the sparse TopK forward pass, computing real outputs only for the selected (active) experts.
- Update each expert's default vector via EMA, using only the real outputs that expert produced for the batch.
- During the backward pass, substitute the missing expert outputs with the corresponding EMA default vectors, so that gradients flow from all experts to the router.
- Parameter updates proceed as usual.
This approach incurs negligible memory overhead (one default vector per expert, each of the hidden/output dimension) and no extra forward computation, as only default-vector tracking is added.
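A compact way to realize this workflow in an autograd framework is sketched below. This is an illustrative reconstruction under assumptions (full-softmax gate probabilities, buffer-stored default vectors, straight-through cancellation of the surrogate's value), not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DefaultMoELayer(nn.Module):
    """Sketch of a TopK MoE layer with EMA default vectors densifying router gradients."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int,
                 k: int = 2, ema_decay: float = 0.99):
        super().__init__()
        self.k, self.ema_decay = k, ema_decay
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        # One default vector per expert, tracked as an EMA of its real outputs.
        self.register_buffer("defaults", torch.zeros(num_experts, d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        probs = F.softmax(self.router(x), dim=-1)            # dense gate probabilities (tokens, N)
        topk_idx = probs.topk(self.k, dim=-1).indices
        sparse_out = torch.zeros_like(x)
        surrogate = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            gate_e = probs[:, e].unsqueeze(-1)               # (tokens, 1)
            active = (topk_idx == e).any(dim=-1)             # tokens routed to expert e
            if active.any():
                real = expert(x[active])                     # sparse forward: only active experts run
                sparse_out[active] += gate_e[active] * real
                if self.training:                            # EMA update from real outputs only
                    self.defaults[e].mul_(self.ema_decay).add_(
                        (1.0 - self.ema_decay) * real.detach().mean(dim=0))
            # Pair the dense gate of every *inactive* token with the EMA default
            # (clone so later in-place EMA updates don't clash with saved tensors).
            inactive = (~active).unsqueeze(-1).float()
            surrogate = surrogate + inactive * gate_e * self.defaults[e].clone()
        # The surrogate cancels out of the forward value but supplies the router
        # with a gradient from every expert for every token.
        return sparse_out + surrogate - surrogate.detach()
```

With this construction, a `loss.backward()` through the layer yields nonzero router gradients for all experts at every token, while the forward value and FLOPs remain those of the sparse TopK layer.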
3. Router Gradient Formulation and Mathematical Details
Mathematically, for each token $x$ and expert $i$, the gradient reaching the router through its output mixing weight $g_i(x)$ takes the form

$$\frac{\partial \mathcal{L}}{\partial g_i(x)} =
\begin{cases}
\nabla_{y}\mathcal{L}^{\top} E_i(x), & i \in \mathcal{T}(x),\\
\nabla_{y}\mathcal{L}^{\top} \bar{e}_i, & i \notin \mathcal{T}(x).
\end{cases}$$

In classic TopK routing, the term for $i \notin \mathcal{T}(x)$ would be $0$, leading to sparse and potentially unrepresentative update signals.
The EMA decay $\gamma$ is a hyperparameter controlling the temporal smoothing of the default vectors; tuning is necessary for different levels of sparsity or across model widths.
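A minimal autograd check of the case $i \notin \mathcal{T}(x)$ is shown below (purely illustrative names; the straight-through cancellation mirrors the sketch above): the inactive expert's gate receives exactly the gradient $\nabla_y\mathcal{L}^{\top}\bar{e}_i$.

```python
import torch

# Check: with y = y_sparse + g * e_bar - (g * e_bar).detach(), the gradient of
# the loss w.r.t. an inactive expert's gate g equals grad_y(L) . e_bar, while
# the forward value of y is unchanged (the surrogate cancels out).
torch.manual_seed(0)
d = 4
y_sparse = torch.randn(d)                   # output of the active experts (constant here)
e_bar = torch.randn(d)                      # EMA default vector of an inactive expert
g = torch.tensor(0.3, requires_grad=True)   # that expert's dense gate probability

surrogate = g * e_bar
y = y_sparse + surrogate - surrogate.detach()
loss = (y ** 2).sum()
loss.backward()

grad_y = 2 * y.detach()                     # analytic dL/dy for this toy loss
print(torch.allclose(g.grad, grad_y @ e_bar))   # True: router sees an e_bar-weighted signal
```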
4. Empirical Results and Performance Impact
Extensive benchmarks on LLM and general reasoning datasets (e.g., MMLU, LAMBADA, ARC, MathQA) with 2B-parameter models and 8–32 experts, run in an extremely sparse regime (only a small fraction of experts active per token), demonstrate that Default MoE outperforms classic TopK routing in every setting tested. For example, after 320B pretraining tokens, Default MoE delivered an average 2.8% higher accuracy across key evaluations.
Key metrics:
| Task | TopK MoE | Default MoE |
|---|---|---|
| MathQA | 25.8 | 25.9 |
| LogiQA | 26.1 | 27.2 |
| MMLU | 31.8 | 32.5 |
| LAMBADA | 38.8 | 41.3 |
| ARC | 45.9 | 47.8 |
| PIQA | 71.1 | 71.7 |
| Average | 42.2 | 43.4 |
Convergence to a fixed perplexity is 9% faster. The approach is robust to model size, remaining effective for models with more than 7B parameters, where the overhead for default vector storage or computation becomes negligible relative to total memory/computation. Across all model widths, EMA decay rates, sparsity levels, and architectures tested—including high-expert-count, highly sparse regimes—Default MoE outperformed both TopK and gradient approximation alternatives (e.g., SparseMixer).
5. Scaling, Stability, and Deployment Considerations
The Default MoE strategy is agnostic to model architecture and can be integrated into existing SMoE codebases with minimal modification, as the only required addition is the EMA tracking and substitution logic on the expert outputs during backpropagation. As model size increases, the additional per-expert memory overhead becomes amortized.
Key operational advantages include:
- Increased tolerance for higher learning rates (reducing divergence and load spikes),
- Enhanced expert load balancing, preventing expert underuse or collapse,
- Reduction in early training instability and improved convergence profiles.
Hyperparameters such as the EMA decay $\gamma$ can be tuned readily; the method is robust across reasonable ranges, particularly as sparsity increases.
6. Broader Implications and Future Directions
The introduction of Default MoE demonstrates that the principal obstacle for stable, high-quality training in sparsely-activated expert architectures is not the sparse forward pass but the sparse gradient signal received by the router. Lightweight surrogate gradient schemes, such as EMA-based default outputs, can enable near-dense learning dynamics without sacrificing the computational efficiency of sparse activation. This insight can be extended to other conditional computation designs or alternative MoE architectures.
For practitioners, integrating Default MoE can be expected to yield immediate gains in both pretraining and finetuning performance, faster convergence, and greater architectural robustness, particularly in extremely sparse or high-expert-count settings.
Summary Table
| Mechanism | Core Idea | Impact |
|---|---|---|
| TopK Routing | Sparse forward and backward; only selected experts receive gradients | Inefficient signals, unstable, slow |
| Default MoE | EMA default vector for backprop | Dense router gradients, stable, accurate |
| Dense MoE | All experts always active | High compute/memory, always dense |
Default MoE overcomes the sparse gradient bottleneck in SMoEs by introducing an EMA-based dense gradient surrogate without extra forward compute, improving stability and training efficiency, and achieving consistently superior benchmark performance at negligible additional cost (Panda et al., 16 Apr 2025).