Default MoE: Efficient Sparse Routing

Updated 17 April 2026
  • Default MoE is a sparse mixture-of-experts approach that uses EMA-based default outputs to deliver dense gradient feedback during backpropagation.
  • It addresses challenges like sparse router gradients, load imbalance, and training instability by substituting missing expert activations with running averages.
  • The method enhances training stability and efficiency with minimal overhead, scaling effectively across various model sizes and expert configurations.

Default MoE is a method for training sparse Mixture-of-Experts (MoE) models, in which a feedforward router assigns each input token to a subset of experts so that only part of the network is computed per token. The term "Default MoE" refers specifically to a dense backpropagation regime in which the outputs of experts that were not activated for a token are replaced by running averages ("default outputs") of each expert's past outputs. This addresses key challenges in conventional Top-K MoE routing, notably sparse router gradients, load imbalance, and training instability: the router receives gradient feedback from all experts through these approximated outputs, while expert computation in the forward pass stays sparse (Panda et al., 16 Apr 2025).

1. Default MoE Architecture and Standard Routing

In a typical sparse MoE layer, each input token $x \in \mathbb{R}^d$ is routed to $K$ out of $N$ experts by a router parameterized by a matrix $W \in \mathbb{R}^{N \times d}$:

  • The router produces routing probabilities $\pi = \mathrm{Softmax}(Wx) \in \mathbb{R}^N$ (the logits are $Wx$).
  • The $K$ experts with the largest $\pi_i$ are selected (index set $A = \mathrm{TopK}(\pi)$).
  • The MoE output is a weighted mixture of the selected experts' outputs: $y = \sum_{i \in A} \pi_i E_i(x)$.

Standard Top-K routing schemes activate only a few experts per token. Experts not selected for an input token do not participate in either the forward or backward pass for that token. In the backward pass, only the $K$ selected experts' outputs provide gradient signals to the router, resulting in a sparse gradient $\partial y / \partial \pi$ with zeros for all inactive experts. This sparse feedback causes slow or unstable router learning and load imbalance (where a few experts may dominate), and can yield suboptimal convergence (Panda et al., 16 Apr 2025, Li et al., 2024, Guo et al., 28 May 2025).
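The routing steps and the sparse router gradient described above can be sketched in dependency-free Python. This is a minimal illustration, not the cited implementation: the helper names (`softmax`, `matvec`, `topk_route`), the toy experts, and all numeric values are hypothetical, and the gradient shown is $\partial y / \partial \pi$ with the Top-K selection held fixed.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply an N x d matrix (list of rows) by a length-d vector."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def topk_route(W, x, experts, k):
    """Sparse Top-K MoE forward pass for one token.

    Returns the output y, the routing probabilities pi, the active index
    set A, and dy/dpi, whose rows are zero for every inactive expert.
    """
    pi = softmax(matvec(W, x))
    active = sorted(range(len(pi)), key=lambda i: pi[i], reverse=True)[:k]
    d_out = len(x)  # assumption: experts map R^d to R^d
    y = [0.0] * d_out
    dy_dpi = [[0.0] * d_out for _ in pi]  # row i holds dy/dpi_i
    for i in active:
        out = experts[i](x)   # only active experts are evaluated
        dy_dpi[i] = out       # dy/dpi_i = E_i(x) for i in A
        y = [yj + pi[i] * oj for yj, oj in zip(y, out)]
    return y, pi, sorted(active), dy_dpi

# Toy setup (all values hypothetical): N = 4 experts, d = 2, K = 2.
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
x = [2.0, 1.0]
y, pi, active, dy_dpi = topk_route(W, x, experts, k=2)
```

In this toy run, experts 2 and 3 are never evaluated, so their rows of `dy_dpi` stay at zero and the router learns nothing about them from this token.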

2. Dense Backpropagation via Default Outputs

Default MoE approximates a dense gradient for the router parameters without increasing forward-pass cost. It maintains, for each expert $i$, a "default" output $\hat{E}_i$, an exponential moving average (EMA) of that expert's observed outputs over training. At each training step $t$:

$$\hat{E}_i^{(t)} = \beta\, \hat{E}_i^{(t-1)} + (1 - \beta)\, \bar{E}_i^{(t)},$$

where $\bar{E}_i^{(t)}$ is the batch average of $E_i(x)$ over tokens routed to expert $i$, and $\beta$ is the EMA decay.
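A minimal sketch of the EMA bookkeeping follows, assuming the standard form "new default = decay × old default + (1 − decay) × batch mean". The function name `update_defaults`, the dictionary layout, the choice to leave an expert's default unchanged when it receives no tokens, and the decay value `0.5` are all illustrative assumptions rather than details taken from the cited work.

```python
def update_defaults(defaults, routed_outputs, beta):
    """One EMA step for the per-expert default vectors.

    defaults:        dict expert index -> current default vector
    routed_outputs:  dict expert index -> list of output vectors produced
                     for the tokens routed to that expert in this batch
    beta:            EMA decay in [0, 1)
    Experts that received no tokens keep their previous default (assumption).
    """
    new_defaults = {}
    for i, old in defaults.items():
        outs = routed_outputs.get(i, [])
        if not outs:
            new_defaults[i] = list(old)
            continue
        n = len(outs)
        # Column-wise mean over the tokens routed to expert i.
        batch_mean = [sum(col) / n for col in zip(*outs)]
        new_defaults[i] = [beta * o + (1.0 - beta) * m
                           for o, m in zip(old, batch_mean)]
    return new_defaults

defaults = {0: [0.0, 0.0], 1: [0.0, 0.0]}   # zero initialization
batch = {0: [[1.0, 2.0], [3.0, 4.0]]}       # expert 1 received no tokens
defaults = update_defaults(defaults, batch, beta=0.5)
```

With these toy inputs, expert 0's default moves halfway from zero toward its batch mean `[2.0, 3.0]`, while expert 1's default is left untouched.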

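The forward output per token is constructed as follows, where inactive experts contribute their default outputs $\hat{E}_i$:

$$y = \sum_{i \in A} \pi_i E_i(x) + \sum_{i \notin A} \pi_i \hat{E}_i.$$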

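For the backward pass, $\partial y / \partial \pi_i$ is given by:

$$\frac{\partial y}{\partial \pi_i} = \begin{cases} E_i(x), & i \in A, \\ \hat{E}_i, & i \notin A, \end{cases}$$

so inactive experts pass their default outputs $\hat{E}_i$ to the router. This supplies the router with dense gradient signals from all $N$ experts, approximating the full soft-gating gradient at minimal additional cost (Panda et al., 16 Apr 2025).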
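The combination of real and default outputs can be illustrated with a small sketch. It assumes the output is the sum of each expert's routing probability times either its real output (active experts) or its default vector (inactive experts); the function name `default_moe_token` and all numbers are hypothetical.

```python
def default_moe_token(pi, active, expert_outputs, defaults):
    """Combine real and default expert outputs for one token.

    pi:              routing probabilities, length N
    active:          indices of the K selected experts
    expert_outputs:  dict index -> E_i(x), present only for active experts
    defaults:        list of N default vectors (EMAs of past outputs)
    Returns (y, dy_dpi); dy_dpi[i] is E_i(x) if i is active, else the
    default vector, so every expert contributes a nonzero row in general.
    """
    active_set = set(active)
    n = len(pi)
    dy_dpi = [expert_outputs[i] if i in active_set else defaults[i]
              for i in range(n)]
    d = len(defaults[0])
    y = [sum(pi[i] * dy_dpi[i][j] for i in range(n)) for j in range(d)]
    return y, dy_dpi

# Toy values chosen so the arithmetic is exact in floating point.
pi = [0.5, 0.25, 0.125, 0.125]
active = [0, 1]
expert_outputs = {0: [2.0, 0.0], 1: [0.0, 4.0]}
defaults = [[1.0, 1.0], [1.0, 1.0], [8.0, 0.0], [0.0, 8.0]]
y, dy_dpi = default_moe_token(pi, active, expert_outputs, defaults)
```

In an autodiff framework one would typically treat the default vectors as constants, so gradients reach the router through every $\pi_i$ without flowing into the EMA buffers; that detail is an assumption of this sketch.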

3. Algorithmic Implementation and Overhead

The Default MoE algorithm proceeds as follows:

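  • Compute the routing probabilities $\pi = \mathrm{Softmax}(Wx)$ and select the active set $A = \mathrm{TopK}(\pi)$.
  • Evaluate only the active experts $E_i(x)$, $i \in A$, so forward computation stays sparse.
  • Update each expert's default output $\hat{E}_i$ as an EMA of the outputs it produced for the tokens routed to it.
  • Form the layer output from the real outputs of active experts and the default outputs of inactive experts, weighted by $\pi$.
  • Backpropagate, so that the router receives a gradient contribution from every expert.

Memory overhead is $O(Nd)$ per MoE layer for the EMAs, a small fraction of the model parameters, and runtime impact is negligible (a small overhead for 2B models, falling further for larger models) (Panda et al., 16 Apr 2025).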
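To give a feel for why the EMA buffers are cheap, the following back-of-the-envelope sketch compares their size with the expert parameters. It assumes one default vector of width `d_model` per expert per layer and a two-matrix feedforward expert without biases; the layer count and dimensions are hypothetical and are not the configuration from the cited work.

```python
def ema_overhead(num_layers, num_experts, d_model, d_ff):
    """Return (EMA floats, expert FFN parameters, ratio) for a stack of MoE layers.

    Assumes one d_model-sized default vector per expert per layer and a
    two-matrix FFN expert (d_model x d_ff and d_ff x d_model), biases ignored.
    """
    ema_floats = num_layers * num_experts * d_model
    expert_params = num_layers * num_experts * 2 * d_model * d_ff
    return ema_floats, expert_params, ema_floats / expert_params

# Hypothetical sizes, for illustration only.
ema, params, ratio = ema_overhead(
    num_layers=16, num_experts=8, d_model=1024, d_ff=4096)
```

Under these assumptions the ratio simplifies to `1 / (2 * d_ff)`, independent of the number of layers and experts, which is why the buffers are negligible next to the expert weights.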

4. Key Hyperparameters and Ablations

Critical hyperparameters include:

  • $N$: number of experts.
  • $K$: experts activated per token.
  • EMA decay $\beta$: the typical value depends on the configuration, and in some settings the optimal $\beta$ must be tuned over a range.
  • Learning rate: Default MoE remains stable at learning rates for which TopK requires smaller values.
  • Auxiliary load-balancing loss weight.

Ablations indicate Default MoE outperforms TopK routing across all tested model sizes (varying hidden dimension and parameter count) and expert configurations. Benefits are greatest at lower sparsity. Initializing the EMA at zero outperforms random initialization. Using the default outputs $\hat{E}_i$ in both forward and backward passes yields lower perplexity than restricting defaults to gradients. The approach scales well, with practical memory and computational costs (Panda et al., 16 Apr 2025).

5. Comparative Results and Specialization

Empirically, Default MoE improves training stability and model quality. On a billion-parameter-scale MoE, Default MoE increases final benchmark scores relative to TopK (+2.8%), reaches the target perplexity with fewer training tokens, enables stable training with higher learning rates, and outperforms architectures such as SparseMixer in early and mid training (Panda et al., 16 Apr 2025).

The introduction of dense, approximated router gradients in Default MoE leads to faster convergence, improved load balance, and avoidance of expert collapse. Default MoE retains the hardware and computational efficiency of sparse MoE, while addressing deficiencies of pure Top-K gating (Panda et al., 16 Apr 2025).

6. Relation to Auxiliary Losses and Recent Developments

Default MoE is orthogonal to loss-level strategies that address expert collapse and specialization:

  • Standard MoE methods employ an auxiliary load-balancing loss, $\mathcal{L}_{\text{aux}}$, that penalizes imbalance in the expert assignment rates $f_i$ and mean routing weights $p_i$, pushing both toward the uniform value $1/N$.
  • Recent work augments this objective with an orthogonality loss $\mathcal{L}_{o}$ (to diversify expert outputs) and a variance loss $\mathcal{L}_{v}$ (to increase per-expert routing-score variance), further improving specialization and downstream performance by up to 23.79% relative to classic baselines (Guo et al., 28 May 2025).
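As an illustration of the auxiliary load-balancing term mentioned in the list above, here is a sketch of one widely used formulation (the Switch-Transformer-style product of assignment fractions and mean routing probabilities, scaled by the number of experts). Whether the cited works use exactly this form is an assumption; the function name and the toy batches are hypothetical.

```python
def load_balancing_loss(probs, assignments, num_experts):
    """Switch-Transformer-style auxiliary loss: N * sum_i f_i * p_i.

    probs:        list of per-token routing probability vectors (length N each)
    assignments:  list of per-token lists of selected expert indices
    f_i is the fraction of routing slots assigned to expert i; p_i is the
    mean routing probability for expert i over the batch.
    """
    num_tokens = len(probs)
    total_slots = sum(len(chosen) for chosen in assignments)
    f = [0.0] * num_experts
    for chosen in assignments:
        for i in chosen:
            f[i] += 1.0 / total_slots
    p = [sum(tok[i] for tok in probs) / num_tokens for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Two tokens, two experts, top-1 routing.
balanced = load_balancing_loss(
    probs=[[0.5, 0.5], [0.5, 0.5]], assignments=[[0], [1]], num_experts=2)
collapsed = load_balancing_loss(
    probs=[[1.0, 0.0], [1.0, 0.0]], assignments=[[0], [0]], num_experts=2)
```

The loss reaches its minimum of 1 when assignments and probabilities are uniform and grows as routing collapses onto a single expert, which is the behavior the regularizer is meant to penalize.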

The Default MoE mechanism relates primarily to routing and gradient flow, while loss-level approaches manipulate training objectives directly. Both lines of development target improved expert utilization and adaptive, discriminative routing, but via distinct mechanisms: dense proxy gradients versus direct regularization (Panda et al., 16 Apr 2025, Guo et al., 28 May 2025).

7. Limitations and Future Directions

Default MoE's approximation depends on the quality of the EMA default outputs. A poorly tuned decay $\beta$ or pathological routing dynamics may impair the informativeness of the default vectors, suggesting further research on adaptive EMA strategies. The approach does not directly address distributed communication bottlenecks or expert locality, which remain active areas of optimization (e.g., as in LocMoE (Li et al., 2024)). Nevertheless, by supplying dense router gradients while preserving sparse computation, Default MoE constitutes a robust enhancement to standard Top-K MoE methodology, with demonstrated benefits for training large-scale sparse models (Panda et al., 16 Apr 2025).
