Default MoE: Efficient Sparse Routing

Updated 17 April 2026

Default MoE is a sparse mixture-of-experts approach that uses EMA-based default outputs to deliver dense gradient feedback during backpropagation.
It addresses challenges like sparse router gradients, load imbalance, and training instability by substituting missing expert activations with running averages.
The method enhances training stability and efficiency with minimal overhead, scaling effectively across various model sizes and expert configurations.

A Default MoE method is an approach to training sparse Mixture-of-Experts (MoE) models in which a feedforward router assigns each input token to a subset of experts, leveraging a sparsely-gated computation for improved efficiency. The term "Default MoE" has a specific technical meaning indicating a dense backpropagation regime where missing expert activations in the backward pass are substituted with running averages ("default outputs") of each expert’s past outputs. This strategy addresses key challenges in conventional Top-K MoE routing, notably sparse router gradients, load imbalance, and training instability, by supplying the router network with dense gradient feedback from all experts using approximated outputs, while maintaining sparse forward computation (Panda et al., 16 Apr 2025).

1. Default MoE Architecture and Standard Routing

In a typical sparse MoE layer, input tokens $x \in \mathbb{R}^d$ are routed to $K$ out of $N$ experts by a router parameterized by a matrix $W \in \mathbb{R}^{N \times d}$:

The router produces routing probabilities $\pi = \mathrm{Softmax}(Wx) \in \mathbb{R}^N$ (the logits are $Wx$).
The $K$ experts with the largest $\pi_i$ are selected (index set $A = \mathrm{TopK}(\pi)$).
The MoE output is a weighted mixture of the selected expert responses: $y = \sum_{i \in A} \pi_i E_i(x)$.
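
The three routing steps above can be sketched for a single token in plain Python. This is a minimal illustration, not the authors' implementation: the helper names (`softmax`, `matvec`, `topk_moe`), the toy router matrix, and the scaling "experts" are all hypothetical, and expert outputs are assumed to have the same dimension $d$ as the input.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply an N x d matrix (list of rows) by a length-d vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def topk_moe(x, W, experts, k):
    """Return (y, pi, active) for one token under standard Top-K routing."""
    pi = softmax(matvec(W, x))                      # routing probabilities
    active = sorted(range(len(pi)), key=lambda i: pi[i], reverse=True)[:k]
    y = [0.0] * len(x)                              # assumes output dim == d
    for i in active:                                # only K experts are evaluated
        out = experts[i](x)
        y = [yj + pi[i] * oj for yj, oj in zip(y, out)]
    return y, pi, active

# Hypothetical toy setup: N = 3 experts that simply scale the input, d = 2.
experts = [lambda v, s=s: [s * vj for vj in v] for s in (1.0, 2.0, 3.0)]
W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
y, pi, active = topk_moe([2.0, 1.0], W, experts, k=2)
```

With this input the router logits are `[2, 1, -3]`, so experts 0 and 1 are selected and expert 2 is never evaluated.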

Standard Top-K routing schemes activate only a few experts per token. Experts not selected for an input token do not participate in either the forward or backward pass for that token. In the backward pass, only the $K$ selected experts' outputs provide gradient signals to the router, resulting in a sparse gradient $\partial y / \partial \pi$ with zeros for all inactive experts. This sparse feedback causes slow or unstable router learning, load imbalance (where few experts may dominate), and can yield suboptimal convergence (Panda et al., 16 Apr 2025, Li et al., 2024, Guo et al., 28 May 2025).
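
The sparsity of the router gradient follows directly from the output formula $y = \sum_{i \in A} \pi_i E_i(x)$. As a short worked step, assuming the selected set $A$ is treated as fixed when differentiating (the Top-K selection itself is not differentiable):

```latex
\frac{\partial y}{\partial \pi_i} =
\begin{cases}
E_i(x), & i \in A, \\
0,      & i \notin A,
\end{cases}
\qquad i = 1, \dots, N .
```

Only $K$ of the $N$ entries are nonzero, so the router receives no signal about how the unselected experts would have responded to the token.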

2. Dense Backpropagation via Default Outputs

Default MoE approximates a dense gradient for router parameters without increasing forward pass cost. It maintains, for each expert $i$, a "default" output $\hat{E}_i$, an exponential moving average (EMA) of observed outputs over training. For each training step $t$: $\hat{E}_i^{(t)} = \beta \hat{E}_i^{(t-1)} + (1 - \beta)\, \bar{E}_i^{(t)}$, where $\bar{E}_i^{(t)}$ is the batch average of the outputs of expert $i$ over the tokens routed to it, and $\beta$ is the EMA decay.
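
A minimal sketch of the EMA bookkeeping for one expert follows. The function name `update_default`, the decay value `beta = 0.5` (chosen only to keep the arithmetic readable), and the choice to leave the default unchanged when no tokens are routed to the expert are assumptions for illustration, not details taken from the source.

```python
def update_default(default, routed_outputs, beta):
    """EMA update of one expert's default output.

    default:        current default vector (length d)
    routed_outputs: outputs of this expert for the tokens routed to it
    beta:           EMA decay applied to the previous default
    """
    if not routed_outputs:
        # Assumption: with no routed tokens this step, keep the old default.
        return list(default)
    n = len(routed_outputs)
    batch_mean = [sum(col) / n for col in zip(*routed_outputs)]
    return [beta * d + (1.0 - beta) * m for d, m in zip(default, batch_mean)]

default = [0.0, 0.0]  # zero initialisation
default = update_default(default, [[1.0, 2.0], [3.0, 4.0]], beta=0.5)
```

Here the batch mean is `[2.0, 3.0]`, so one update from a zero default gives `[1.0, 1.5]`.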

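The forward output per token is constructed as: $y = \sum_{i \in A} \pi_i E_i(x) + \sum_{i \notin A} \pi_i \hat{E}_i$, where $\hat{E}_i$ is the default output of expert $i$.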

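For the backward pass, the router gradient $\partial y / \partial \pi_i$ is given by: $\frac{\partial y}{\partial \pi_i} = \begin{cases} E_i(x) & i \in A \\ \hat{E}_i & i \notin A \end{cases}$, where $\hat{E}_i$ is the default output of expert $i$. This supplies the router with dense gradient signals from all $N$ experts, approximating the full soft-gating gradient at minimal additional cost (Panda et al., 16 Apr 2025).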
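
The substitution of default outputs for inactive experts can be sketched as below. This is an illustrative single-token version under stated assumptions: `default_moe_forward` is a hypothetical name, the routing weights and vectors are made-up toy values, and the returned per-expert vectors are the entries of $\partial y / \partial \pi$ with the active set held fixed.

```python
def default_moe_forward(pi, active, expert_outputs, defaults):
    """Combine real outputs (active experts) with default outputs (the rest).

    pi:             routing weights, length N
    active:         set of selected expert indices
    expert_outputs: {i: E_i(x)} for i in active only
    defaults:       list of N default vectors (EMA of past outputs)
    Returns (y, grad_wrt_pi) where grad_wrt_pi[i] is d y / d pi_i.
    """
    d = len(defaults[0])
    y = [0.0] * d
    grad_wrt_pi = []
    for i in range(len(pi)):
        # Inactive experts are never evaluated; their stored default stands in.
        v = expert_outputs[i] if i in active else defaults[i]
        y = [yj + pi[i] * vj for yj, vj in zip(y, v)]
        grad_wrt_pi.append(list(v))
    return y, grad_wrt_pi

pi = [0.5, 0.3, 0.2]
active = {0, 1}
expert_outputs = {0: [1.0, 0.0], 1: [0.0, 1.0]}
defaults = [[9.0, 9.0], [9.0, 9.0], [2.0, 2.0]]  # only index 2 is used here
y, grad_wrt_pi = default_moe_forward(pi, active, expert_outputs, defaults)
```

Every entry of `grad_wrt_pi` is populated, including the one for the unselected expert 2, whereas plain Top-K routing would leave that entry at zero.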

3. Algorithmic Implementation and Overhead

The Default MoE algorithm proceeds as follows:

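Compute the routing probabilities $\pi = \mathrm{Softmax}(Wx)$ and select the active set $A = \mathrm{TopK}(\pi)$.
Evaluate $E_i(x)$ only for the active experts $i \in A$.
Substitute the default output $\hat{E}_i$ for every inactive expert $i \notin A$ when forming the layer output, so that the router receives a gradient from all $N$ experts.
Update each expert's EMA default output from the batch average of its outputs on the tokens routed to it.

Memory overhead is one $d$-dimensional EMA vector per expert ($N \times d$ values per MoE layer), a small fraction of model parameters, and runtime impact is negligible, with the overhead measured for 2B models falling further for larger models (Panda et al., 16 Apr 2025).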
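
To get a feel for why the EMA storage is small, the sketch below counts EMA values against expert parameters. Everything here is an assumption for illustration: the helper name `ema_overhead`, the layer sizes, and the premise that each expert is a two-matrix feedforward block without biases. The ratio is taken against expert parameters only, not the full model, so it is not the figure reported in the source.

```python
def ema_overhead(num_experts, d_model, d_ff, num_layers):
    """Return (number of EMA values, ratio to expert parameter count)."""
    # One d_model-sized default vector per expert per MoE layer.
    ema_values = num_layers * num_experts * d_model
    # Assumed expert shape: d_model -> d_ff -> d_model, no biases.
    expert_params = num_layers * num_experts * 2 * d_model * d_ff
    return ema_values, ema_values / expert_params

# Hypothetical configuration, not taken from the source.
values, ratio = ema_overhead(num_experts=8, d_model=1024, d_ff=4096, num_layers=12)
```

Under these assumptions the ratio simplifies to `1 / (2 * d_ff)`, independent of the number of experts or layers.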

4. Key Hyperparameters and Ablations

Critical hyperparameters include:

$N$: Number of experts.
$K$: Experts activated per token.
EMA decay $\beta$: A single typical value suffices for some expert configurations, while for others the optimal $\beta$ must be tuned within a range.
Learning rate: Default MoE remains stable at learning rates for which TopK requires smaller values.
Auxiliary load-balancing loss weight: Scales the load-balancing term added to the training loss.

Ablations indicate Default MoE outperforms TopK routing across all tested model sizes (spanning a range of hidden dimensions and parameter counts) and expert configurations. Benefits are greatest at lower sparsity. Initializing the EMA at zero outperforms random initialization. Using default outputs $\hat{E}_i$ in both forward and backward passes yields lower perplexity than restricting defaults to the backward pass. The approach scales well, with practical memory and computational costs (Panda et al., 16 Apr 2025).

5. Comparative Results and Specialization

Empirically, Default MoE achieves improvements in training stability and model quality. In the reported billion-parameter-scale comparison, Default MoE improves final benchmark scores over TopK (+2.8%), reaches the target perplexity with fewer training tokens, enables stable training with higher learning rates, and outperforms architectures such as SparseMixer in early and mid training (Panda et al., 16 Apr 2025).

The introduction of dense, approximated router gradients in Default MoE leads to faster convergence, improved load balance, and avoidance of expert collapse. Default MoE retains the hardware and computational efficiency of sparse MoE, while addressing deficiencies of pure Top-K gating (Panda et al., 16 Apr 2025).

6. Relation to Auxiliary Losses and Recent Developments

The Default MoE method is orthogonal to loss-level strategies addressing expert collapse and specialization:

Standard MoE methods employ an auxiliary load-balancing loss, $\mathcal{L}_{\text{aux}}$, that penalizes imbalance in expert assignment rates $f_i$ and routing weights $p_i$, pushing both toward a uniform distribution over the $N$ experts.
Recent work augments the default method with an orthogonality loss $\mathcal{L}_{o}$ (to diversify expert outputs) and a variance loss $\mathcal{L}_{v}$ (to increase per-expert routing-score variance), further improving specialization and downstream performance by up to 23.79% relative to classic baselines (Guo et al., 28 May 2025).
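
For concreteness, one common form of the auxiliary load-balancing loss can be sketched as follows. This uses the Switch-Transformer-style product of assignment rates and mean routing weights, which is an assumption: the source does not state the exact formula used by the cited works, and the names `load_balance_loss`, `f`, and `p` are illustrative.

```python
def load_balance_loss(probs, assignments, num_experts):
    """Switch-style balance loss: N * sum_i f_i * p_i (assumed form).

    probs:       per-token routing weights, shape T x N
    assignments: per-token lists of selected expert indices
    """
    total_slots = sum(len(a) for a in assignments)
    f = [0.0] * num_experts                 # fraction of assignments per expert
    for a in assignments:
        for i in a:
            f[i] += 1.0 / total_slots
    num_tokens = len(probs)
    p = [sum(row[i] for row in probs) / num_tokens for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Toy comparison with N = 2 experts and top-1 routing (hypothetical values).
balanced = load_balance_loss([[0.5, 0.5], [0.5, 0.5]], [[0], [1]], 2)
collapsed = load_balance_loss([[0.9, 0.1], [0.9, 0.1]], [[0], [0]], 2)
```

In this form the loss is smallest when both assignments and routing weights are spread evenly, and it grows when a single expert dominates, as in the `collapsed` case.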

The Default MoE mechanism relates primarily to routing and gradient flow, while loss-level approaches manipulate training objectives directly. Both lines of development target improved expert utilization and adaptive, discriminative routing, but via distinct mechanisms—dense proxy gradients vs. direct regularization (Panda et al., 16 Apr 2025, Guo et al., 28 May 2025).

7. Limitations and Future Directions

Default MoE's approximation depends on the quality of EMA default outputs. A poorly tuned EMA decay $\beta$ or pathological routing dynamics may impair the informativeness of default vectors, suggesting further research on adaptive EMA strategies. The approach does not directly address distributed communication bottlenecks or expert locality, which remain active areas of optimization (e.g., as in LocMoE (Li et al., 2024)). Nevertheless, by supplying dense router gradients while preserving sparse computation, Default MoE constitutes a robust enhancement to standard Top-K MoE methodology, with demonstrated benefits for training large-scale sparse models (Panda et al., 16 Apr 2025).