EMA Routing for Mixture-of-Experts
- EMA Routing is a dynamic, sparse routing paradigm in MoE models that uses EMA to update per-expert thresholds for load balancing.
- It enables causal token routing by comparing router logits with EMA-derived thresholds without needing auxiliary balancing losses.
- Empirical results show that ET routing achieves lower cross-entropy loss and requires fewer training tokens than traditional TC-MoE methods.
Exponential Moving Average (EMA) Routing, specifically formulated as Expert Threshold (ET) routing, is a population-level sparse routing paradigm for mixture-of-experts (MoE) models. It is designed to enable dynamic computation allocation and expert load balancing without requiring explicit auxiliary losses. ET routing maintains per-expert thresholds using an exponential moving average over batch statistics derived from router logits, facilitating a fully causal, batch-independent, and dynamically adaptive routing mechanism. This approach is targeted at scalable autoregressive language modeling and addresses limitations present in established routing methods such as Token-Choice MoE (TC-MoE) and Expert-Choice MoE (Sun et al., 12 Mar 2026).
1. EMA Threshold Computation and Maintenance
For each expert $e$, a threshold $\tau_e$ is maintained and updated at every training step using an exponential moving average (EMA). Given a batch of $B$ tokens and $E$ experts, the target per-expert token allocation is $k$. For expert $e$, the $k$-th largest router logit among all tokens in the batch (denoted $s_e^{(k)}$) is used as the statistic for threshold adjustment:

$$\tau_e^{(t+1)} = \beta\,\tau_e^{(t)} + (1-\beta)\,s_e^{(k)},$$

where $\beta$ denotes the EMA decay factor, typically set to $0.999$. Here, $s_e^{(k)}$ is the $k$-th order statistic of $\{z_{e,i}^{(t)}\}_{i=1}^{B}$, with $z_{e,i}^{(t)}$ the router logit assigned by expert $e$ to token $i$ at step $t$.
The initial threshold $\tau_e^{(0)}$ can be set to zero or to the appropriate order statistic from early batches; in practice, a short warm-up phase using alternative routing (e.g., Expert-Choice) is used to stabilize these threshold values before standard ET routing commences.
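The per-expert EMA update can be sketched in a few lines of Python. This is a minimal illustration with assumed names such as `update_thresholds`, not the authors' implementation:

```python
def update_thresholds(tau, logits, k, beta=0.999):
    """EMA threshold update sketch (illustrative, not the paper's code).

    tau: list of per-expert thresholds; logits[e][i]: router logit of
    expert e for token i; k: target tokens per expert; beta: EMA decay.
    """
    new_tau = []
    for e, scoresores in enumerate(logits):
        pass
    new_tau = []
    for e in range(len(logits)):
        # k-th largest logit of expert e over the current batch
        s_k = sorted(logits[e], reverse=True)[k - 1]
        new_tau.append(beta * tau[e] + (1 - beta) * s_k)
    return new_tau
```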
2. Token Routing Decision Rule
Each token $i$ with hidden state $h_i$ is processed by a router network producing logits $z_{e,i}$. The routing assignment is determined by comparing each $z_{e,i}$ to threshold $\tau_e$:

$$\text{route token } i \text{ to expert } e \iff z_{e,i} > \tau_e.$$

Hence, a token may be routed to any subset of experts, including none, one, or all $E$, enabling natural dynamic compute scaling per token. At inference, $\tau_e$ is no longer updated; routing is fully causal and depends only on each token’s representation and the frozen, EMA-learned thresholds.
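Because the decision rule compares each token to fixed thresholds, a token's expert subset can be computed independently of every other token. A minimal sketch, with illustrative names:

```python
def route_token(z_i, tau):
    """Return the experts token i is sent to: all e with z_i[e] > tau[e].

    The result may be empty (the token skips all experts) or contain
    every expert, yielding dynamic compute per token.
    """
    return [e for e in range(len(tau)) if z_i[e] > tau[e]]
```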
3. Dynamic Computation Allocation and Load Balancing
The ET routing paradigm eliminates the need for explicit auxiliary loss terms for expert balancing, in contrast with the traditional TC-MoE approach, which fixes the number of experts per token (denoted $K$) and applies auxiliary objectives to maintain load balance. In ET routing, each expert $e$ raises or lowers its threshold $\tau_e$ to target a fixed expected token share, enforcing

$$\mathbb{E}_i\!\left[\mathbb{1}\{z_{e,i} > \tau_e\}\right] = \frac{k}{B}$$

in expectation over tokens. This self-regulating property results in emergent, population-level balance. Empirically, the per-batch load for each expert fluctuates around $k$; a capacity factor $c > 1$ clips per-expert loads to at most $c\,k$ during training to mitigate out-of-memory risk, but this clipping is rarely invoked post-warm-up.
4. Causality, Batch Independence, and Inference Consistency
By updating thresholds via EMA and using independent token-to-expert comparisons, ET routing achieves a fully causal workflow: each token’s routing is computed independently, and at inference the procedure requires no batch aggregation or per-step recalibration. There is no discrepancy between training and inference routing behavior, as the same per-expert thresholds are applied throughout.
Compared to batch-dependent methods such as Expert-Choice (batch top-$k$), where the implicit selection threshold is a sample quantile whose standard deviation shrinks only as $O(1/\sqrt{B})$, ET routing fixes thresholds and tolerates $O(\sqrt{B})$ fluctuations in per-batch expert load, trading strict hardware resource predictability for consistent, batch-size-independent deployment and complete train-inference alignment.
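The threshold-variance scaling can be illustrated numerically: under batch top-$k$ selection, the implicit threshold is the batch's $k$-th order statistic, and its spread shrinks roughly as $1/\sqrt{B}$. Gaussian logits and the specific batch sizes are assumptions for illustration only:

```python
import random
import statistics

random.seed(1)

def threshold_std(B, trials=400, q=0.875):
    """Std of the implicit batch top-k threshold (the q-quantile order statistic)."""
    k = int(B * (1 - q))
    vals = []
    for _ in range(trials):
        batch = sorted((random.gauss(0.0, 1.0) for _ in range(B)), reverse=True)
        vals.append(batch[k - 1])  # k-th largest logit = implicit threshold
    return statistics.stdev(vals)

# Growing B by 16x should shrink the threshold std by roughly 4x (~1/sqrt(B)).
small, large = threshold_std(256), threshold_std(4096)
```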
5. Algorithmic Implementation
The core ET routing procedure consists of the following key steps:
- For each token $i$ and each expert $e$, compute $A_{e,i} = 1$ if $z_{e,i} > \tau_e$, else $A_{e,i} = 0$.
- At training, for each expert $e$:
  - Calculate $s_e^{(k)}$ as the $k$-th largest of the current batch's logits $\{z_{e,i}\}_{i=1}^{B}$.
  - Update $\tau_e \leftarrow \beta\,\tau_e + (1-\beta)\,s_e^{(k)}$.
- At inference, $\tau_e$ remains unchanged; routing decisions use the fixed, EMA-learned threshold.
- A capacity factor $c$ optionally clips per-batch expert assignment counts to a safe interval during training.
This algorithm operates on the assignment matrix $A \in \{0,1\}^{E \times B}$ and maintains updated per-expert thresholds throughout training.
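Putting the steps together, one training step might look like the following sketch. Names and shapes are illustrative, and the capacity-clipping policy shown (keeping the highest-logit tokens) is an assumption, not necessarily the authors' choice:

```python
def et_training_step(logits, tau, k, beta=0.999, capacity=None):
    """One ET routing training step (illustrative sketch).

    logits[e][i]: router logit of expert e for token i; tau: per-expert
    thresholds; k: target tokens per expert; capacity: optional factor c
    clipping each expert's load to at most c*k tokens.
    Returns the assignment matrix A (E x B) and the updated thresholds.
    """
    E, B = len(logits), len(logits[0])
    # Assignment: A[e][i] = 1 iff expert e's logit for token i clears tau[e].
    A = [[1 if logits[e][i] > tau[e] else 0 for i in range(B)] for e in range(E)]
    if capacity is not None:
        limit = int(capacity * k)
        for e in range(E):
            if sum(A[e]) > limit:
                # Assumed policy: keep only the limit highest-logit assigned tokens.
                keep = set(sorted((i for i in range(B) if A[e][i]),
                                  key=lambda i: logits[e][i], reverse=True)[:limit])
                A[e] = [1 if i in keep else 0 for i in range(B)]
    # EMA update of each expert's threshold toward its k-th largest batch logit.
    new_tau = [beta * tau[e] + (1 - beta) * sorted(logits[e], reverse=True)[k - 1]
               for e in range(E)]
    return A, new_tau
```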
6. Empirical Evaluation and Comparative Performance
In large-scale autoregressive pretraining with 2.4B parameter models on FineWeb-Edu, ET routing achieves superior perplexity relative to both dense and sparse TC-MoE baselines. Specifically, with models trained for 11.2B tokens:
- Dense baseline: cross-entropy loss 2.751
- TC-MoE (with auxiliary loss): 2.687
- ET routing: 2.620
The reduction of 0.067 nats in cross-entropy for ET versus TC-MoE translates to reaching the same loss with substantially fewer training tokens. After warm-up, expert saturation or starvation events (cases where capacity limits are exceeded or too few tokens are assigned) occur in well under 1% of training steps, indicating robust load-balancing behavior without auxiliary loss terms (Sun et al., 12 Mar 2026).
7. Design Choices and Practical Considerations
Key implementation features of ET routing include:
- EMA decay $\beta = 0.999$, yielding an effective window of approximately 1000 steps.
- Thresholds initialized to zero or with early-batch statistics; a 4k-step Expert-Choice warm-up enables threshold stabilization.
- At inference, capacity-based clipping is disabled.
- No auxiliary gating or balancing losses are incorporated; all token-expert assignment balance emerges from the threshold EMA mechanism.
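The quoted effective window follows directly from the decay: for EMA decay $\beta$, the effective averaging window is roughly $1/(1-\beta)$ steps, as this quick check shows:

```python
beta = 0.999
effective_window = 1.0 / (1.0 - beta)   # ~1000 steps
# An observation from t steps ago contributes weight (1 - beta) * beta**t,
# so half of the total weight lies within roughly ln(2)/(1 - beta) ~ 693 steps.
```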
In sum, ET routing replaces the hard top-$K$ per-token constraint of Token-Choice MoE and the batch-wide top-$k$ per-expert constraint of Expert-Choice MoE with a lightweight, EMA-modulated per-expert threshold. This yields dynamic compute per token, self-balanced expert loads in expectation, and fully causal inference, providing empirically validated gains in training efficiency and modeling quality over prior sparse routing techniques in large-scale language modeling (Sun et al., 12 Mar 2026).