EMA Routing for Mixture-of-Experts
- EMA Routing is a dynamic, sparse routing paradigm in MoE models that uses EMA to update per-expert thresholds for load balancing.
- It enables causal token routing by comparing router logits with EMA-derived thresholds without needing auxiliary balancing losses.
- Empirical results show that ET routing achieves lower cross-entropy loss and requires fewer training tokens than traditional TC-MoE methods.
Exponential Moving Average (EMA) Routing, specifically formulated as Expert Threshold (ET) routing, is a population-level sparse routing paradigm for mixture-of-experts (MoE) models. It is designed to enable dynamic computation allocation and expert load balancing without requiring explicit auxiliary losses. ET routing maintains per-expert thresholds using an exponential moving average over batch statistics derived from router logits, facilitating a fully causal, batch-independent, and dynamically adaptive routing mechanism. This approach is targeted at scalable autoregressive language modeling and addresses limitations present in established routing methods such as Token-Choice MoE (TC-MoE) and Expert-Choice MoE (Sun et al., 12 Mar 2026).
1. EMA Threshold Computation and Maintenance
For each expert $e$, a threshold $\tau_e$ is maintained and updated at every training step using an exponential moving average (EMA). Given a batch of $B$ tokens and $E$ experts, the target per-expert token allocation is $k$. For expert $e$, the $k$-th largest router logit among all tokens in the batch (denoted $s_e^{(k)}$) is used as the statistic for threshold adjustment:

$$\tau_e^{(t+1)} = \beta\,\tau_e^{(t)} + (1-\beta)\,s_e^{(k)},$$

where $\beta$ denotes the EMA decay factor, typically set to $0.999$. Here, $s_e^{(k)}$ is the $k$-th order statistic of $\{z_{e,i}^{(t)}\}_{i=1}^{B}$, with $z_{e,i}^{(t)}$ the router logit assigned by expert $e$ to token $i$ at step $t$.
The initial threshold $\tau_e^{(0)}$ can be set to zero or to the appropriate order statistic from early batches; in practice, a short warm-up phase using alternative routing (e.g., Expert-Choice) is used to stabilize these threshold values before standard ET routing commences.
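The per-expert EMA update can be sketched in a few lines of Python. This is a minimal illustration with assumed names such as `update_thresholds`, not the authors' implementation:

```python
def update_thresholds(tau, logits, k, beta=0.999):
    """EMA threshold update sketch (illustrative, not the paper's code).

    tau: list of per-expert thresholds; logits[e][i]: router logit of
    expert e for token i; k: target tokens per expert; beta: EMA decay.
    """
    new_tau = []
    for e, scoresores in enumerate(logits):
        pass
    new_tau = []
    for e in range(len(logits)):
        # k-th largest logit of expert e over the current batch
        s_k = sorted(logits[e], reverse=True)[k - 1]
        new_tau.append(beta * tau[e] + (1 - beta) * s_k)
    return new_tau
```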
2. Token Routing Decision Rule
Each token $i$ with hidden state $h_i$ is processed by a router network producing logits $z_{e,i}$. The routing assignment is determined by comparing each $z_{e,i}$ to threshold $\tau_e$:

$$\text{route token } i \text{ to expert } e \iff z_{e,i} > \tau_e.$$

Hence, a token may be routed to any subset of experts, including none, one, or all $E$, enabling natural dynamic compute scaling per token. At inference, $\tau_e$ is no longer updated; routing is fully causal and depends only on each token’s representation and the frozen, EMA-learned thresholds.
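Because the decision rule compares each token to fixed thresholds, a token's expert subset can be computed independently of every other token. A minimal sketch, with illustrative names:

```python
def route_token(z_i, tau):
    """Return the experts token i is sent to: all e with z_i[e] > tau[e].

    The result may be empty (the token skips all experts) or contain
    every expert, yielding dynamic compute per token.
    """
    return [e for e in range(len(tau)) if z_i[e] > tau[e]]
```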
3. Dynamic Computation Allocation and Load Balancing
The ET routing paradigm eliminates the need for explicit auxiliary loss terms for expert balancing, in contrast with the traditional TC-MoE approach, which fixes the number of experts per token (denoted $K$) and applies auxiliary objectives to maintain load balance. In ET routing, each expert $e$ raises or lowers its threshold $\tau_e$ to target a fixed expected token share, enforcing

$$\mathbb{E}_i\!\left[\mathbb{1}\{z_{e,i} > \tau_e\}\right] = \frac{k}{B}$$

in expectation over tokens. This self-regulating property results in emergent, population-level balance. Empirically, the per-batch load for each expert fluctuates around $k$; a capacity factor $c > 1$ clips per-expert loads to at most $c\,k$ during training to mitigate out-of-memory risk, but this clipping is rarely invoked post-warm-up.
4. Causality, Batch Independence, and Inference Consistency
By updating thresholds via EMA and using independent token-to-expert comparisons, ET routing achieves a fully causal workflow: each token’s routing is computed independently, and at inference the procedure requires no batch aggregation or per-step recalibration. There is no discrepancy between training and inference routing behavior, as the same per-expert thresholds are applied throughout.
Compared to batch-dependent methods such as Expert-Choice (batch top-$k$), where the implicit selection threshold is a sample quantile whose standard deviation shrinks only as $O(1/\sqrt{B})$, ET routing fixes thresholds and tolerates $O(\sqrt{B})$ fluctuations in per-batch expert load, trading strict hardware resource predictability for consistent, batch-size-independent deployment and complete train-inference alignment.
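The threshold-variance scaling can be illustrated numerically: under batch top-$k$ selection, the implicit threshold is the batch's $k$-th order statistic, and its spread shrinks roughly as $1/\sqrt{B}$. Gaussian logits and the specific batch sizes are assumptions for illustration only:

```python
import random
import statistics

random.seed(1)

def threshold_std(B, trials=400, q=0.875):
    """Std of the implicit batch top-k threshold (the q-quantile order statistic)."""
    k = int(B * (1 - q))
    vals = []
    for _ in range(trials):
        batch = sorted((random.gauss(0.0, 1.0) for _ in range(B)), reverse=True)
        vals.append(batch[k - 1])  # k-th largest logit = implicit threshold
    return statistics.stdev(vals)

# Growing B by 16x should shrink the threshold std by roughly 4x (~1/sqrt(B)).
small, large = threshold_std(256), threshold_std(4096)
```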
5. Algorithmic Implementation
The core ET routing procedure consists of the following key steps:
- For each token $i$ and each expert $e$, compute $A_{e,i} = 1$ if $z_{e,i} > \tau_e$, else $A_{e,i} = 0$.
- At training, for each expert $e$:
  - Calculate $s_e^{(k)}$ as the $k$-th largest of the current batch's logits $\{z_{e,i}\}_{i=1}^{B}$.
  - Update $\tau_e \leftarrow \beta\,\tau_e + (1-\beta)\,s_e^{(k)}$.
- At inference, $\tau_e$ remains unchanged; routing decisions use the fixed, EMA-learned threshold.
- A capacity factor $c$ optionally clips per-batch expert assignment counts to a safe interval during training.
This algorithm operates on the assignment matrix $A \in \{0,1\}^{E \times B}$ and maintains updated per-expert thresholds throughout training.
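Putting the steps together, one training step might look like the following sketch. Names and shapes are illustrative, and the capacity-clipping policy shown (keeping the highest-logit tokens) is an assumption, not necessarily the authors' choice:

```python
def et_training_step(logits, tau, k, beta=0.999, capacity=None):
    """One ET routing training step (illustrative sketch).

    logits[e][i]: router logit of expert e for token i; tau: per-expert
    thresholds; k: target tokens per expert; capacity: optional factor c
    clipping each expert's load to at most c*k tokens.
    Returns the assignment matrix A (E x B) and the updated thresholds.
    """
    E, B = len(logits), len(logits[0])
    # Assignment: A[e][i] = 1 iff expert e's logit for token i clears tau[e].
    A = [[1 if logits[e][i] > tau[e] else 0 for i in range(B)] for e in range(E)]
    if capacity is not None:
        limit = int(capacity * k)
        for e in range(E):
            if sum(A[e]) > limit:
                # Assumed policy: keep only the limit highest-logit assigned tokens.
                keep = set(sorted((i for i in range(B) if A[e][i]),
                                  key=lambda i: logits[e][i], reverse=True)[:limit])
                A[e] = [1 if i in keep else 0 for i in range(B)]
    # EMA update of each expert's threshold toward its k-th largest batch logit.
    new_tau = [beta * tau[e] + (1 - beta) * sorted(logits[e], reverse=True)[k - 1]
               for e in range(E)]
    return A, new_tau
```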
6. Empirical Evaluation and Comparative Performance
In large-scale autoregressive pretraining with 2.4B parameter models on FineWeb-Edu, ET routing achieves superior perplexity relative to both dense and sparse TC-MoE baselines. Specifically, with models trained for 11.2B tokens:
- Dense baseline: cross-entropy loss 2.751
- TC-MoE (with auxiliary loss): 2.687
- ET routing: 2.620
The reduction of 0.067 nats in cross-entropy for ET versus TC-MoE translates to reaching the same loss with substantially fewer training tokens. After warm-up, expert saturation or starvation events (cases where capacity limits are exceeded or too few tokens are assigned) occur in well under 1% of training steps, indicating robust load-balancing behavior without auxiliary loss terms (Sun et al., 12 Mar 2026).
7. Design Choices and Practical Considerations
Key implementation features of ET routing include:
- EMA decay $\beta = 0.999$, yielding an effective window of approximately 1000 steps.
- Thresholds initialized to zero or with early-batch statistics; a 4k-step Expert-Choice warm-up enables threshold stabilization.
- At inference, capacity-based clipping is disabled.
- No auxiliary gating or balancing losses are incorporated; all token-expert assignment balance emerges from the threshold EMA mechanism.
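The quoted effective window follows directly from the decay: for EMA decay $\beta$, the effective averaging window is roughly $1/(1-\beta)$ steps, as this quick check shows:

```python
beta = 0.999
effective_window = 1.0 / (1.0 - beta)   # ~1000 steps
# An observation from t steps ago contributes weight (1 - beta) * beta**t,
# so half of the total weight lies within roughly ln(2)/(1 - beta) ~ 693 steps.
```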
In sum, ET routing replaces the hard top-$K$ per-token constraint of Token-Choice MoE and the batch-wide top-$k$ per-expert constraint of Expert-Choice MoE with a lightweight, EMA-modulated per-expert threshold. This yields dynamic compute per token, self-balanced expert loads in expectation, and fully causal inference, providing empirically validated gains in training efficiency and modeling quality over prior sparse routing techniques in large-scale language modeling (Sun et al., 12 Mar 2026).