
EMA Routing for Mixture-of-Experts

Updated 22 April 2026
  • EMA Routing is a dynamic, sparse routing paradigm in MoE models that uses EMA to update per-expert thresholds for load balancing.
  • It enables causal token routing by comparing router logits with EMA-derived thresholds without needing auxiliary balancing losses.
  • Empirical results show ET routing reduces cross-entropy loss and training tokens compared to traditional TC-MoE methods.

Exponential Moving Average (EMA) Routing, specifically formulated as Expert Threshold (ET) routing, is a population-level sparse routing paradigm for mixture-of-experts (MoE) models. It is designed to enable dynamic computation allocation and expert load balancing without requiring explicit auxiliary losses. ET routing maintains per-expert thresholds using an exponential moving average over batch statistics derived from router logits, facilitating a fully causal, batch-independent, and dynamically adaptive routing mechanism. This approach is targeted at scalable autoregressive language modeling and addresses limitations present in established routing methods such as Token-Choice MoE (TC-MoE) and Expert-Choice MoE (Sun et al., 12 Mar 2026).

1. EMA Threshold Computation and Maintenance

For each expert e, a threshold T_e^{(t)} is maintained and updated at every training step t using an exponential moving average (EMA). Given a batch of N tokens and E experts, the target per-expert token allocation is k = N/E. For expert e, the k-th largest router logit among all tokens in the batch (denoted f_e^{(t)}) is the statistic used for threshold adjustment:

T_e^{(t)} = β · T_e^{(t-1)} + (1 − β) · f_e^{(t)}

where β ∈ [0, 1) denotes the EMA decay factor, typically set to 0.999. Here, f_e^{(t)} is the k-th order statistic of the current batch's router logits {s_{e,i}}, with s_{e,i} as the router logit assigned by expert e to token i at step t.

The initial threshold T_e^{(0)} can be set to zero or to the appropriate order statistic from early batches; in practice, a short warm-up phase using an alternative routing scheme (e.g., Expert-Choice) is used to stabilize the threshold values before standard ET routing commences.
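The threshold update above can be sketched in a few lines of NumPy; the function and variable names here are illustrative, not taken from the paper's code, and the batch is assumed to arrive as an (N, E) matrix of router logits:

```python
import numpy as np

def update_thresholds(T, logits, beta=0.999):
    """One EMA threshold update (sketch).

    T      : (E,) current per-expert thresholds T_e^{(t-1)}
    logits : (N, E) router logits s_{e,i} for the current batch
    beta   : EMA decay factor
    """
    N, E = logits.shape
    k = N // E                           # target per-expert token allocation
    f = np.sort(logits, axis=0)[-k, :]   # k-th largest logit per expert
    # EMA update: T_e <- beta * T_e + (1 - beta) * f_e
    return beta * T + (1.0 - beta) * f
```

With beta close to 1, each batch nudges the threshold only slightly, which is what makes the statistic stable across batches.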

2. Token Routing Decision Rule

Each token i with hidden state h_i is processed by a router network producing logits s_{e,i} for every expert e. The routing assignment is determined by comparing each s_{e,i} to the threshold T_e: token i is routed to expert e if and only if s_{e,i} > T_e. Hence, a token may be routed to any subset of experts, including none, one, or all E, enabling natural dynamic compute scaling per token. At inference, T_e is no longer updated; routing is fully causal and depends only on each token's representation and the frozen, EMA-learned thresholds.
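The decision rule reduces to a single broadcasted comparison; this minimal sketch (names illustrative) returns the binary assignment mask:

```python
import numpy as np

def route_tokens(logits, T):
    """Threshold routing: token i goes to expert e iff s_{e,i} > T_e.

    logits : (N, E) router logits
    T      : (E,) per-expert thresholds
    Returns a boolean (N, E) mask; a row may select zero, one,
    or several experts (dynamic compute per token).
    """
    return logits > T  # T broadcasts across the token dimension
```

Because each row is compared independently against fixed thresholds, the rule needs no batch aggregation at inference time.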

3. Dynamic Computation Allocation and Load Balancing

The ET routing paradigm eliminates the need for explicit auxiliary loss terms for expert balancing, contrasting with the traditional TC-MoE approach, which fixes the number of experts per token (top-k) and applies auxiliary objectives to maintain load balance. In ET routing, each expert e raises or lowers its threshold T_e to target a fixed expected token share, enforcing

E[ |{ i : s_{e,i} > T_e }| ] = N / E

in expectation over tokens. This self-regulating property results in emergent, population-level balance. Empirically, the per-batch load for each expert fluctuates around N/E; during training, a capacity factor c clips per-expert loads to at most c · N/E to mitigate out-of-memory risk, but this is rarely invoked post-warm-up.
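Capacity clipping might be implemented as below; this is a hypothetical sketch that assumes an over-capacity expert keeps its highest-logit tokens, a common convention that the source does not explicitly specify:

```python
import numpy as np

def apply_capacity(mask, logits, c=2.0):
    """Clip each expert's per-batch load to at most c * N / E.

    mask   : (N, E) boolean assignment mask from threshold routing
    logits : (N, E) router logits, used to rank tokens when clipping
    c      : capacity factor (assumed value; not from the paper)
    """
    N, E = mask.shape
    cap = int(c * N / E)
    clipped = mask.copy()
    for e in range(E):
        idx = np.flatnonzero(mask[:, e])
        if idx.size > cap:
            # keep the `cap` tokens with the highest logits, drop the rest
            drop = idx[np.argsort(logits[idx, e])[:-cap]]
            clipped[drop, e] = False
    return clipped
```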

4. Causality, Batch Independence, and Inference Consistency

By updating thresholds via EMA and using independent token-to-expert comparisons, ET routing achieves a fully causal workflow: each token’s routing is computed independently, and at inference the procedure requires no batch aggregation or per-step recalibration. There is no discrepancy between training and inference routing behavior, as the same per-expert thresholds are applied throughout.

Compared to batch-dependent methods like Expert-Choice (batch top-k per expert), whose implicit threshold is a per-batch order statistic whose variance depends on batch size, ET routing fixes its thresholds and instead tolerates stochastic fluctuations in per-batch expert load, trading hardware resource predictability for consistent, batch-size-independent deployment and complete train-inference alignment.

5. Algorithmic Implementation

The core ET routing procedure consists of the following key steps:

  • For each token i and each expert e, set g_{e,i} = 1 if s_{e,i} > T_e, else g_{e,i} = 0.
  • At training, for each expert e:
    • Compute f_e^{(t)} as the k-th largest of the current batch's logits {s_{e,i}}.
    • Update T_e^{(t)} = β · T_e^{(t-1)} + (1 − β) · f_e^{(t)}.
  • At inference, T_e remains unchanged; routing decisions use the fixed EMA-learned threshold.
  • A capacity factor optionally clips per-batch expert assignment counts to a safe interval during training.

This algorithm operates on the binary assignment matrix [g_{e,i}] and maintains updated per-expert thresholds throughout training.
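Putting the steps together, one training step might look like the following self-contained NumPy sketch; all names, and the rule of keeping the highest-logit tokens when clipping, are assumptions for illustration, not the paper's code:

```python
import numpy as np

def et_training_step(logits, T, beta=0.999, c=2.0):
    """One ET-routing training step: route, clip, update thresholds.

    logits : (N, E) router logits for the batch
    T      : (E,) per-expert thresholds from the previous step
    Returns the (N, E) assignment mask and the updated thresholds.
    """
    N, E = logits.shape
    k = N // E
    mask = logits > T                      # causal per-token routing
    cap = int(c * k)                       # capacity clipping (training only)
    for e in range(E):
        idx = np.flatnonzero(mask[:, e])
        if idx.size > cap:
            mask[idx[np.argsort(logits[idx, e])[:-cap]], e] = False
    f = np.sort(logits, axis=0)[-k, :]     # k-th order statistic per expert
    T = beta * T + (1.0 - beta) * f        # EMA threshold update
    return mask, T
```

At inference, only the `logits > T` comparison runs: thresholds are frozen and the clipping loop is disabled.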

6. Empirical Evaluation and Comparative Performance

In large-scale autoregressive pretraining with 2.4B parameter models on FineWeb-Edu, ET routing achieves superior perplexity relative to both dense and sparse TC-MoE baselines. Specifically, with models trained for 11.2B tokens:

  • Dense baseline: cross-entropy loss 2.751
  • TC-MoE (with auxiliary loss): 2.687
  • ET routing: 2.620

The reduction of 0.067 nats in cross-entropy for ET versus TC-MoE translates to reaching the same loss with substantially fewer training tokens. After warm-up, expert saturation or starvation events—cases where capacity limits are exceeded or too few tokens are assigned—occur in well under 1% of training steps, indicating robust load-balancing behavior without auxiliary loss terms (Sun et al., 12 Mar 2026).

7. Design Choices and Practical Considerations

Key implementation features of ET routing include:

  • EMA decay β = 0.999, yielding an effective window of approximately 1000 steps.
  • Thresholds initialized to zero or with early-batch statistics; a 4k-step Expert-Choice warm-up enables threshold stabilization.
  • At inference, capacity-based clipping is disabled.
  • No auxiliary gating or balancing losses are incorporated; all token-expert assignment balance emerges from the threshold EMA mechanism.
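As a quick sanity check on the decay setting, the effective averaging window of an EMA with decay β is roughly 1/(1 − β); the snippet below applies this standard relation (it is a rule of thumb, not a quantity defined in the source):

```python
beta = 0.999
window = 1.0 / (1.0 - beta)  # effective EMA window, in training steps
```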

In sum, ET routing replaces the hard "top-k" per-token constraint of Token-Choice MoE and the batch-wide "top-k" per-expert constraint of Expert-Choice MoE with a lightweight, EMA-modulated per-expert threshold. This yields dynamic compute per token, self-balanced expert loads in expectation, and fully causal inference, providing empirically validated gains in training efficiency and modeling quality compared to prior sparse routing techniques in large-scale language modeling (Sun et al., 12 Mar 2026).
