Three Phases of Expert Routing: How Load Balance Evolves During Mixture-of-Experts Training

Published 5 Apr 2026 in cs.LG, cs.AI, and cs.MA | (2604.04230v1)

Abstract: We model Mixture-of-Experts (MoE) token routing as a congestion game with a single effective parameter, the congestion coefficient gamma_eff, that quantifies the balance-quality tradeoff. Tracking gamma_eff across training checkpoints of two open-source MoE models, OLMoE-1B-7B (20 checkpoints, with dense sampling in the surge region) and OpenMoE-8B (6 checkpoints), reveals a three-phase trajectory: a surge phase where the router learns to balance load (gamma_eff: 14 to 36-39, peaking in the step 30K-40K region), a stabilization phase where experts specialize under steady balance (B_0: 2.4 to 2.3, steps 100K-400K), and a relaxation phase where the router trades balance for quality as experts differentiate (gamma_eff: 27 to 9, steps 400K-1.2M). This non-monotone trajectory, invisible to post-hoc analysis of converged models, reveals that early MoE training prioritizes balance while late training prioritizes quality. The theoretical framework is honest about its limits: the single-type equilibrium reduces to temperature-scaled softmax (held-out L1: MFG = 0.199 vs. softmax = 0.200). The game is not a better predictor; it reveals what the temperature means and, critically, how that temperature evolves. We complement the dynamics with an effective congestion decomposition, a multi-type extension that improves load prediction via token clustering on all 16 layers (mean: 30%), scope diagnostics (K/M, epsilon_l), and robustness verification across four independent quality estimators (r >= 0.89). All confidence intervals are from bootstrap resampling over 50 independent text batches.

Abstract PDF Upgrade to Chat

Authors (1)

Charafeddine Mouzouni

Summary

The paper introduces a congestion game framework for MoE routing that identifies surge, stabilization, and relaxation phases through effective congestion parameter tracking.
It employs a multi-type mean-field game extension to handle token heterogeneity, ensuring a unique equilibrium and a 30% improvement in L1 error for held-out predictions.
Empirical results demonstrate a non-monotonic evolution of γ_eff, highlighting a critical tradeoff between achieving load balance and promoting expert specialization.

Three-Phase Dynamics in Mixture-of-Experts Routing: A Congestion Game Perspective

Overview

This work presents a formal framework rooted in congestion and mean-field game (MFG) theory to analyze the evolution of load balancing during Mixture-of-Experts (MoE) model pretraining. By introducing and empirically tracking an effective congestion coefficient $\gamma_{\mathrm{eff}}$ throughout the training trajectory of two open-source models (OLMoE-1B-7B and OpenMoE-8B), the paper uncovers three distinct training phases—surge, stabilization, and relaxation—characterized by specific trends in the tradeoff between expert load balance and expert specialization. The framework is extended with multi-type MFGs to account for token heterogeneity, and the authors carefully articulate both the theoretical significance and the predictive limits of their congestion-based modeling.

Theoretical Model: MoE Routing as a Congestion Game

Mixture-of-Experts architectures route tokens to a subset of $M$ available experts. The routing process can be conceptualized as a finite-state congestion game where:

Players: Tokens.
Resources: Experts.
Payoff structure: Each token's payoff depends on expert quality and a congestion penalty (load imbalance).
Population regime: With a large number of tokens ( $N=2,048$ –$32,768$), mean-field game theory applies.

The per-expert cost function takes the form

$\ell(i, \mu) = -q_i + \gamma \mu_i,$

with $q_i$ representing expert quality and $\gamma$ the congestion coefficient. The resultant equilibrium solution is a temperature-scaled softmax over effective expert qualities, with the “temperature” directly tied to the congestion parameter. The model exhibits a strictly convex Rosenthal potential, guaranteeing uniqueness and interiority of the equilibrium.

Crucially, the static single-type equilibrium does not outperform temperature-scaled softmax in held-out load prediction. Its principal explanatory value lies in revealing the meaning and evolution of the temperature parameter during learning.

Training Dynamics: Discovery of Three Phases

Tracking the fitted $\gamma_{\mathrm{eff}}$ parameter across 20 training checkpoints for OLMoE-1B-7B and six for OpenMoE-8B, a highly non-monotonic, three-phase trajectory emerges. This is the central empirical result.

Figure 1: Effective congestion $\gamma_{\mathrm{eff}$ across OLMoE-1B-7B training reveals surge, stabilization, and relaxation phases; the inverted-U trajectory is undetectable via static, post-hoc analysis.

Surge Phase (Early, Steps 5K–50K): The router rapidly escalates enforcement of load balancing, as evidenced by $\gamma_{\mathrm{eff}}$ rising from $M$ 0 to a peak in the $M$ 1– $M$ 2 range. Routing entropy increases, while expert quality spread ( $M$ 3) sharply decreases due to initial convergence.
Stabilization Phase (Mid, Steps 100K–400K): $M$ 4 plateaus (24–28), indicating a steady-state load balance regime while individual experts continue to specialize internally (with $M$ 5 dropping further). The router maintains a high-entropy, near-uniform distribution.
Relaxation Phase (Late, Steps 500K–1.2M): The learned router increasingly prioritizes assigning tokens to differentiated experts over maintaining strict balance; $M$ 6 declines from $M$ 7 to $M$ 8, while $M$ 9 is largely flat.

This trajectory is invariant to methods of expert quality estimation and is replicated across models with different $N=2,048$ 0, $N=2,048$ 1, and layer patterns, including OpenMoE-8B.

Figure 2: The observed three-phase behavior of $N=2,048$ 2 is robust across different quality estimation methods ( $N=2,048$ 3), excluding proxy artifacts as an explanation.

Analytical Extensions and Empirical Validation

Decomposition of Effective Congestion

$N=2,048$ 4 consists of an explicit component $N=2,048$ 5 from the routing auxiliary loss, and an implicit component $N=2,048$ 6 absorbed by the optimizer during training:

$N=2,048$ 7

Empirically, $N=2,048$ 8, with convergence ratios frequently $N=2,048$ 9–$32,768$0.

Multi-Type MFG for Token Heterogeneity

A $32,768$1-type extension introduces token heterogeneity, where each token type has its own expert quality vector and population share. The coupled equilibrium, strictly convex in its multi-type Rosenthal potential, yields

Uniqueness and interiority guarantees,
Improved held-out layerwise predictions (avg. $32,768$2 decrease in $32,768$3 error).

However, independent per-cluster softmax performs even better on well-balanced layers, underlining that the multi-type MFG’s strength is structural (guaranteed uniqueness and coupling), not necessarily predictive for already uniform distributions.

Scope Diagnostics and Limiting Factors

Anti-concentration threshold $32,768$4 is defined analytically, establishing a “safety margin” for avoiding expert collapse.
Top-$32,768$5 truncation bounds are derived; the static MFG model only holds significant predictive advantage when $32,768$6 is not too small.
Continuation spread ($32,768$7) quantifies the error introduced by per-layer myopic independence, correlating with observed fit degradation.

Empirical Results

The non-monotonicity of $32,768$8 (factor $32,768$9 between peak and final values) is significant and outside the range of sampling noise.
The explicit auxiliary loss is a small fraction of total balance pressure—the optimizer’s dynamics dominate.
The three-phase pattern is specific to pretraining; it is absent in later-stage annealing/fine-tuning checkpoints.
The static equivalence between MFG equilibrium and temperature-scaled softmax is confirmed numerically to within $\ell(i, \mu) = -q_i + \gamma \mu_i,$ 0 in $\ell(i, \mu) = -q_i + \gamma \mu_i,$ 1 error.
Across diverse architectures, the static MFG model is only useful for $\ell(i, \mu) = -q_i + \gamma \mu_i,$ 2 regimes.

Implications and Future Directions

The three-phase dynamics suggest that early-stage training should be attuned to fostering balance (as indicated by rapidly growing $\ell(i, \mu) = -q_i + \gamma \mu_i,$ 3), possibly with strong auxiliary loss, while late-stage training could relax balance constraints to allow for expert selectivity. The identification and monitoring of $\ell(i, \mu) = -q_i + \gamma \mu_i,$ 4 throughout pretraining could serve as a diagnostic for the health of routing dynamics and early warning of expert collapse (as $\ell(i, \mu) = -q_i + \gamma \mu_i,$ 5 approaches $\ell(i, \mu) = -q_i + \gamma \mu_i,$ 6).

Theory and empirical results imply MoE optimizers build much more internalized balance than explicit objectives would indicate. This raises open questions for architectural design and balance scheduling:

Can explicit control or adaptive scheduling of $\ell(i, \mu) = -q_i + \gamma \mu_i,$ 7 (and thus $\ell(i, \mu) = -q_i + \gamma \mu_i,$ 8) during training yield better utilization or improved specialization?
How does token population structure affect routing dynamics and specialization in large-scale, non-uniform data distributions?
Is the observed three-phase pattern universal across scales and architectures, including production-scale sparse MoE systems?

Conclusion

This work reframes MoE token routing as a mean-field congestion game, exposing the non-monotone, triphasic evolution of the balance-quality tradeoff during pretraining. While static equilibrium analysis reduces to familiar softmax form, tracking the effective congestion parameter reveals a tension in MoE optimization: balance is prioritized early, specialization and quality later. The methodology quantitatively characterizes the dynamics, provides both practical diagnostics and theoretical insights, and motivates new directions for load balancing and training strategy in scalable MoE architectures.