BMoE: Equitable Load Balancing in MoE

Updated 7 January 2026
  • BMoE is a Mixture-of-Experts framework that balances expert activation to prevent routing collapse and under-training.
  • It employs gradient-free biasing and loss-free load balancing to optimize perplexity while maintaining statistical and hardware efficiency.
  • Advanced BMoE methods include grouped expert allocation and assignment-constrained strategies, ensuring robust performance in federated and multi-task settings.

A Balanced Mixture-of-Experts (BMoE) is a Mixture-of-Experts (MoE) architecture or training method that ensures equitable distribution of computational and learning load across all experts. Balancing expert utilization is necessary for statistical efficiency and hardware throughput, and it avoids pathological scenarios such as routing collapse, in which a few experts monopolize the data while others remain under-trained or idle. Recent research has formalized various BMoE strategies, including gradient-free load balancing via dynamic biasing of routing scores, combinatorial group-based expert allocation, gradient balancing in multi-task settings, optimal assignment strategies in federated learning environments, and unbiased stochastic balancing estimators.

1. Motivations for Balanced Mixture-of-Experts

MoE architectures scale parameter counts by introducing $N$ modular "experts" (usually feed-forward sub-networks), but only activating a small subset (Top-$K$) for each token or input. If the routing mechanism is not explicitly balanced, learned or heuristic gates often result in highly uneven expert usage: a small fraction of experts process most tokens, undermining both parameter efficiency and device-level system throughput (Wang et al., 2024, Tang et al., 27 May 2025). This 'routing collapse' has negative implications:

  • Underutilized experts receive few or no gradient updates, impeding learning.
  • Overloaded experts become bottlenecks in distributed inference and training.
  • Hardware parallelization suffers, as expert assignments across devices become skewed.

For multi-task or federated learning, imbalance is exacerbated: tasks or clients may be inherently non-i.i.d., producing natural load skew unless fine-grained routing and global utilization controls are in place (Zhang et al., 28 Dec 2025, Huang et al., 2023). BMoE frameworks explicitly target these issues by guaranteeing or statistically encouraging balanced expert assignments per batch, per device, or per task/client.

2. Traditional Loss-Based Balancing Approaches and Limitations

Canonical approaches to BMoE, notably in GShard and Switch Transformer, employ an auxiliary loss to penalize expert load imbalance. The typical formulation is

$\mathcal{L}_{\rm Balance} = \alpha \sum_{i=1}^N f_i P_i \quad\text{where}\quad f_i = \frac{N}{K T}\sum_{t=1}^T \mathbbm{1}(\text{token }t\to i), \;\; P_i = \frac{1}{T}\sum_{t=1}^T s_{i,t}$

with $s_{i,t}$ the gating score of expert $i$ for token $t$, $f_i$ the activation frequency, and $P_i$ the average gating score for expert $i$ (Wang et al., 2024).

However, this method introduces a non-negligible trade-off: the auxiliary loss imparts gradients that can interfere with the primary task gradient $\nabla_{\theta} \mathcal{L}_{\rm LM}$. Empirically, a small balancing coefficient $\alpha$ leads to poor balance, and a large $\alpha$ degrades perplexity. This establishes a dilemma, as shown by the monotonic trade-off between expert load deviation (MaxVio) and held-out perplexity (Wang et al., 2024). Similar limitations are observed in differentiable routing schemes and in MoGE's within-group balancing: perfect balance requires auxiliary regularization, but a proliferation of loss terms complicates tuning and optimization stability (Tang et al., 27 May 2025).
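
As a concrete illustration, the following minimal PyTorch sketch computes this auxiliary loss from a matrix of post-softmax gating scores; the function name, tensor shapes, and default coefficient are illustrative assumptions, not code from the cited papers.

import torch

def aux_balance_loss(gates: torch.Tensor, top_k: int, alpha: float = 0.01) -> torch.Tensor:
    # gates: [T, N] post-softmax gating scores s_{i,t} for T tokens over N experts.
    T, N = gates.shape
    topk_idx = gates.topk(top_k, dim=-1).indices                   # Top-K expert ids per token
    assign = torch.zeros_like(gates).scatter_(1, topk_idx, 1.0)    # indicator 1(token t -> expert i)
    f = assign.sum(dim=0) * N / (top_k * T)                        # activation frequency f_i
    P = gates.mean(dim=0)                                          # average gating score P_i
    return alpha * (f * P).sum()                                   # alpha * sum_i f_i * P_i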

3. Gradient-Free Biasing: Auxiliary-Loss-Free Load Balancing

Loss-Free Balancing (LFB), introduced in "Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts" (Wang et al., 2024), achieves robust expert balance by dynamic, gradient-isolated adjustment of expert-wise biases:

  • For each token $t$ and expert $i$, routing logits are biased as $\hat{s}_{i,t} = s_{i,t} + b_i$.
  • Top-$K$ routing is performed over the biased scores $\hat{s}_{i,t}$, but downstream computation and backpropagation depend only on the original $s_{i,t}$. No gradients flow through $b_i$.
  • After each batch, the observed expert utilization $c_i$ is compared to the prescribed target $\bar{c} = \frac{K \cdot \text{batch size}}{N}$. The bias is updated by $b_i^{(t+1)} = b_i^{(t)} + \eta\,\mathrm{sign}(\bar{c} - c_i)$, using a proportional controller.

This procedure ensures that over time, underutilized experts become more likely to be selected (their bias increases), while overloaded experts' routing probability is decreased. Since the bias update is gradient-isolated, the method avoids interference with the language modeling loss. Empirical results (Wang et al., 2024) show that Loss-Free Balancing attains both significantly lower perplexity and 10–20× better global load balance (as measured by MaxVio) than auxiliary-loss methods, with consistently low per-batch imbalance.

Loss-Free Balancing Algorithm:

Initialize: for all experts i, b_i ← 0
For each training batch B:
  1. For each token t in B:
       Compute raw gating scores s_{i,t}
       Form biased scores Ŝ_{i,t} = s_{i,t} + b_i
       Select Top-K experts for t based on Ŝ_{i,t}
       Compute MoE output using selected experts and s_{i,t}
  2. Backpropagate LM loss; update θ
  3. Count assignments c_i per expert
  4. Update biases: b_i ← b_i + η · sign( (K|B|/N) - c_i )
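
A compact PyTorch sketch of the same gradient-isolated procedure is given below; the class, method names, and update rate are illustrative assumptions, not the authors' released implementation.

import torch

class LossFreeBalancer:
    # Maintains per-expert biases b_i entirely outside the autograd graph.
    def __init__(self, num_experts: int, top_k: int, update_rate: float = 1e-3):
        self.bias = torch.zeros(num_experts)
        self.top_k = top_k
        self.update_rate = update_rate

    def route(self, scores: torch.Tensor):
        # scores: [T, N] raw gating scores s_{i,t}.
        biased = scores.detach() + self.bias                # biased scores; no gradient reaches b_i
        topk_idx = biased.topk(self.top_k, dim=-1).indices  # Top-K selection on the biased scores
        weights = scores.gather(1, topk_idx)                # downstream compute uses original s_{i,t}
        return topk_idx, weights

    def update(self, topk_idx: torch.Tensor):
        # Post-batch bias adjustment toward the uniform target load K*|B|/N.
        counts = torch.bincount(topk_idx.flatten(), minlength=self.bias.numel()).float()
        target = topk_idx.numel() / self.bias.numel()
        self.bias += self.update_rate * torch.sign(target - counts)

In use, route would replace the plain Top-K selection inside the MoE layer, and update would be called once per batch after the optimizer step.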

4. Grouped and Assignment-Constrained BMoE

Mixture of Grouped Experts (MoGE):

For device-level balance in distributed MoE, MoGE (Tang et al., 27 May 2025) groups the $N$ experts into $M$ disjoint sets, each mapped to one device. For each input, exactly $K' = K/M$ experts are activated from each group. The gating proceeds by:

  • Computing a global softmax over all experts, $S = \mathrm{Softmax}(W^\top h)$.
  • Partitioning $S$ into $M$ sub-vectors $S_1, \dots, S_M$.
  • In each group $j$, selecting the Top-$K'$ scores (zeroing the rest), guaranteeing per-device symmetry.

Balance is mathematically enforced: the imbalance score satisfies $\mathrm{IS} = 0$. An auxiliary loss over within-group activation frequencies further ensures uniform usage within each group. MoGE supports high system throughput, e.g., Pangu Pro MoE achieves 1148–1528 tokens/s/card (with speculative decoding) and exhibits near-uniform expert usage, outperforming parameter-matched dense models on Ascend NPUs (Tang et al., 27 May 2025).
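
The grouped gating rule can be sketched as follows in PyTorch; this is a schematic of the routing described above under assumed tensor shapes, not the Pangu Pro MoE implementation.

import torch

def grouped_topk_gate(h: torch.Tensor, W: torch.Tensor, num_groups: int, k_per_group: int):
    # h: [T, d] token representations; W: [d, N] router weights, N divisible by num_groups.
    T, N = h.shape[0], W.shape[1]
    scores = torch.softmax(h @ W, dim=-1)                    # global softmax S over all N experts
    grouped = scores.view(T, num_groups, N // num_groups)    # partition S into M sub-vectors
    topk = grouped.topk(k_per_group, dim=-1)                 # Top-K' within each group
    gates = torch.zeros_like(grouped).scatter_(-1, topk.indices, topk.values)
    return gates.view(T, N)                                  # exactly K' experts active per group/device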

FLEX-MoE in federated learning (Zhang et al., 28 Dec 2025): Federated BMoE faces additional complexity—non-i.i.d. partitioning, resource constraints (edge devices may store only a subset of experts), and inherent client/task heterogeneity. FLEX-MoE establishes system-level balance via the following:

  • "Fitness scores" Qt(c,e)Q_t(c, e) measure expert suitability for client cc at round tt, updated via EMA of per-round feedback (accuracy/loss).
  • Expert assignment is optimized (per round) by solving an integer linear program (ILP) that maximizes cumulative fitness c,eQt(c,e)Xc,e\sum_{c, e} Q_t(c, e) \cdot X_{c,e}, subject to hard per-expert lower/upper bounds Lnew(e)cDcXc,eΓnew(e)L_{new}(e) \leq \sum_c |D_c| X_{c,e} \leq \Gamma_{new}(e).
  • Tight δratio\delta_{ratio} constraint ensures per-round expert load discrepancy is at most ±δratioτ±\delta_{ratio} \cdot \tau.
  • Comprehensive experiments demonstrate that FLEX-MoE reduces Coefficient of Variation from 0.2–0.3 (greedy assignment) to 0.003\approx 0.003, outperforms random and greedy selection by +2+26%6\% accuracy, and is robust under non-i.i.d. data.
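
A simplified version of such an assignment ILP can be expressed with SciPy's MILP interface as below; only the per-expert load bounds named above are encoded, and the remaining constraints of the full FLEX-MoE formulation are omitted. All names are illustrative.

import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def assign_experts(Q: np.ndarray, data_sizes: np.ndarray, L: np.ndarray, U: np.ndarray):
    # Q: [C, E] fitness scores Q_t(c, e); data_sizes: [C] client dataset sizes |D_c|;
    # L, U: [E] per-expert lower/upper load bounds. Returns a binary assignment X of shape [C, E].
    C, E = Q.shape
    objective = -Q.flatten()                      # milp minimizes, so negate the cumulative fitness
    A = np.zeros((E, C * E))                      # one load-bound row per expert
    for e in range(E):
        A[e, e::E] = data_sizes                   # coefficient |D_c| on variable X_{c,e}
    result = milp(
        c=objective,
        constraints=LinearConstraint(A, L, U),    # L(e) <= sum_c |D_c| X_{c,e} <= U(e)
        integrality=np.ones(C * E),               # all decision variables are integer
        bounds=Bounds(0, 1),                      # binary via [0, 1] bounds plus integrality
    )
    return result.x.reshape(C, E).round().astype(int) if result.success else None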

5. Unbiased and Stochastic Balancing Strategies

Stochastic routing with strict capacity constraints introduces gradient bias if tokens are naively assigned and excesses are simply dropped or forcibly re-routed. "Unbiased Gradient Estimation with Balanced Assignments for Mixtures of Experts" (Kool et al., 2021) introduces two unbiased estimators:

  • Skip Estimator: Each expert's over-capacity tokens are subsampled uniformly. Remaining tokens are importance-weighted to correct estimator bias. The estimator is

$\hat{g}_{\text{skip}} = \frac{1}{n} \sum_{i=1}^n \delta_i\, \frac{n_{Z_i}}{\min(n_{Z_i}, c)}\, \nabla_{\theta} \log p_{\theta}(Z_i \mid x_i)\, \big(f(x_i, Z_i) - b\big)$

  • Gumbel-Matching Estimator: Assignments are sampled via a capacity-constrained Gumbel-Max procedure, yielding strictly balanced allocations. Marginals are intractable; importance weights are approximated using conditional distributions computed via shortest-path calculations.

Empirical results on a toy regression demonstrate that skip-based estimators closely match unconstrained (i.i.d.-routed) training, whereas naively biased or high-variance estimators fail to solve the task robustly. The skip estimator is computationally efficient ($O(n)$), with low variance compared to the Gumbel-matching estimator.
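
The subsampling-and-reweighting bookkeeping of the skip estimator can be sketched in NumPy as follows; variable names are ours, and the gradient term itself (computed in the surrounding training loop) is omitted.

import numpy as np

def skip_subsample(assignments: np.ndarray, capacity: int, rng: np.random.Generator):
    # assignments: [n] sampled expert index Z_i per token. Returns the keep mask delta_i
    # and importance weights n_{Z_i} / min(n_{Z_i}, c) applied to the surviving tokens.
    n = assignments.shape[0]
    keep = np.zeros(n, dtype=bool)
    weights = np.zeros(n)
    for expert in np.unique(assignments):
        idx = np.flatnonzero(assignments == expert)
        n_e = len(idx)
        kept = idx if n_e <= capacity else rng.choice(idx, size=capacity, replace=False)
        keep[kept] = True
        weights[kept] = n_e / min(n_e, capacity)   # corrects the bias from dropping over-capacity tokens
    return keep, weights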

6. Task Gradient Balancing in Multi-Task BMoE

In multi-task learning, negative transfer arises if one task dominates shared parameter updates or receives disproportionate model capacity. "Modeling Task Relationships in Multi-variate Soft Sensor with Balanced Mixture-of-Experts" (Huang et al., 2023) proposes BMoE comprising:

  • Multi-Gate Mixture-of-Experts (MMoE): Each task has an independent gating network selecting among $K$ experts, enabling flexible positive sharing.
  • Task Gradient Balancing (TGB): Dynamically reweights per-task losses using GradNorm, equalizing the $L_2$ norm of loss gradients (on shared parameters) across tasks. Target gradient norms are calibrated by the relative inverse training rate ($r_i$), so under-trained tasks are upweighted, mitigating negative transfer.

Empirical validation on a sulfur recovery unit soft-sensor, involving two quality targets, reveals that BMoE (MMoE + TGB) achieves lower RMSE and higher $R^2$ than both standard single-task and vanilla multi-task MoE baselines, with learned loss weights robustly converging regardless of initialization.
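
A sketch of one GradNorm-style reweighting step on the shared parameters follows; it is our illustration of the general recipe (per-task losses, a learnable weight vector, and recorded initial losses), not the paper's exact code, and all argument names are assumptions.

import torch

def gradnorm_step(task_losses, loss_weights, initial_losses, shared_param, weight_lr=0.01, alpha=1.5):
    # task_losses: list of scalar losses L_i; loss_weights: 1-D tensor w_i with requires_grad=True;
    # initial_losses: tensor of L_i(0); shared_param: a shared weight tensor on which gradients are measured.
    G = torch.stack([                               # G_i = || grad of (w_i * L_i) w.r.t. shared_param ||_2
        torch.norm(torch.autograd.grad(w * L, shared_param, retain_graph=True, create_graph=True)[0])
        for w, L in zip(loss_weights, task_losses)
    ])
    with torch.no_grad():
        L_hat = torch.stack(task_losses) / initial_losses
        r = L_hat / L_hat.mean()                    # relative inverse training rate r_i
        target = G.mean() * r.pow(alpha)            # treated as a constant target
    gradnorm_loss = (G - target).abs().sum()
    grad_w = torch.autograd.grad(gradnorm_loss, loss_weights)[0]
    with torch.no_grad():
        loss_weights -= weight_lr * grad_w                         # update the task loss weights only
        loss_weights *= len(task_losses) / loss_weights.sum()      # renormalize to sum to the task count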

7. Limitations, Extensions, and Practical Considerations

Loss-Free Balancing requires careful tuning of the bias update rate $\eta$: insufficient $\eta$ leads to slow convergence, whereas excessive $\eta$ can induce oscillatory routing. The current proportional controller may be generalizable: adding integral or derivative terms could suppress steady-state error or accelerate adaptation to nonstationary loads (Wang et al., 2024). Grouped and assignment-constrained approaches, while mathematically appealing, require co-design of expert-to-device mapping and parallelism parameters at system scale (Tang et al., 27 May 2025). In federated and stochastic settings, balancing guarantees typically trade off solution optimality (due to assignment constraints) versus computational cost (e.g., solving ILPs or costly combinatorial relaxations in each round) (Zhang et al., 28 Dec 2025, Kool et al., 2021).
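
As a purely speculative sketch of the generalization suggested above, the sign-based update could be wrapped in a PID-style controller; the class and gain values below are hypothetical and not drawn from the cited papers.

import torch

class PIDBiasController:
    # Hypothetical PID-style extension of the per-expert bias update (speculative sketch).
    def __init__(self, num_experts: int, kp: float = 1e-3, ki: float = 1e-4, kd: float = 1e-3):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = torch.zeros(num_experts)
        self.prev_error = torch.zeros(num_experts)

    def step(self, bias: torch.Tensor, counts: torch.Tensor, target: float) -> torch.Tensor:
        error = target - counts                      # positive when an expert is underloaded
        self.integral += error                       # integral term suppresses steady-state error
        derivative = error - self.prev_error         # derivative term reacts quickly to load shifts
        self.prev_error = error
        return bias + self.kp * error + self.ki * self.integral + self.kd * derivative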

In summary, BMoE approaches (gradient-isolated dynamic biasing, architectural group constraints, assignment-constrained optimization, unbiased stochastic capacity matching, and gradient magnitude balancing) represent a diverse, fast-evolving set of techniques for globally balancing expert utilization in MoE systems. Each targets the load-skew pathologies endemic to large-scale MoE, and collectively they advance scalable, efficient, and robust sparse modeling (Wang et al., 2024, Zhang et al., 28 Dec 2025, Tang et al., 27 May 2025, Huang et al., 2023, Kool et al., 2021).
