Auxiliary-Loss-Free Load Balancing (ALF-LB)
- ALF-LB is a strategy for routing tokens in sparse Mixture-of-Experts models that eliminates auxiliary loss terms by dynamically biasing expert selection.
- The method updates a per-expert bias from measured token-assignment counts, ensuring near-optimal load balance and maximizing resource utilization.
- Empirical results show that ALF-LB outperforms traditional auxiliary-loss approaches with reduced perplexity and improved throughput in large-scale models.
Auxiliary-Loss-Free Load Balancing (ALF-LB) is a strategy for routing tokens in large-scale sparse Mixture-of-Experts (s-MoE) architectures without introducing auxiliary gradient noise or compromising model performance. By dynamically biasing expert selection prior to routing, ALF-LB achieves near-optimal load balance, eliminates conflicting training signals, and maximizes resource utilization in distributed environments. The method is theoretically grounded and empirically validated for both routing stability and computational efficiency, notably advancing MoE deployment at scale (Wang et al., 28 Aug 2024, Han et al., 3 Dec 2025).
1. Background: Load Balancing in Mixture-of-Experts Architectures
In MoE models, each input token $t$ (with hidden state $u_t$) is scored against learned expert centroids $e_i$ through a gating function $G$, producing scores $s_{i,t} = G(u_t^{\top} e_i)$. The standard top-$K$ routing mechanism selects the $K$ experts with the highest scores for each token:

$$g_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} \in \operatorname{TopK}\big(\{s_{j,t}\}_{j=1}^{E},\, K\big) \\ 0, & \text{otherwise.} \end{cases}$$

The MoE output for each token is:

$$h_t = u_t + \sum_{i=1}^{E} g_{i,t}\, \mathrm{FFN}_i(u_t).$$
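For concreteness, here is a minimal NumPy sketch of this standard top-$K$ routing; the sigmoid gate and all shapes are illustrative assumptions, and the expert-FFN combination step is omitted.

```python
import numpy as np

def moe_route(u, centroids, K):
    """Top-K routing as in the formulas above.

    u:         (T, d) token hidden states
    centroids: (E, d) learned expert centroids e_i
    Returns g of shape (T, E): gating weights with K nonzeros per row.
    """
    s = 1.0 / (1.0 + np.exp(-(u @ centroids.T)))      # (T, E) sigmoid gating scores
    topk = np.argpartition(-s, K - 1, axis=1)[:, :K]  # indices of the K highest scores
    g = np.zeros_like(s)
    rows = np.arange(u.shape[0])[:, None]
    g[rows, topk] = s[rows, topk]                     # keep s only where selected
    return g
```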
Imbalanced expert loads can result in routing collapse, wasted computation (idle experts), and inefficient hardware utilization. Traditional approaches mitigate this imbalance by introducing an auxiliary load-balance loss, typically of the form

$$\mathcal{L}_{\mathrm{bal}} = \alpha \sum_{i=1}^{E} f_i\, P_i,$$

where $f_i$ measures how frequently expert $i$ is assigned tokens and $P_i$ aggregates its gating statistics (e.g., its mean gating score). However, this $\alpha$-weighted term injects interference gradients into the training process and requires hyperparameter tuning that trades off balance against final perplexity (Wang et al., 28 Aug 2024).
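For reference, one common instantiation of such a loss is the Switch/GShard-style product of assignment frequency and mean gating score; the scaling by $E$ below is a convention from that line of work, not necessarily the cited papers' exact form.

```python
import numpy as np

def aux_balance_loss(g, s, alpha):
    """Auxiliary balance loss of the form alpha * E * sum_i f_i * P_i.

    g: (T, E) sparse gating weights (nonzero where a token was routed)
    s: (T, E) dense gating scores
    """
    T, E = g.shape
    f = (g > 0).mean(axis=0)   # f_i: fraction of tokens assigned to expert i
    P = s.mean(axis=0)         # P_i: mean gating score of expert i
    return alpha * E * float((f * P).sum())
```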
2. ALF-LB Methodology: Expert-Wise Biasing and Feedback Control
ALF-LB obviates the need for auxiliary losses by introducing a lightweight, expert-wise bias $b_i$ to each expert's routing score before the top-$K$ selection. The modified score is

$$\tilde{s}_{i,t} = s_{i,t} + b_i,$$

and routing proceeds by selecting the top-$K$ experts according to $\tilde{s}_{i,t}$. Crucially, $b_i$ does not affect the output weight $g_{i,t}$ (which still uses the unbiased $s_{i,t}$) and resides outside the autograd computational graph, preventing any gradient leakage through the bias update.
The bias is updated every batch via measured assignment counts $c_i$, where $T$ is the batch size in tokens. With ideal per-expert load $\bar{c}_i = KT/E$ and violation $e_i = \bar{c}_i - c_i$, the bias follows an additive update:

$$b_i \leftarrow b_i + u \cdot \operatorname{sign}(e_i),$$

where $u$ is a small constant update rate (e.g., $u = 10^{-3}$). A proportional variant ($b_i \leftarrow b_i + u \cdot e_i$) was tested but found less effective.
Key algorithm steps are:
| Step | Description | Implementation Scope |
|---|---|---|
| Forward routing | Score each expert ($s_{i,t}$) + apply bias ($b_i$) | Per token in each batch |
| Top-$K$ selection | Choose the $K$ experts per token with highest $s_{i,t} + b_i$ | Per token |
| Main loss update | Backprop with respect to model parameters via $g_{i,t}$, not $b_i$ | Model parameters only |
| Bias update | Measure load $c_i$, update $b_i$ per expert via additive rule | Outside autograd (no gradient leakage) |
Pseudocode for the batch loop, using the additive update rule, is provided in (Wang et al., 28 Aug 2024); a minimal sketch follows.
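This sketch uses the notation of this section, with sigmoid scores assumed as before; it is an illustration, not the paper's verbatim pseudocode.

```python
import numpy as np

def alf_lb_step(s, b, K, u=1e-3):
    """One ALF-LB batch step: biased top-K selection, then additive bias update.

    s: (T, E) gating scores for the current batch
    b: (E,)  running expert-wise bias (kept outside autograd)
    """
    T, E = s.shape
    biased = s + b                                      # bias enters selection only
    topk = np.argpartition(-biased, K - 1, axis=1)[:, :K]
    g = np.zeros_like(s)
    rows = np.arange(T)[:, None]
    g[rows, topk] = s[rows, topk]                       # output weights use unbiased s

    c = (g > 0).sum(axis=0)                             # measured per-expert loads c_i
    e = K * T / E - c                                   # violation e_i = c_bar - c_i
    b = b + u * np.sign(e)                              # additive sign update
    return g, b
```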
3. Theoretical Analysis and Primal-Dual Interpretation
The ALF-LB procedure admits rigorous analysis as a primal-dual method for an integer assignment problem: route $T$ tokens to $E$ experts, each token to $K$ experts, with perfectly balanced load $KT/E$ per expert (Han et al., 3 Dec 2025). Relaxing the indicator variables $x_{t,i} \in \{0,1\}$, the Lagrangian formulation is:

$$\mathcal{L}(x, b) = \sum_{t=1}^{T} \sum_{i=1}^{E} x_{t,i}\, a_{t,i} + \sum_{i=1}^{E} b_i \left( \sum_{t=1}^{T} x_{t,i} - \frac{KT}{E} \right),$$

where $b_i$ serves as an expert "shift" or bias and $a_{t,i}$ is a token–expert affinity. The update cycle comprises:
- Dual (shift) update: $b_i \leftarrow b_i + \eta\,(\frac{KT}{E} - c_i)$, or its sign variant, raising the shift of underloaded experts.
- Primal (assignment) update: route token $t$ to its top-$K$ experts under $a_{t,i} + b_i$.
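The cycle can be exercised on a toy instance. The following simulation is illustrative only: fixed affinities, no learning, and sign conventions and step size chosen here to match the formulation above.

```python
import numpy as np

# Toy primal-dual balancing on fixed affinities (illustrative conventions).
rng = np.random.default_rng(0)
T, E, K, eta = 512, 8, 2, 0.05
a = rng.normal(size=(T, E))    # token-expert affinities a_{t,i}
b = np.zeros(E)                # dual shifts b_i

for _ in range(200):
    # Primal: each token picks its top-K experts under a_{t,i} + b_i.
    topk = np.argpartition(-(a + b), K - 1, axis=1)[:, :K]
    c = np.bincount(topk.ravel(), minlength=E)     # realized loads c_i
    # Dual: raise shifts of underloaded experts, lower overloaded ones.
    b += eta * (K * T / E - c) / T

c_bar = K * T / E
print("MaxVio:", (c.max() - c_bar) / c_bar)  # should end up small (loads near 128)
```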
Key structural properties:
- Monotonic Lagrangian improvement: The Lagrangian objective decreases over each iteration as long as squared imbalance dominates aggregate switching benefit.
- Preference rule: Tokens always move from overloaded to underloaded experts when switches occur, and the benefit per switch is bounded.
- Approximate balancing guarantee: loads converge to within an additive constant of the ideal load $KT/E$; for large batch sizes, the relative imbalance is negligible.
- Strong convexity and online learning: In stochastic environments, the dual objective is strongly convex and admits a logarithmic regret bound with diminishing step sizes.
4. Empirical Validation and Performance Metrics
ALF-LB has been empirically evaluated on DeepSeekMoE models with up to 3B parameters trained on multilingual corpora. Notable experimental details (Wang et al., 28 Aug 2024, Han et al., 3 Dec 2025):
- Datasets: WikiText-103, multilingual web data, code, and literature; 32K BPE vocabulary.
- MoE sizes: 1B–3B parameters with sparse expert layers and top-$K$ selection.
- Batch sizes: 1,152 (1B model), 1,728 (3B model).
- Comparison: auxiliary-loss balancing ($\alpha$-weighted), ALF-LB (additive sign update), and no balancing.
Load balance is quantified by the maximal load violation, $\mathrm{MaxVio} = (\max_i c_i - \bar{c}_i)/\bar{c}_i$, and the idle-expert rate. Main findings:
| Model Size | Method | Perplexity | MaxVio_global | Idle Rate | GPU Utilization | Throughput |
|---|---|---|---|---|---|---|
| 1B | Aux-Loss | 9.56 | 0.72 | 3.1% | 96.9% | 1.07×10⁶ tokens/s |
| 1B | ALF-LB | 9.50 | 0.04 | 1.1%–1.4% | 98.6%–98.9% | 1.09–1.10×10⁶ tokens/s |
| 3B | Aux-Loss | 7.97 | 0.52 | — | — | — |
| 3B | ALF-LB | 7.92 | 0.04 | — | — | — |
ALF-LB reduces the idle-expert rate from about 10% to 1–1.5% relative to no balancing, improving utilization by 8–9 percentage points and throughput by 8–10%. Perplexity and load balance improve markedly over both the unbalanced baseline and the auxiliary-loss method.
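Both balance metrics are straightforward to compute from per-expert assignment counts; a minimal sketch follows (the exact normalization in the cited papers may differ slightly).

```python
import numpy as np

def balance_metrics(c, K, T, E):
    """MaxVio and idle rate from per-expert assignment counts c, shape (E,)."""
    ideal = K * T / E                       # balanced per-expert load
    max_vio = (c.max() - ideal) / ideal     # maximal load violation
    idle_rate = float((c == 0).mean())      # fraction of experts with no tokens
    return max_vio, idle_rate
```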
5. Algorithmic Ablations and Practical Recommendations
Experimental ablations indicate:
- Bias update rate ($u$): $u = 10^{-3}$ yields the best convergence; smaller rates (e.g., $10^{-4}$) slow initial balancing, while larger ones (e.g., $10^{-2}$) induce late-stage oscillations.
- Update rule: the additive sign rule ($b_i \leftarrow b_i + u \cdot \operatorname{sign}(e_i)$) consistently outperforms the proportional variant ($b_i \leftarrow b_i + u \cdot e_i$).
- Bias form: the additive bias ($s_{i,t} + b_i$) surpasses the multiplicative form ($s_{i,t} \cdot b_i$) in perplexity, with equivalent balance.
- Gating variants: both sigmoid and softmax gates benefit from ALF-LB.
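As a compact reference for these ablations, the four update/bias forms compared above can be written side by side; shapes follow the earlier sketches and the helper names are illustrative.

```python
import numpy as np

# The four ablated forms, side by side (helper names are illustrative):
def update_sign(b, c, c_bar, u=1e-3):          # additive sign rule: best overall
    return b + u * np.sign(c_bar - c)

def update_proportional(b, c, c_bar, u=1e-3):  # proportional to violation: weaker
    return b + u * (c_bar - c)

def bias_additive(s, b):                       # added to scores: preferred form
    return s + b

def bias_multiplicative(s, b):                 # multiplied into scores: worse ppl
    return s * b
```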
Implementation guidelines (Han et al., 3 Dec 2025):
- Maintain the per-expert bias vector $b \in \mathbb{R}^{E}$ in GPU memory; it is updated by a single scalar per expert per batch.
- Step size: a constant rate (e.g., $u = 10^{-3}$) gives rapid response; a diminishing schedule (e.g., $\eta_t \propto 1/t$) gives the strongest theoretical guarantees.
- Communication cost is minimal (one length-$E$ vector broadcast per MoE layer per batch).
- The step size should be tuned via grid search (e.g., over $10^{-4}$ to $10^{-2}$); a constant rate near $10^{-3}$ is robust for dynamic workloads.
- For multi-node sharding, bias synchronization may be required; see the sketch after this list. Sensitivity increases if the number of tokens per batch approaches the expert count $E$.
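For the multi-node case, the natural pattern is to all-reduce the per-expert counts and apply the same update on every rank. Below is a sketch assuming a data-parallel setup with an initialized torch.distributed process group; the function and argument names are illustrative.

```python
import torch
import torch.distributed as dist

def sync_and_update_bias(local_counts, b, K, tokens_per_rank, E, u=1e-3):
    """Aggregate expert loads across data-parallel ranks, then apply the
    additive sign update identically on every rank."""
    counts = local_counts.clone().float()
    dist.all_reduce(counts, op=dist.ReduceOp.SUM)    # global per-expert loads c_i
    total_tokens = tokens_per_rank * dist.get_world_size()
    c_bar = K * total_tokens / E                     # ideal balanced load
    b += u * torch.sign(c_bar - counts)              # same rule as in Section 2
    return b
```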
6. Context, Limitations, and Theoretical Implications
The theoretical framework interprets ALF-LB as a projected online gradient update on the dual bias vector. The method provably moves tokens away from overloaded experts, monotonically reduces imbalance, and achieves a logarithmic regret rate with diminishing step sizes. The additive bound on per-expert imbalance is tight in deterministic settings. ALF-LB is simple to integrate into MoE architectures and imposes negligible overhead.
Limitations noted:
- Synchronization overhead may arise with many expert nodes.
- Careful step-size tuning may be required when the number of tokens per batch becomes comparable to the number of experts.
This suggests ALF-LB is theoretically sound, practically robust, and highly scalable, making it a preferred load-balancing methodology for large-scale s-MoE systems when avoiding auxiliary gradient noise is paramount (Wang et al., 28 Aug 2024, Han et al., 3 Dec 2025).