
Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models

Published 21 Jan 2025 in cs.LG and cs.CL | (2501.11873v2)

Abstract: This paper revisits the implementation of $\textbf{L}$oad-$\textbf{b}$alancing $\textbf{L}$oss (LBL) when training Mixture-of-Experts (MoE) models. Specifically, LBL for MoEs is defined as $N_E \sum_{i=1}^{N_E} f_i p_i$, where $N_E$ is the total number of experts, $f_i$ represents the frequency of expert $i$ being selected, and $p_i$ denotes the average gating score of expert $i$. Existing MoE training frameworks usually employ a parallel training strategy so that $f_i$ and the LBL are calculated within a $\textbf{micro-batch}$ and then averaged across parallel groups. In essence, a micro-batch for training billion-scale LLMs normally contains very few sequences, so the micro-batch LBL is almost at the sequence level, and the router is pushed to distribute tokens evenly within each sequence. Under this strict constraint, even tokens from a domain-specific sequence ($\textit{e.g.}$, code) are uniformly routed to all experts, thereby inhibiting expert specialization. In this work, we propose calculating LBL using a $\textbf{global-batch}$ to loosen this constraint. Because a global-batch contains much more diverse sequences than a micro-batch, this encourages load balance at the corpus level. Specifically, we introduce an extra communication step to synchronize $f_i$ across micro-batches and then use it to calculate the LBL. Through experiments on training MoE-based LLMs (up to $\textbf{42.8B}$ total parameters and $\textbf{400B}$ tokens), we surprisingly find that the global-batch LBL strategy yields excellent performance gains in both pre-training perplexity and downstream tasks. Our analysis reveals that the global-batch LBL also greatly improves the domain specialization of MoE experts.

Summary

  • The paper redefines load-balancing loss by aggregating expert selection counts over global batches instead of micro-batches, preventing uniform token routing in domain-specific sequences.
  • It introduces global synchronization and buffering techniques that reduce pre-training perplexity and enhance downstream zero-shot performance across various MoE models.
  • The approach fosters clearer expert specialization and interpretable routing behaviors with minimal computational overhead, marking a significant improvement in MoE training.

The paper presents a technical investigation into the computation of the Load-balancing Loss (LBL) in Mixture-of-Experts (MoE) training, pinpointing a critical shortcoming in the conventional micro-batch implementation. The work rigorously argues that when LBL is computed at the micro-batch level—where each micro-batch comprises only a handful of sequences—the balancing constraint becomes overly strict. In such cases, the router is forced to distribute tokens uniformly even within domain-specific sequences, which in turn inhibits the natural specialization of experts.
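For concreteness, below is a minimal PyTorch-style sketch of the standard micro-batch LBL, following the $N_E \sum_{i=1}^{N_E} f_i p_i$ definition quoted in the abstract. The function name, tensor shapes, and the top-k normalization convention are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def micro_batch_lbl(router_logits: torch.Tensor, top_k: int) -> torch.Tensor:
    """Standard load-balancing loss N_E * sum_i f_i * p_i over one micro-batch.

    router_logits: [num_tokens, num_experts] raw router outputs for the tokens
    of a single micro-batch (shape and top-k normalization are illustrative).
    """
    num_tokens, num_experts = router_logits.shape
    gates = F.softmax(router_logits, dim=-1)         # routing probabilities
    topk_idx = gates.topk(top_k, dim=-1).indices     # experts selected per token

    # f_i: fraction of (token, expert) assignments that land on expert i.
    counts = torch.bincount(topk_idx.reshape(-1), minlength=num_experts).float()
    f = counts / (num_tokens * top_k)

    # p_i: average gating score of expert i over this micro-batch.
    p = gates.mean(dim=0)

    # Gradients flow through p (the soft scores); f is a hard count.
    return num_experts * torch.sum(f * p)
```

Because `num_tokens` here comes from only a handful of sequences, driving this loss toward its minimum forces near-uniform expert usage within each sequence, which is exactly the constraint the paper identifies as harmful.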

The key contributions and findings can be summarized as follows:

  • Redefinition of LBL Computation: The standard LBL, defined as $N_E \sum_{i=1}^{N_E} f_i p_i$, where $N_E$ is the total number of experts, $f_i$ is the fraction of tokens routed to expert $i$, and $p_i$ is the average gating score of expert $i$, is typically computed on a per micro-batch basis. The paper demonstrates that such micro-batch LBL enforces sequence-level uniformity in expert selection, thus impeding domain-level specialization. To overcome this, the authors propose synchronizing the expert selection frequency $f_i$ across all parallel groups to compute a global-batch LBL. In practice, this means aggregating the token counts from multiple micro-batches (or gradient accumulation steps) so that the LBL is calculated over a much larger and more heterogeneous token set.
  • Synchronization and Buffering Mechanisms:
    • Global Synchronization: Expert selection frequencies are exchanged among parallel groups so that the LBL is computed with global statistics rather than isolated micro-batch statistics.
    • Buffering for Limited Compute Resources: When the total number of tokens per training step is restricted by limited hardware (i.e., when the aggregated micro-batches do not reach the desired global batch size), a buffer accumulates the expert selection counts across gradient accumulation steps, approximating the effect of a larger global batch without significant communication overhead (reported as less than 3% overall); a minimal sketch of this synchronization-plus-buffering scheme appears after this list.
  • Experimental Results:
    • Experiments cover MoE models scaled to 3.4B total parameters (0.6B active), 15B (2.54B active), and 43B (6.6B active).
    • Experiments are conducted under different global batch size regimes (e.g., 512 versus 1024) and various Balance Batch Sizes (Balance BSZ). Increasing the Balance BSZ consistently yields:
      • improved pre-training perplexity (PPL decreases by approximately 0.1), and
      • enhanced downstream zero-shot performance on benchmarks such as HellaSwag, MMLU, GSM8K, and C-Eval, with some tasks improving by around 2 points.
    • A detailed ablation study reveals that when tokens are aggregated via synchronization (or even a shuffled sampling that mimics global distribution), the performance is significantly better than when relying solely on micro-batch statistics. This confirms that domain diversification in LBL calculation is a critical driver of both performance enhancement and expert specialization.
    • Furthermore, dynamic switching experiments—changing the Balance BSZ during training—underscore that an early, strict micro-batch balancing can irreversibly lock the router into suboptimal behavior, while a transition to global-batch LBL later in training yields limited recovery.
  • Expert Specialization Analysis:
    • In the global-batch setup, high-frequency experts emerge for specific domains (with selection frequencies exceeding 0.2), whereas under micro-batch LBL, the distribution remains nearly uniform across experts.
    • The top-k sum of routing scores (an indicator of the gating network's confidence) is higher under the global-batch regime, suggesting a closer alignment between routing decisions and the language modeling objective.
  • Computational Efficiency: Although global synchronization incurs a minor overhead (on the order of 5–6% slower per iteration in the largest model settings), this latency is largely mitigated by overlapping the LBL computation with other network operations. Additionally, introducing a small micro-batch balancing term alongside the global computation helps reduce local imbalances with negligible performance loss.
  • Limitations and Future Directions: The investigation primarily focuses on pre-training scenarios in language modeling and does not extend to fine-tuning or other modalities such as computer vision and multi-modal tasks. The specialization analysis is based on expert selection frequency without further downstream validation. Future work may explore methods to incorporate more diverse sequences within individual micro-batches and extend the approach to other domains.
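As a concrete illustration of the synchronization-and-buffering scheme referenced in the list above, the sketch below all-reduces per-expert assignment counts across data-parallel ranks and accumulates them over gradient-accumulation steps, so that $f_i$ approximates global-batch statistics while $p_i$ stays local (and differentiable). The class and method names, the use of the default process group, and the reset convention are assumptions for illustration, not the paper's released code.

```python
import torch
import torch.distributed as dist

class GlobalBatchLBL:
    """Approximate global-batch load-balancing loss (illustrative sketch).

    Per-expert assignment counts are (1) summed across parallel ranks with
    all_reduce and (2) accumulated in a buffer across gradient-accumulation
    steps, so f_i reflects a much larger Balance BSZ than one micro-batch.
    """

    def __init__(self, num_experts: int, device: torch.device):
        self.num_experts = num_experts
        self.count_buffer = torch.zeros(num_experts, device=device)
        self.assignment_buffer = 0.0

    def __call__(self, gates: torch.Tensor, topk_idx: torch.Tensor) -> torch.Tensor:
        # Local per-expert counts for this micro-batch.
        counts = torch.bincount(topk_idx.reshape(-1),
                                minlength=self.num_experts).float()
        total = torch.tensor(float(topk_idx.numel()), device=gates.device)

        # Extra communication step: synchronize counts across parallel groups.
        if dist.is_available() and dist.is_initialized():
            dist.all_reduce(counts)   # sums counts over ranks
            dist.all_reduce(total)

        # Buffering: earlier accumulation steps keep contributing to f_i.
        self.count_buffer += counts
        self.assignment_buffer += total.item()

        f = self.count_buffer / max(self.assignment_buffer, 1.0)  # global-ish f_i (no grad)
        p = gates.mean(dim=0)                                     # local p_i (carries grad)
        return self.num_experts * torch.sum(f * p)

    def reset(self):
        # Call after each optimizer step to start a new global batch.
        self.count_buffer.zero_()
        self.assignment_buffer = 0.0
```

A small micro-batch balancing term (e.g., the `micro_batch_lbl` sketch above, scaled by a small coefficient) can be added on top, mirroring the paper's observation that this mitigates local imbalance at negligible cost.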

Overall, the paper offers a rigorous analysis and a well-motivated modification to MoE training that addresses an overlooked limitation in LBL computation. By shifting from a micro-batch to a global-batch perspective, the approach not only improves performance metrics and decreases perplexity but also fosters clearer expert specialization across domains.
