- The paper introduces Loss-Free Balancing, a method that dynamically adjusts expert biases to evenly distribute loads without auxiliary losses.
- It demonstrates significant improvements in load balance and lower perplexity through experiments on MoE models with up to 3 billion parameters.
- The approach maintains causal constraints and avoids gradient interference, providing an efficient framework for scaling large language models.
Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
Abstract
Mixture-of-Experts (MoE) models scale up the parameter counts of LLMs while keeping computational costs manageable. The paper presents a novel strategy, termed Loss-Free Balancing, that maintains load balance in MoE models without the auxiliary losses commonly used in existing methods. The approach applies a dynamic, expert-wise bias to routing scores, keeping expert loads balanced without introducing interference gradients that harm model performance. The paper demonstrates that Loss-Free Balancing achieves both better performance and better load balance than traditional auxiliary-loss-controlled methods, validated through extensive experiments on MoE models with up to 3 billion parameters.
Introduction
Mixture-of-Experts architectures represent a significant development for scaling LLMs, particularly within Transformer-based models. They allow model parameters to grow without a proportional increase in computational cost.
However, MoE models face challenges related to load imbalance among experts. Load imbalance can lead to routing collapse, where only a few experts are consistently selected, or inflate computational overhead when the workload is unevenly distributed across devices.
Traditional methods address load imbalance with auxiliary losses that encourage a more even distribution of the load. Yet these auxiliary losses add undesired gradients to training, conflicting with the primary language-modeling objective and impairing model performance. This paper proposes a novel approach, Loss-Free Balancing, which dynamically adjusts routing scores without introducing these interference gradients.
Background
Current MoE models leverage Top-K routing to select a limited number of experts for processing each token. Load imbalance in these selections can lead to significant performance degradation. Traditional approaches incorporate auxiliary losses, designed to encourage an even distribution of the load among experts. However, these strategies introduce additional gradients that interfere with the primary task gradients, complicating the training process.
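To make the conventional recipe concrete, below is a minimal PyTorch sketch of Top-K routing paired with a switch-style auxiliary balance loss. The function name, the alpha coefficient, and the exact normalization are illustrative assumptions rather than the paper's implementation; the point is that the aux_loss term contributes its own gradient to the router.

```python
import torch
import torch.nn.functional as F

def topk_route_with_aux_loss(scores: torch.Tensor, k: int, alpha: float):
    """Top-K routing with a switch-style auxiliary balance loss (sketch).

    scores: [num_tokens, num_experts] gating scores for each token.
    """
    num_tokens, num_experts = scores.shape
    gate_vals, expert_idx = scores.topk(k, dim=-1)            # [T, k]

    # f_i: fraction of tokens dispatched to expert i, scaled so that a
    # perfectly uniform assignment gives f_i = 1 (counts, no gradient).
    dispatch = F.one_hot(expert_idx, num_experts).sum(dim=1)  # [T, E]
    f = dispatch.float().mean(dim=0) * num_experts / k

    # p_i: mean routing score of expert i. This term carries the gradient
    # that pushes the router toward balance -- and interferes with the
    # language-modeling gradient.
    p = scores.mean(dim=0)

    aux_loss = alpha * (f * p).sum()
    return expert_idx, gate_vals, aux_loss
```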
The main challenge, as identified in the paper, is the trade-off between maintaining load balance and preserving model performance: larger auxiliary loss coefficients improve load balance, but at the cost of language-modeling quality.
Loss-Free Balancing Strategy
The Loss-Free Balancing strategy aims to maintain load balance through an iterative process involving bias adjustments rather than auxiliary loss. Before making top-K routing decisions, an expert-wise bias is applied to each expert’s routing score. These biases are updated dynamically based on the experts' recent loads: biases for heavily loaded experts are decreased, while those for lightly loaded experts are increased.
This dynamic bias adjustment steers the routing toward an even distribution of the load across experts. Crucially, the bias affects only which experts are selected; the gate values used to weight expert outputs are still derived from the original scores, so the process introduces no gradients beyond those of the language-modeling objective and preserves the integrity of the primary training task.
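The algorithm thus reduces to two pieces: bias the scores only for Top-K selection, and nudge each bias by the sign of that expert's load error. Here is a minimal sketch of this idea; the class name and the default update rate are assumptions, while the sign-based update follows the paper's description.

```python
import torch

class LossFreeBalancer:
    """Sketch of Loss-Free Balancing's expert-wise bias (assumptions noted above)."""

    def __init__(self, num_experts: int, update_rate: float = 1e-3):
        # Non-trainable per-expert bias, updated outside autograd.
        self.bias = torch.zeros(num_experts)
        self.u = update_rate  # bias step size; a small hyperparameter

    def route(self, scores: torch.Tensor, k: int):
        # Bias the scores only to pick the Top-K experts...
        _, expert_idx = (scores + self.bias).topk(k, dim=-1)   # [T, k]
        # ...but weight expert outputs with the original, unbiased scores.
        gate_vals = scores.gather(-1, expert_idx)
        return expert_idx, gate_vals

    @torch.no_grad()
    def update(self, expert_idx: torch.Tensor):
        # Count tokens routed to each expert in the current batch.
        num_experts = self.bias.numel()
        load = torch.bincount(expert_idx.flatten(), minlength=num_experts).float()
        # Nudge under-loaded experts up and over-loaded experts down.
        self.bias += self.u * torch.sign(load.mean() - load)
```

Because the bias is a plain buffer updated by a sign rule rather than a trained parameter, the backward pass sees only the language-modeling loss, which is exactly the "no interference gradients" property the paper emphasizes.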
Experimental Setup
- Model Architecture: The experiments utilize the DeepSeekMoE architecture, with modifications including the use of sigmoid gates instead of softmax due to better baseline performance. Two model sizes are evaluated: 1 billion parameters (1B) and 3 billion parameters (3B).
- Training Settings: The models are trained from scratch on 100 billion and 200 billion tokens for the 1B and 3B models, respectively, using a cosine learning rate scheduler with warmup.
- Metrics: Performance is measured using perplexity on a validation set. Load balance is assessed with maximal violation, MaxVio = (max_i Load_i − mean Load) / mean Load, which quantifies how far the busiest expert deviates from perfect balance; a small sketch of this metric follows the list.
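As a quick reference, here is a one-function sketch of MaxVio; the tensor shape is an assumption, while the formula is the paper's definition.

```python
import torch

def max_violation(load: torch.Tensor) -> float:
    """MaxVio = (max_i Load_i - mean(Load)) / mean(Load).

    `load` holds the number of tokens routed to each expert; 0 means
    perfect balance, and larger values mean worse imbalance.
    """
    load = load.float()
    return ((load.max() - load.mean()) / load.mean()).item()
```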
Results
Empirical results indicate that Loss-Free Balancing, when compared to auxiliary-loss-controlled methods, achieves:
- Lower Perplexity: Indicates better model performance.
- Better Load Balance: Demonstrates more even distribution of computational loads among experts.
The method sustained consistently good balance throughout training, avoiding the higher imbalance observed with auxiliary-loss methods. Additionally, the approach is naturally compatible with expert parallelism, a crucial feature for training extremely large MoE models.
Discussion
The paper compares Loss-Free Balancing with other load-balancing methods, particularly Expert Choice (EC). While EC guarantees perfect load balance, it violates the causal constraint of language modeling, allowing future-token leakage. This leakage compromises the model's generalization ability and the reliability of performance evaluations.
Loss-Free Balancing avoids this pitfall by adhering strictly to causal constraints, ensuring no leakage and maintaining model integrity. The superior performance and balanced load demonstrate the efficacy of the proposed approach, providing a robust framework for scaling MoE models efficiently.
Conclusion
Loss-Free Balancing is a significant advance in load balancing for Mixture-of-Experts architectures. By eliminating auxiliary losses and dynamically adjusting expert-wise biases, it delivers improved model performance alongside a balanced distribution of computational load. The method offers a promising path for scaling up MoE models while preserving training efficiency and model integrity. Future work might further optimize the bias update mechanism and explore applications to even larger models.