- The paper introduces Loss-Free Balancing, a method that dynamically adjusts expert biases to evenly distribute loads without auxiliary losses.
- It demonstrates significant improvements in load balance and lower perplexity through experiments on MoE models with up to 3 billion parameters.
- The approach maintains causal constraints and avoids gradient interference, providing an efficient framework for scaling large language models.
Auxiliary-Loss-Free Load Balancing Strategy for Mixture-of-Experts
Abstract
Mixture-of-Experts (MoE) models scale up the parameter counts of LLMs while keeping computational costs manageable. The paper presents a novel strategy, termed Loss-Free Balancing, that maintains load balance in MoE models without the auxiliary losses commonly used in existing methods. The approach applies a dynamic, expert-wise bias to routing scores, keeping expert loads balanced without introducing interference gradients that harm model performance. The paper demonstrates that Loss-Free Balancing achieves both better performance and better load balance than traditional auxiliary-loss-controlled methods, validated through extensive experiments on MoE models with up to 3 billion parameters.
Introduction
Mixture-of-Experts architectures represent a significant development for scaling LLMs, particularly within Transformer-based models. They allow model parameters to grow without a proportional increase in computational cost.
However, MoE models face challenges related to load imbalance among experts. Load imbalance can lead to routing collapse, where only a few experts are consistently selected, or inflate computational overhead when the workload is unevenly distributed across devices.
Traditional methods address load imbalance with auxiliary losses that encourage a more even distribution of the load. Yet these auxiliary losses add undesired gradients to training, conflicting with the primary language-modeling objective and impairing model performance. This paper proposes a novel approach, Loss-Free Balancing, which dynamically adjusts routing scores without introducing these interference gradients.
Background
Current MoE models leverage Top-K routing to select a limited number of experts for processing each token. Load imbalance in these selections can lead to significant performance degradation. Traditional approaches incorporate auxiliary losses, designed to encourage an even distribution of the load among experts. However, these strategies introduce additional gradients that interfere with the primary task gradients, complicating the training process.
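To make the conventional recipe concrete, below is a minimal PyTorch sketch of Top-K routing paired with a switch-style auxiliary balance loss. The function name, the alpha coefficient, and the exact normalization are illustrative assumptions rather than the paper's implementation; the point is that the aux_loss term contributes its own gradient to the router.

```python
import torch
import torch.nn.functional as F

def topk_route_with_aux_loss(scores: torch.Tensor, k: int, alpha: float):
    """Top-K routing with a switch-style auxiliary balance loss (sketch).

    scores: [num_tokens, num_experts] gating scores for each token.
    """
    num_tokens, num_experts = scores.shape
    gate_vals, expert_idx = scores.topk(k, dim=-1)            # [T, k]

    # f_i: fraction of tokens dispatched to expert i, scaled so that a
    # perfectly uniform assignment gives f_i = 1 (counts, no gradient).
    dispatch = F.one_hot(expert_idx, num_experts).sum(dim=1)  # [T, E]
    f = dispatch.float().mean(dim=0) * num_experts / k

    # p_i: mean routing score of expert i. This term carries the gradient
    # that pushes the router toward balance -- and interferes with the
    # language-modeling gradient.
    p = scores.mean(dim=0)

    aux_loss = alpha * (f * p).sum()
    return expert_idx, gate_vals, aux_loss
```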
The main challenge, as identified in the paper, is the trade-off between maintaining load balance and preserving model performance: larger auxiliary loss coefficients improve load balance, but at the cost of language-modeling quality.
Loss-Free Balancing Strategy
The Loss-Free Balancing strategy aims to maintain load balance through an iterative process involving bias adjustments rather than auxiliary loss. Before making top-K routing decisions, an expert-wise bias is applied to each expert’s routing score. These biases are updated dynamically based on the experts' recent loads: biases for heavily loaded experts are decreased, while those for lightly loaded experts are increased.
This dynamic bias adjustment steers the routing toward an even distribution of the load across experts. Crucially, the bias affects only which experts are selected; the gate values used to weight expert outputs are still derived from the original scores, so the process introduces no gradients beyond those of the language-modeling objective and preserves the integrity of the primary training task.
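The algorithm thus reduces to two pieces: bias the scores only for Top-K selection, and nudge each bias by the sign of that expert's load error. Here is a minimal sketch of this idea; the class name and the default update rate are assumptions, while the sign-based update follows the paper's description.

```python
import torch

class LossFreeBalancer:
    """Sketch of Loss-Free Balancing's expert-wise bias (assumptions noted above)."""

    def __init__(self, num_experts: int, update_rate: float = 1e-3):
        # Non-trainable per-expert bias, updated outside autograd.
        self.bias = torch.zeros(num_experts)
        self.u = update_rate  # bias step size; a small hyperparameter

    def route(self, scores: torch.Tensor, k: int):
        # Bias the scores only to pick the Top-K experts...
        _, expert_idx = (scores + self.bias).topk(k, dim=-1)   # [T, k]
        # ...but weight expert outputs with the original, unbiased scores.
        gate_vals = scores.gather(-1, expert_idx)
        return expert_idx, gate_vals

    @torch.no_grad()
    def update(self, expert_idx: torch.Tensor):
        # Count tokens routed to each expert in the current batch.
        num_experts = self.bias.numel()
        load = torch.bincount(expert_idx.flatten(), minlength=num_experts).float()
        # Nudge under-loaded experts up and over-loaded experts down.
        self.bias += self.u * torch.sign(load.mean() - load)
```

Because the bias is a plain buffer updated by a sign rule rather than a trained parameter, the backward pass sees only the language-modeling loss, which is exactly the "no interference gradients" property the paper emphasizes.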
Experimental Setup
- Model Architecture: The experiments utilize the DeepSeekMoE architecture, with modifications including the use of sigmoid gates instead of softmax due to better baseline performance. Two model sizes are evaluated: 1 billion parameters (1B) and 3 billion parameters (3B).
- Training Settings: The models are trained from scratch on 100 billion and 200 billion tokens for the 1B and 3B models, respectively, using a cosine learning rate scheduler with warmup.
- Metrics: Performance is measured using perplexity on a validation set. Load balance is assessed with maximal violation, MaxVio = (max_i Load_i − mean Load) / mean Load, which quantifies how far the busiest expert deviates from perfect balance; a small sketch of this metric follows the list.
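As a quick reference, here is a one-function sketch of MaxVio; the tensor shape is an assumption, while the formula is the paper's definition.

```python
import torch

def max_violation(load: torch.Tensor) -> float:
    """MaxVio = (max_i Load_i - mean(Load)) / mean(Load).

    `load` holds the number of tokens routed to each expert; 0 means
    perfect balance, and larger values mean worse imbalance.
    """
    load = load.float()
    return ((load.max() - load.mean()) / load.mean()).item()
```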
Results
Empirical results indicate that Loss-Free Balancing, when compared to auxiliary-loss-controlled methods, achieves:
- Lower Perplexity: Indicates better model performance.
- Better Load Balance: Demonstrates more even distribution of computational loads among experts.
The method sustained consistently good balance throughout training, avoiding the higher imbalance observed with auxiliary-loss methods. Additionally, the approach is naturally compatible with expert parallelism, a crucial feature for training extremely large MoE models.
Discussion
The paper compares Loss-Free Balancing with other load-balancing methods, particularly Expert Choice (EC). While EC guarantees perfect load balance, it violates the causal constraint of language modeling, allowing future-token leakage. This leakage compromises the model's generalization ability and the reliability of performance evaluations.
Loss-Free Balancing avoids this pitfall by adhering strictly to causal constraints, ensuring no leakage and maintaining model integrity. The superior performance and balanced load demonstrate the efficacy of the proposed approach, providing a robust framework for scaling MoE models efficiently.
Conclusion
Loss-Free Balancing is a significant advance in load balancing for Mixture-of-Experts architectures. By eliminating auxiliary losses and dynamically adjusting expert-wise biases, it delivers improved model performance alongside a balanced distribution of computational load. The method offers a promising path for scaling up MoE models while preserving training efficiency and model integrity. Future work might further optimize the bias update mechanism and explore applications to even larger models.