An Evaluation of ReMoE: A Fully Differentiable Mixture-of-Experts Model with ReLU Routing
The paper presents ReMoE, a fully differentiable Mixture-of-Experts (MoE) architecture that leverages ReLU routing to address the inherent discontinuities of the TopK routers used in conventional sparse MoE models. By making the routing function continuously differentiable, ReMoE promises improved performance and scalability while retaining efficient training and inference.
Key Contributions
- ReLU-Based Routing: ReMoE replaces the non-differentiable TopK router, whose hard cutoff hinders gradient-based optimization, with ReLU gating. Because a ReLU gate passes continuously through zero, experts transition smoothly between active and inactive states, and gradients flow through the router throughout training (see the sketch after this list).
- Load-Balancing Strategy: To regulate sparsity and manage load imbalance across experts, the authors add a sparsity-inducing L1 regularization term on the router outputs, with an adaptively tuned coefficient. This keeps the average number of active experts in check, matching the FLOPs of traditional TopK routing (also illustrated in the sketch below).
- Dynamic Resource Allocation: ReMoE demonstrates flexibility in activating different experts for varied tokens and layers, promoting more efficient resource utilization based on token complexity and domain characteristics.
- Scalability and Performance: Through comprehensive experiments, ReMoE consistently outperforms traditional TopK-based MoE models across a range of model sizes, expert counts, and levels of granularity. Notably, ReMoE exhibits superior scalability, showing steeper performance gains as the number of experts increases.
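Below is a minimal PyTorch sketch of the routing idea. The names (ReLURouter, l1_sparsity_penalty) and the coefficient value are illustrative assumptions rather than the paper's code; the sketch shows how ReLU gating yields non-negative, token-dependent sparse gates, and how an L1 penalty on those gates regulates sparsity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReLURouter(nn.Module):
    """Gate each expert with relu(W x): an expert is active for a token
    exactly when its gate is positive, and the gate varies continuously
    with the router weights (no hard TopK cutoff)."""
    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.proj = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model) -> gates: (tokens, num_experts),
        # non-negative and, once trained, mostly zero.
        return F.relu(self.proj(x))

def l1_sparsity_penalty(gates: torch.Tensor, coeff: float) -> torch.Tensor:
    # Gates are non-negative, so their L1 norm is just the sum;
    # penalizing it pushes most gates to exactly zero.
    return coeff * gates.sum(dim=-1).mean()

# Usage: gates weight the expert outputs; zero gates mean skipped experts.
tokens, d_model, num_experts = 8, 32, 4
x = torch.randn(tokens, d_model)
router = ReLURouter(d_model, num_experts)
gates = router(x)
active_per_token = (gates > 0).sum(dim=-1)   # can differ token by token
reg_loss = l1_sparsity_penalty(gates, coeff=1e-2)
```

Because the penalty controls the average number of active experts rather than fixing it per token, compute can be tuned to match a TopK budget while still varying allocation across tokens.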
Experimental Evaluation
The authors validate the proposed model across a range of configurations, varying model size, expert count, and granularity. On validation loss and downstream task metrics, ReMoE consistently surpasses standard TopK MoE models, and its advantage widens as the number of experts grows, indicating favorable scaling behavior.
ReMoE's training naturally unfolds in three stages: a dense phase in which nearly all experts fire, a sparsifying phase in which the regularization drives most gates to zero, and a stable phase in which sparsity settles near its target. This progression guides the model toward expert specialization without compromising performance; a sketch of how such an adaptive schedule might be driven follows.
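As a rough illustration, a multiplicative controller on the L1 coefficient can produce exactly this dense-to-sparse-to-stable trajectory. The rule below is a hypothetical sketch in the spirit of the paper's adaptive coefficient, not its exact update.

```python
def update_coeff(coeff: float, sparsity: float, target: float,
                 factor: float = 1.2) -> float:
    """Hypothetical multiplicative controller for the L1 coefficient:
    strengthen the penalty while the router is still too dense,
    relax it once sparsity overshoots the target."""
    return coeff * factor if sparsity < target else coeff / factor

# Measured sparsity = fraction of (token, expert) gates equal to zero.
# Early in training sparsity is near 0 (dense phase); the growing
# coefficient then drives gates to zero (sparsifying phase) until
# sparsity hovers around the target (stable phase).
```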
Implications for AI Research
The introduction of fully differentiable MoE architectures like ReMoE may lead to significant advancements in neural network scalability and efficiency. The flexibility in resource allocation presents a path for developing more specialized models that can adapt computational resources dynamically to task complexity. This adaptability is essential for advancing both autoregressive and non-autoregressive generative models.
Future Directions
Given the performance and adaptability ReMoE exhibits in these experiments, future work could explore domains beyond LLMs, including computer vision and multimodal architectures. Additionally, hardware acceleration that exploits the structured sparsity of ReLU-based routing might yield further performance improvements.
In conclusion, ReMoE advances the field of efficient large model training by resolving the gradient discontinuity issue in traditional MoE models, thereby achieving improved scalability and efficiency. This work underscores the importance of model architecture changes for future large-scale AI systems.