- The paper introduces an expert choice routing mechanism in which each expert selects the tokens it will process, rather than each token selecting experts, improving scalability and reducing computational overhead.
- The paper demonstrates that MoE models with expert choice routing achieve lower validation perplexity than dense models and outperform them on downstream GLUE and SuperGLUE benchmarks.
- The paper shows that varying the capacity factor trades computation against quality, with even reduced capacity factors remaining competitive, suggesting new operating regimes for large-scale neural networks.
Mixture-of-Experts with Expert Choice Routing
Introduction
This paper investigates mixture-of-experts (MoE) models with expert choice routing, presenting advances in both efficiency and quality when scaling large neural networks. The proposed method dynamically matches computation to specialized experts, aiming to improve transformer-based architectures while avoiding unnecessary computational overhead. The approach is particularly relevant in an era of ever-growing model sizes: it makes better use of a fixed compute budget and improves model expressiveness by adjusting capacity through the routing itself.
Expert Choice Routing Mechanism
The core contribution of the paper is expert choice routing. In contrast to conventional token-choice gating, where each token picks its top-scoring experts, here each expert selects the tokens it will process; every expert therefore receives an equal number of tokens, while an individual token may be handled by a variable number of experts. As in other MoE models, only the selected experts are activated for a given token, so computation is concentrated on the relevant sub-networks rather than the full dense model, allowing complex patterns to be captured with fewer resources than a comparably sized dense architecture. This gating scheme is benchmarked against existing solutions such as GShard top-2 gating and exhibits superior validation perplexity during pre-training (Figure 1).
Figure 1: Validation perplexity during pre-training using various expert choice methods and top-2 gating.
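To make the selection step concrete, the following is a minimal NumPy sketch of expert choice routing as described above: each expert scores all tokens in a group and keeps its own top-k, instead of each token keeping its top experts. The function name `expert_choice_route`, the router projection `w_gate`, and the exact score normalization are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def expert_choice_route(tokens, w_gate, capacity_factor=2.0):
    """Sketch of expert choice routing: each expert picks its top-k tokens.

    tokens : (n, d) token representations for one group/batch
    w_gate : (d, e) learned router projection (hypothetical name)
    """
    n, d = tokens.shape
    e = w_gate.shape[1]
    # Per-expert buffer size: on average each token is processed by
    # `capacity_factor` experts (k = n * c / e).
    k = max(1, int(n * capacity_factor / e))

    # Token-to-expert affinity scores, normalized over experts.
    scores = softmax(tokens @ w_gate, axis=-1)            # (n, e)

    # Each expert (column) selects its k highest-scoring tokens.
    top_idx = np.argsort(-scores, axis=0)[:k, :]          # (k, e) token indices
    gates = np.take_along_axis(scores, top_idx, axis=0)   # (k, e) gating weights

    # Dispatch mask: dispatch[i, j, t] = 1 if expert j's t-th slot holds token i.
    dispatch = np.zeros((n, e, k))
    for j in range(e):
        dispatch[top_idx[:, j], j, np.arange(k)] = 1.0
    return top_idx, gates, dispatch
```

Because every expert fills exactly k slots, load balance holds by construction without an auxiliary balancing loss, while the number of experts serving a given token can vary adaptively with its routing scores.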
Comparative Analysis with Dense Models
The paper provides a thorough comparison between a dense 8-billion-parameter model and the proposed MoE model with expert choice routing. Across tasks in the GLUE and SuperGLUE benchmarks, the expert choice MoE model shows consistent improvements over the dense architecture, most notably on tasks requiring nuanced language understanding and reasoning, where fine-tuning results show a substantial margin. The gains in pre-training perplexity largely carry over to downstream task performance, underscoring the value of efficient capacity utilization.
Capacity Factor and Its Implications
An integral aspect of expert choice routing explored in the paper is the capacity factor (CF), which determines the average number of experts that process a single token. The paper analyzes configurations ranging from EC-CF2 down to EC-CF0.5 and finds that even reduced capacity factors maintain competitive performance. The EC-CF2 configuration, which matches the computational footprint of GShard top-2 gating, consistently delivers better quality at equal cost, which is of practical value when deploying these models at scale. Notably, EC-CAP2 and EC-CAP3, variants that cap the number of experts per token at two and three respectively, still surpass top-2 gating in perplexity, reinforcing the efficacy of adaptive expert allocation. A small worked example of how the capacity factor translates into per-expert workload follows below.
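The example below translates the capacity factor into a per-expert buffer size using k = n·c/e, where n is the number of tokens in a group, c the capacity factor, and e the number of experts. The batch size and expert count are made up for illustration, and the helper name `expert_buffer_size` is hypothetical.

```python
def expert_buffer_size(n_tokens, n_experts, capacity_factor):
    """Tokens each expert processes per group: k = n * c / e."""
    return int(n_tokens * capacity_factor / n_experts)

# Hypothetical setup: 8192 tokens in a group, 64 experts.
n, e = 8192, 64
for name, c in [("EC-CF2", 2.0), ("EC-CF1", 1.0), ("EC-CF0.5", 0.5)]:
    k = expert_buffer_size(n, e, c)
    # Total expert-token pairs processed is e * k = n * c, so EC-CF2
    # matches the compute of top-2 (token-choice) gating, where every
    # token is sent to exactly two experts.
    print(f"{name}: each expert processes k={k} tokens; "
          f"average experts per token = {e * k / n:.1f}")
```

Under this arithmetic, EC-CF0.5 halves the expert computation relative to top-1 gating, which is why the reported competitiveness of low capacity factors is practically interesting.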
Theoretical and Practical Implications
The implications of this research are both theoretical and practical. Theoretically, it advances the understanding of MoE architectures by showing how adaptive expert allocation preserves scalability and quality. Practically, expert choice routing offers a compelling alternative to traditional architectures, easing computational limitations while improving expressiveness and accuracy, and it points toward more efficient training regimes and deployment strategies for increasingly large models.
Conclusion
In summary, the paper advances mixture-of-experts modeling by introducing an expert choice routing mechanism. The approach shows clear benefits over dense models and other gating schemes in both quality and computational efficiency. The study of capacity factors and capped expert allocation further supports the method's viability and opens avenues for future work on adaptive architectures and efficient resource utilization. Taken together, these results suggest that expert choice routing could become a standard tool in large-scale neural network training and deployment.