From Sparse to Soft Mixtures of Experts
The paper "From Sparse to Soft Mixtures of Experts" presents the Soft Mixture of Experts (Soft MoE), an architecture designed to improve the efficiency and scalability of Transformer-based models by addressing several limitations of traditional sparse Mixtures of Experts (MoEs).
Sparse MoEs make it possible to scale model capacity without a proportional increase in compute. However, they suffer from training instability, token dropping, ineffective fine-tuning, and difficulty scaling the number of experts.
Key Contributions
The paper introduces Soft MoE as a fully differentiable sparse Transformer that overcomes these issues while preserving the benefits of traditional MoEs. The primary innovation in Soft MoE is its approach to routing and processing tokens. Instead of discrete, sparse routing of individual tokens, Soft MoE performs an implicit soft assignment: each expert processes a small number of slots, and each slot is a weighted combination of all input tokens (a minimal routing sketch follows the list below). This leads to several significant improvements:
- Improved Token Utilization: Because every slot is a weighted average over all input tokens, no token is ever dropped, avoiding the token dropping and expert imbalance that affect discrete top-k routing.
- Scalability: Soft MoE scales effectively to thousands of experts, adding substantial parameter capacity with minimal impact on inference time. For instance, a Soft MoE Huge/14 model with 128 experts has over 40 times more parameters than the ViT Huge/14 baseline, while its inference time grows by only about 2%.
- Performance Gains: Empirical results show that Soft MoE models outperform standard Vision Transformers (ViTs) and other MoE variants on visual recognition tasks. For example, Soft MoE-Base/16 matches the performance of ViT-Huge/14 at a substantially lower inference cost.
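To make the soft dispatch/combine routing concrete, below is a minimal NumPy sketch of a single Soft MoE layer based on the description above. The function and variable names (`soft_moe_layer`, `phi`, the toy one-layer "experts") are illustrative assumptions, not the paper's reference implementation; the structure it shows is the one summarized here: per-slot softmax over tokens to build slots, per-expert processing of slots, and per-token softmax over slots to combine the outputs.

```python
# Minimal Soft MoE routing sketch (NumPy). Shapes and the toy experts are
# illustrative assumptions; only the dispatch/combine structure follows the
# paper's description summarized above.
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def soft_moe_layer(X, phi, expert_weights, slots_per_expert=1):
    """X: (n_tokens, d) inputs; phi: (d, n_slots) learnable slot parameters;
    expert_weights: list of (d, d) matrices standing in for the expert MLPs."""
    logits = X @ phi                          # (n_tokens, n_slots)

    # Dispatch: each slot is a convex combination of *all* input tokens
    # (softmax over the token axis), so no token is ever dropped.
    dispatch = softmax(logits, axis=0)        # (n_tokens, n_slots)
    slots = dispatch.T @ X                    # (n_slots, d)

    # Each expert processes only its own slots; compute scales with the
    # number of slots, not with the number of experts.
    outs = np.empty_like(slots)
    for s in range(slots.shape[0]):
        expert = expert_weights[s // slots_per_expert]
        outs[s] = np.tanh(slots[s] @ expert)  # toy expert: one dense layer

    # Combine: each output token is a convex combination of all slot outputs
    # (softmax over the slot axis).
    combine = softmax(logits, axis=1)         # (n_tokens, n_slots)
    return combine @ outs                     # (n_tokens, d)

# Tiny usage example with random parameters (one slot per expert).
rng = np.random.default_rng(0)
n_tokens, d, n_experts = 16, 8, 4
X = rng.normal(size=(n_tokens, d))
phi = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
print(soft_moe_layer(X, phi, experts).shape)  # (16, 8)
```

Because the per-layer compute is governed by the (fixed) number of slots rather than the number of experts, adding experts grows the parameter count while leaving inference cost nearly unchanged, which is the scaling behavior noted in the list above.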
Numerical Results and Performance
The paper reports that Soft MoE-Base/16 matches ViT-Huge/14 with a 10.5× reduction in inference cost and a 5.7× reduction in wall-clock time, making it a highly efficient alternative. This reduction in computational overhead does not come at the expense of quality: Soft MoE models achieve competitive or superior results on upstream, few-shot, and fine-tuning evaluations.
Moreover, the model's capabilities are further demonstrated through extensive experiments on image classification and contrastive learning tasks. These experiments confirm Soft MoE's ability to retain the strengths of sparse architectures while mitigating their weaknesses.
Implications and Future Directions
The introduction of Soft MoE marks a significant development in the optimization of Transformer architectures. By replacing the discrete routing of traditional MoEs with a differentiable soft assignment, Soft MoE paves the way for robust, scalable models that are potentially applicable to domains beyond visual recognition.
The theoretical implications suggest a reevaluation of how routing and expert assignment are approached in large neural models. Practically, the substantial reduction in inference time opens up promising opportunities for deployment in resource-constrained environments.
Future research could explore extending Soft MoE to auto-regressive tasks, which is nontrivial because each slot mixes information from all input tokens and thus conflicts with causal decoding. Further work on the soft routing mechanism and slot utilization could yield additional efficiency improvements.
In conclusion, Soft MoE offers a compelling solution to the limitations of traditional MoEs, providing a promising pathway forward in scalable and efficient model design for complex tasks.