Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
Abstract
The paper, authored by Fedus, Zoph, and Shazeer, introduces the Switch Transformer, a sparsely activated Mixture-of-Experts (MoE) model that scales efficiently to trillion-parameter architectures. It addresses key challenges in existing MoE models, namely routing complexity, communication overhead, and training instability. By activating only a small fraction of the weights for each token, the Switch Transformer achieves substantial pre-training speedups at a fixed computational budget. Its benefits extend to multilingual settings, with consistent gains across all 101 languages of the mC4 corpus.
Introduction
The Switch Transformer employs sparsity by activating only a subset of the model's parameters for each input token, diverging from the standard dense approach in which the same parameters are applied to every input. The core motivation is to combine the computational efficiency of sparse activation with the capacity that comes from very large parameter counts. This decouples parameter count from compute: in a dense architecture, growing the model directly increases the FLOPs spent on every example, whereas a sparse model can grow its parameters while holding per-token FLOPs roughly constant.
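To make the decoupling concrete, the back-of-the-envelope sketch below compares a dense feed-forward (FFN) layer with a Switch layer holding many expert FFNs; the layer sizes are illustrative choices, not configurations taken from the paper.

```python
# Illustrative arithmetic: a Switch layer with E expert FFNs stores E times
# the parameters of a single dense FFN, but each token is still processed by
# exactly one expert, so per-token FLOPs are unchanged (ignoring the small
# router matmul).
d_model, d_ff, num_experts = 768, 3072, 128   # hypothetical layer sizes

dense_ffn_params = 2 * d_model * d_ff                 # W_in + W_out
switch_layer_params = num_experts * dense_ffn_params  # E expert FFNs

flops_per_token_dense = 2 * dense_ffn_params          # ~2 FLOPs per multiply-add
flops_per_token_switch = flops_per_token_dense        # still one expert per token

print(f"parameters: {switch_layer_params // dense_ffn_params}x larger")        # 128x
print(f"per-token FLOPs: {flops_per_token_switch // flops_per_token_dense}x")  # 1x
```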
Architecture and Training
The Switch Transformer simplifies MoE routing by sending each token to a single expert (top-1 routing), reducing both the computation in the router and the communication across devices. It also introduces training techniques to mitigate instability, notably selective precision in the router and a reduced weight-initialization scale. Sparse variants of T5-Base and T5-Large with up to 128 experts achieve up to 7x pre-training speedups over their dense counterparts while using the same FLOPs per token.
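The sketch below shows one way the top-1 routing and the paper's load-balancing auxiliary loss can be expressed. It is a minimal single-device illustration in JAX with hypothetical function and argument names, no expert-capacity limit, and no expert parallelism; the paper's actual implementation uses Mesh-TensorFlow with sharded experts.

```python
import jax
import jax.numpy as jnp

def switch_ffn(tokens, router_w, expert_w_in, expert_w_out, aux_coef=0.01):
    """tokens: [T, d_model]; router_w: [d_model, E];
    expert_w_in: [E, d_model, d_ff]; expert_w_out: [E, d_ff, d_model]."""
    num_experts = router_w.shape[1]

    # Router: a softmax over experts; each token is sent to its top-1 expert.
    logits = tokens @ router_w                                       # [T, E]
    probs = jax.nn.softmax(logits, axis=-1)                          # [T, E]
    expert_idx = jnp.argmax(probs, axis=-1)                          # [T]
    gate = jnp.take_along_axis(probs, expert_idx[:, None], axis=-1)  # [T, 1]

    # Dense gather of the chosen expert's weights. A real implementation
    # dispatches tokens to sharded experts (with a capacity limit) instead
    # of gathering per-token weight matrices like this.
    w_in = expert_w_in[expert_idx]                                   # [T, d_model, d_ff]
    w_out = expert_w_out[expert_idx]                                 # [T, d_ff, d_model]
    hidden = jax.nn.relu(jnp.einsum('td,tdf->tf', tokens, w_in))
    out = gate * jnp.einsum('tf,tfd->td', hidden, w_out)             # [T, d_model]

    # Load-balancing auxiliary loss from the paper: alpha * E * sum_i f_i * P_i,
    # where f_i is the fraction of tokens routed to expert i and P_i is the
    # mean router probability assigned to expert i (alpha = 1e-2).
    f = jnp.mean(jax.nn.one_hot(expert_idx, num_experts), axis=0)    # [E]
    p = jnp.mean(probs, axis=0)                                      # [E]
    aux_loss = aux_coef * num_experts * jnp.sum(f * p)
    return out, aux_loss
```

Scaling the expert output by the selected router probability (the gate) is what keeps the router differentiable even though the expert choice itself is a hard argmax.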
Empirical Results
- Scalability:
- Sparse models keep per-token FLOPs constant as the parameter count grows, because each token activates only the parameters of its selected expert.
- Notable efficiency at scale: Switch-XXL (395B parameters) and Switch-C (1.6T parameters) reach a 4x pre-training speedup over the dense T5-XXL baseline (11B parameters) while attaining lower perplexity.
- Training Techniques:
- Selective Precision: Casting the router computation to float32 while keeping bfloat16 for all other operations stabilizes training without a significant speed penalty.
- Initialization: Reducing the weight-initialization scale (by a factor of 10) markedly improves training stability, which is crucial for models with hundreds of billions of parameters. Both techniques are sketched in the code after this list.
- Multilingual Efficiency:
- When pre-trained on the mC4 dataset, the Switch Transformer improves over the dense mT5-Base baseline across all 101 languages, with 91% of languages seeing at least a 4x speedup.
- Distillation:
- The paper also explores distilling large sparse models into smaller dense versions, preserving roughly 30% of the sparse teacher's quality gains while removing the vast majority of its parameters. This makes the benefits of trillion-parameter models far easier to deploy.
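As a rough illustration of the two stabilization techniques listed above, the sketch below upcasts only the router computation to float32 and draws weights from a truncated normal with a reduced scale. Function names and shapes are hypothetical; this is a minimal JAX sketch, not the paper's Mesh-TensorFlow implementation.

```python
import jax
import jax.numpy as jnp

def route_with_selective_precision(tokens_bf16, router_w_bf16):
    # Upcast only the router matmul and softmax to float32; the rest of the
    # model stays in bfloat16. The result is cast back before dispatch, so
    # cross-device communication still happens in bfloat16.
    logits = tokens_bf16.astype(jnp.float32) @ router_w_bf16.astype(jnp.float32)
    probs = jax.nn.softmax(logits, axis=-1)
    return probs.astype(jnp.bfloat16)

def reduced_scale_init(key, fan_in, fan_out, scale=0.1):
    # Truncated normal with std sqrt(scale / fan_in). The usual Transformer
    # scale of 1.0 is reduced by a factor of 10, following the paper's recipe.
    std = jnp.sqrt(scale / fan_in)
    return std * jax.random.truncated_normal(key, -2.0, 2.0, (fan_in, fan_out))
```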
Practical Implications
The Switch Transformer reshapes expectations for large-scale model training, making trillion-parameter models practical without the prohibitive computational costs typically associated with such scales. It demonstrates that models can benefit substantially from larger parameter counts without proportional increases in compute, provided the sparsity is managed efficiently. This translates to improved performance across a range of NLP tasks, from classification benchmarks such as sentiment analysis to reasoning and multilingual learning.
Future Directions
While the presented architecture and techniques are robust, further enhancements are suggested:
- Enhanced stability for extremely large models, particularly those employing extensive model parallelism.
- Deeper understanding and refinement of fine-tuning protocols to maximize downstream task performance.
- Improved adaptation strategies for deploying such large models in diverse computational environments.
Conclusion
The Switch Transformer marks a significant stride in scaling model architectures efficiently. Its design and empirical validation underscore the versatility and effectiveness of sparsity in large-scale neural networks, setting a precedent for building exceedingly large yet computationally efficient models.