Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
This paper introduces "Sparse Upcycling," a method for converting pretrained dense neural networks into sparsely activated Mixture-of-Experts (MoE) models by reusing their existing checkpoints. As deep neural networks continue to grow in size and complexity, their training costs become a crucial consideration. While sparse models offer efficiency gains by activating only a fraction of their parameters for each input, training them from scratch is still expensive. This paper proposes an efficient alternative that leverages already-trained dense models.
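To make "sparsely activated" concrete, the sketch below shows how an MoE layer routes each token to a small subset of experts so that only a few experts run per token. This is a minimal illustration, not the paper's implementation; the expert count, top-k value, and function names are assumptions chosen for clarity.

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """Route each token to its top-k experts and combine their outputs.

    x:        (num_tokens, d_model) token representations
    router_w: (d_model, num_experts) router projection (illustrative)
    experts:  list of callables, one per expert (e.g. small MLPs)
    k:        number of experts activated per token (the sparsity)
    """
    logits = x @ router_w                            # (tokens, experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)       # softmax over experts

    topk = np.argsort(-probs, axis=-1)[:, :k]        # chosen expert indices
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in topk[t]:
            # Only k of the experts run for this token -> sparse activation.
            out[t] += probs[t, e] * experts[e](x[t])
    return out

# Example: 4 identity experts, top-2 routing over 3 tokens of width 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
router_w = rng.normal(size=(8, 4))
experts = [lambda h: h for _ in range(4)]
print(moe_forward(x, router_w, experts).shape)  # (3, 8)
```

Because each token touches only k experts, the layer's total parameter count can grow with the number of experts while the compute per token stays roughly constant.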
Key Contributions
The paper's central innovation is initializing a sparsely activated MoE model from a dense network checkpoint. This reuses the computation already invested in the dense model, so the sunk cost of dense pretraining is not wasted. Notably, upcycled models outperform their densely trained counterparts while requiring roughly half the pretraining compute.
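The core recipe can be pictured with a few lines of initialization code. The sketch below is illustrative, not the authors' code: it turns a dense feed-forward block into an MoE block whose experts are all copies of the dense weights, with a freshly initialized router, while all other layers (attention, embeddings) would be carried over unchanged.

```python
import numpy as np

def upcycle_ffn(dense_ffn, num_experts, d_model, seed=0):
    """Initialize an MoE layer from a dense FFN checkpoint (sketch).

    dense_ffn:   dict of the dense layer's weights, e.g. {"w_in": ..., "w_out": ...}
    num_experts: how many expert copies to create
    d_model:     model width, used to size the new router
    """
    rng = np.random.default_rng(seed)
    return {
        # Every expert starts as an exact copy of the dense FFN, so the
        # upcycled model initially behaves much like the dense model.
        "experts": [
            {name: w.copy() for name, w in dense_ffn.items()}
            for _ in range(num_experts)
        ],
        # The router is the only newly initialized component.
        "router": rng.normal(scale=0.02, size=(d_model, num_experts)),
    }
```

Because every expert begins identical to the dense FFN and only the router is new, the upcycled model starts training near the dense model's quality rather than from scratch, which is where the compute savings come from.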
Numerical Results:
- Upcycled T5 models (Base, Large, XL) and Vision Transformers outperform their dense counterparts on benchmarks such as SuperGLUE and ImageNet.
- Upcycled models also surpass sparse MoE models trained from scratch, even when the training budget exceeds the original dense pretraining cost by 50%.
The efficiency of this approach is demonstrated across various model architectures and sizes, revealing significant improvements in upstream (pretraining) and downstream (finetuning) performance metrics.
Practical and Theoretical Implications
Practically, this work provides a pathway for researchers constrained by computational resources to access high-performance models. The proposed strategy leverages prior investment in dense models, upgrading them to higher-capacity sparse models without a proportional increase in compute cost.
From a theoretical perspective, this research highlights the potential of rethinking model architecture post-training. It emphasizes a shift in focus from merely scaling parameters to optimizing training paths and resource allocations.
Future Directions
This work may stimulate further exploration into dynamic model architectures, particularly in how networks evolve with varying computational budgets. Potential developments could include adaptive strategies for model sparsity, leveraging upcycled models for domain-specific tasks, and further improvements in routing algorithms to enhance MoE efficiency.
The paper positions Sparse Upcycling as a promising direction in AI research, one that addresses both the scalability challenges and resource constraints inherent in modern deep learning. The method thereby contributes to the broader goal of making advanced AI tools more accessible and efficient.