Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints
This paper introduces "Sparse Upcycling," a method for converting pretrained dense neural networks into sparsely activated Mixture-of-Experts (MoE) models by reusing their existing checkpoints. As deep neural networks continue to grow in size and complexity, their training costs become a crucial consideration. While sparse models offer efficiency gains by activating only a fraction of their parameters for each input, training them from scratch is still expensive. This paper proposes an efficient alternative that leverages already-trained dense models.
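To make "sparsely activated" concrete, the sketch below shows how an MoE layer routes each token to a small subset of experts so that only a few experts run per token. This is a minimal illustration, not the paper's implementation; the expert count, top-k value, and function names are assumptions chosen for clarity.

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """Route each token to its top-k experts and combine their outputs.

    x:        (num_tokens, d_model) token representations
    router_w: (d_model, num_experts) router projection (illustrative)
    experts:  list of callables, one per expert (e.g. small MLPs)
    k:        number of experts activated per token (the sparsity)
    """
    logits = x @ router_w                            # (tokens, experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)       # softmax over experts

    topk = np.argsort(-probs, axis=-1)[:, :k]        # chosen expert indices
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for e in topk[t]:
            # Only k of the experts run for this token -> sparse activation.
            out[t] += probs[t, e] * experts[e](x[t])
    return out

# Example: 4 identity experts, top-2 routing over 3 tokens of width 8.
rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
router_w = rng.normal(size=(8, 4))
experts = [lambda h: h for _ in range(4)]
print(moe_forward(x, router_w, experts).shape)  # (3, 8)
```

Because each token touches only k experts, the layer's total parameter count can grow with the number of experts while the compute per token stays roughly constant.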
Key Contributions
The paper's central innovation is initializing a sparsely activated MoE model from a dense network checkpoint. This reuses the computation already invested in the dense model, so the sunk cost of dense pretraining is not wasted. Notably, upcycled models outperform their densely trained counterparts while requiring roughly half the pretraining compute.
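The core recipe can be pictured with a few lines of initialization code. The sketch below is illustrative, not the authors' code: it turns a dense feed-forward block into an MoE block whose experts are all copies of the dense weights, with a freshly initialized router, while all other layers (attention, embeddings) would be carried over unchanged.

```python
import numpy as np

def upcycle_ffn(dense_ffn, num_experts, d_model, seed=0):
    """Initialize an MoE layer from a dense FFN checkpoint (sketch).

    dense_ffn:   dict of the dense layer's weights, e.g. {"w_in": ..., "w_out": ...}
    num_experts: how many expert copies to create
    d_model:     model width, used to size the new router
    """
    rng = np.random.default_rng(seed)
    return {
        # Every expert starts as an exact copy of the dense FFN, so the
        # upcycled model initially behaves much like the dense model.
        "experts": [
            {name: w.copy() for name, w in dense_ffn.items()}
            for _ in range(num_experts)
        ],
        # The router is the only newly initialized component.
        "router": rng.normal(scale=0.02, size=(d_model, num_experts)),
    }
```

Because every expert begins identical to the dense FFN and only the router is new, the upcycled model starts training near the dense model's quality rather than from scratch, which is where the compute savings come from.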
Numerical Results:
- Upcycled T5 models (Base, Large, XL) and Vision Transformers outperform their dense counterparts on benchmarks such as SuperGLUE and ImageNet.
- Upcycled models also surpass sparse MoE models trained from scratch, even when the training budget exceeds the original dense pretraining cost by 50%.
The efficiency of this approach is demonstrated across various model architectures and sizes, revealing significant improvements in upstream (pretraining) and downstream (finetuning) performance metrics.
Practical and Theoretical Implications
Practically, this work provides a pathway for researchers constrained by computational resources to access high-performance models. The proposed strategy leverages prior investment in dense models, upgrading them to higher-capacity sparse models without a proportional increase in compute cost.
From a theoretical perspective, this research highlights the potential of rethinking model architecture post-training. It emphasizes a shift in focus from merely scaling parameters to optimizing training paths and resource allocations.
Future Directions
This work may stimulate further exploration into dynamic model architectures, particularly in how networks evolve with varying computational budgets. Potential developments could include adaptive strategies for model sparsity, leveraging upcycled models for domain-specific tasks, and further improvements in routing algorithms to enhance MoE efficiency.
The paper positions Sparse Upcycling as a promising direction in AI research, one that addresses both the scalability challenges and resource constraints inherent in modern deep learning. The method thereby contributes to the broader goal of making advanced AI tools more accessible and efficient.