Revisiting MoE and Dense Speed-Accuracy Comparisons for LLM Training (2405.15052v2)
Abstract: Mixture-of-Experts (MoE) enjoys performance gains by increasing model capacity while keeping computation cost constant. When comparing MoE to dense models, prior work typically adopts the following setting: 1) use FLOPs or activated parameters as a measure of model complexity; 2) train all models to the same number of tokens. We argue that this setting favors MoE, as FLOPs and activated parameters do not accurately measure the communication overhead in sparse layers, leading to a larger actual training budget for MoE. In this work, we revisit these settings by adopting step time as a more accurate measure of model complexity, and by determining the total compute budget under the Chinchilla compute-optimal setting. To run MoE efficiently on modern accelerators, we adopt a 3D sharding method that keeps the dense-to-MoE step-time increase within a healthy range. We evaluate MoE and dense LLMs on a set of nine 0-shot and two 1-shot English tasks, as well as MMLU 5-shot and GSM8K 8-shot, across three model scales at 6.4B, 12.6B, and 29.6B. Experimental results show that even under these settings, MoE models consistently outperform dense LLMs on the speed-accuracy trade-off curve with meaningful gaps. Our full model implementation and sharding strategy have been released at \url{https://github.com/apple/axlearn}.
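As a rough illustration of the budget-matching idea described in the abstract (a minimal sketch, not the paper's actual methodology), the snippet below computes a Chinchilla-style token budget for a dense model using the common heuristic of roughly 20 training tokens per parameter, then derives how many tokens an MoE variant can see within the same wall-clock budget when its step time is higher. All function names and numbers here are hypothetical.

```python
# Hypothetical sketch: compare dense vs. MoE training budgets by wall-clock
# time (step time) rather than by token count alone.

def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Chinchilla-style compute-optimal token budget for a dense model."""
    return tokens_per_param * n_params

def matched_moe_tokens(dense_params: float,
                       dense_step_time_s: float,
                       moe_step_time_s: float,
                       tokens_per_step: float) -> float:
    """Tokens an MoE model can train on in the same wall-clock budget as the dense model."""
    dense_tokens = chinchilla_tokens(dense_params)
    dense_steps = dense_tokens / tokens_per_step
    wall_clock_budget_s = dense_steps * dense_step_time_s
    moe_steps = wall_clock_budget_s / moe_step_time_s
    return moe_steps * tokens_per_step

if __name__ == "__main__":
    # Illustrative numbers only: a 6.4B-parameter dense model, 4M tokens per
    # step, and an assumed 20% step-time overhead for the MoE variant.
    dense_params = 6.4e9
    moe_tokens = matched_moe_tokens(dense_params,
                                    dense_step_time_s=1.0,
                                    moe_step_time_s=1.2,
                                    tokens_per_step=4e6)
    print(f"Dense token budget: {chinchilla_tokens(dense_params):.3e}")
    print(f"MoE token budget at equal wall-clock time: {moe_tokens:.3e}")
```

Under this accounting, a slower per-step MoE model trains on fewer tokens than a dense model given the same wall-clock budget, which is the effect the step-time-based comparison is meant to capture.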
Authors: Xianzhi Du, Tom Gunter, Xiang Kong, Mark Lee, Zirui Wang, Aonan Zhang, Nan Du, Ruoming Pang