Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping (2404.19429v1)
Abstract: The Mixture-of-Experts (MoE) technique plays a crucial role in expanding the size of DNN model parameters. However, it suffers from extended all-to-all communication latency during training. Existing methods attempt to mitigate this issue by overlapping all-to-all with expert computation, but they frequently fall short of achieving sufficient overlap, limiting the potential performance gains. In our study, we extend the scope of this challenge by considering overlap at the whole training graph level. During the forward pass, we enable non-MoE computations to overlap with all-to-all through careful partitioning and pipelining. In the backward pass, we achieve overlap with all-to-all by scheduling weight gradient computations. We implement these techniques in Lancet, a system that uses compiler-based optimization to automatically accelerate MoE model training. Our extensive evaluation shows that Lancet reduces non-overlapped communication time by as much as 77% and achieves an end-to-end speedup of up to 1.3x over state-of-the-art solutions.
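To make the forward-pass idea concrete, below is a minimal sketch, not Lancet's compiler-generated schedule, of how partitioning a token batch into chunks lets the all-to-all dispatch of one chunk run asynchronously while computation proceeds on another. It assumes PyTorch with a NCCL process group already initialized, equal token counts across ranks, and illustrative names such as `expert_fn`, `pipelined_moe_dispatch`, and `num_chunks` that are not from the paper.

```python
# Illustrative sketch of chunk-wise all-to-all / computation overlap.
# Assumes dist.init_process_group has been called and tokens divide
# evenly across chunks and ranks; not Lancet's actual implementation.

import torch
import torch.distributed as dist

def pipelined_moe_dispatch(tokens: torch.Tensor, expert_fn, num_chunks: int = 4):
    """Overlap all-to-all dispatch with per-chunk computation.

    tokens:    [num_tokens, hidden] tensor already permuted for dispatch
    expert_fn: callable applied to each received chunk (stand-in for expert
               MLPs or, as in Lancet's forward pass, partitioned non-MoE ops)
    """
    chunks = list(tokens.chunk(num_chunks, dim=0))
    recv_bufs, handles, outputs = [], [], []

    # Kick off the asynchronous all-to-all for the first chunk.
    recv = torch.empty_like(chunks[0])
    handles.append(dist.all_to_all_single(recv, chunks[0].contiguous(), async_op=True))
    recv_bufs.append(recv)

    for i in range(num_chunks):
        # Launch communication for the next chunk before computing on the
        # current one, so the two can proceed concurrently.
        if i + 1 < num_chunks:
            nxt = torch.empty_like(chunks[i + 1])
            handles.append(
                dist.all_to_all_single(nxt, chunks[i + 1].contiguous(), async_op=True)
            )
            recv_bufs.append(nxt)

        # Wait only for the chunk we are about to use; the next chunk's
        # all-to-all remains in flight while expert_fn runs.
        handles[i].wait()
        outputs.append(expert_fn(recv_bufs[i]))

    return torch.cat(outputs, dim=0)
```

The backward-pass technique described in the abstract is complementary: weight gradient computations have no data dependence on the all-to-all, so they can be rescheduled to fill its latency. The sketch above only illustrates the forward-pass chunking.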