Lancet: Accelerating Mixture-of-Experts Training via Whole Graph Computation-Communication Overlapping (2404.19429v1)

Published 30 Apr 2024 in cs.DC and cs.LG

Abstract: The Mixture-of-Experts (MoE) technique plays a crucial role in expanding the size of DNN model parameters. However, it suffers from long all-to-all communication latency during training. Existing methods attempt to mitigate this issue by overlapping all-to-all with expert computation, but they frequently fall short of achieving sufficient overlap, limiting the potential performance gains. In our study, we broaden the scope of this challenge by considering overlap at the level of the whole training graph. During the forward pass, we enable non-MoE computations to overlap with all-to-all through careful partitioning and pipelining. In the backward pass, we achieve overlap with all-to-all by scheduling weight-gradient computations. We implement these techniques in Lancet, a system that uses compiler-based optimization to automatically enhance MoE model training. Our extensive evaluation shows that Lancet reduces non-overlapped communication time by as much as 77% and achieves an end-to-end speedup of up to 1.3x compared to state-of-the-art solutions.
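
The abstract describes overlapping all-to-all dispatch with computation by partitioning work into chunks and pipelining. Below is a minimal PyTorch sketch of that general pattern only, not Lancet's compiler-generated schedule: the helpers non_moe_compute and expert_ffn are hypothetical placeholders, and it assumes torch.distributed has already been initialized with a NCCL process group and that token chunks are split equally across ranks.

# Minimal sketch of forward-pass pipelining: split the dispatched tokens into
# chunks, issue each all-to-all asynchronously, and run independent (non-MoE)
# computation while the communication is in flight. Illustrative only; Lancet
# derives such schedules automatically via compiler passes.
import torch
import torch.distributed as dist

def pipelined_dispatch(token_chunks, non_moe_compute, expert_ffn):
    """token_chunks: list of [tokens, hidden] tensors already permuted for
    all-to-all dispatch. non_moe_compute and expert_ffn are placeholders for
    computation that is independent of, respectively dependent on, the
    dispatched tokens."""
    handles = []
    for x in token_chunks:
        recv = torch.empty_like(x)
        # async_op=True returns a Work handle instead of blocking the host,
        # so the all-to-all proceeds on the communication stream.
        work = dist.all_to_all_single(recv, x, async_op=True)
        handles.append((work, recv))
        # Computation with no dependence on the dispatched tokens (e.g. a
        # partition of the non-MoE layers) runs here, hiding the latency.
        non_moe_compute()
    outputs = []
    for work, recv in handles:
        work.wait()  # the tokens for this chunk have arrived
        outputs.append(expert_ffn(recv))
    return outputs

For the backward pass, the paper similarly hides all-to-all behind weight-gradient computations; in an eager framework this would roughly correspond to reordering the dW kernels behind the asynchronous communication, which the sketch above does not show.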
