MoESys: A Distributed and Efficient Mixture-of-Experts Training and Inference System for Internet Services (2205.10034v3)

Published 20 May 2022 in cs.DC and cs.AI

Abstract: While modern internet services, such as chatbots, search engines, and online advertising, demand the use of large-scale deep neural networks (DNNs), distributed training and inference over heterogeneous computing systems are desired to facilitate these DNN models. Mixture-of-Experts (MoE) is one of the most common strategies to lower the cost of training subject to the overall size of models/data through gating and parallelism in a divide-and-conquer fashion. While DeepSpeed has made efforts in carrying out large-scale MoE training over heterogeneous infrastructures, the efficiency of training and inference could be further improved from several system aspects, including load balancing, communication/computation efficiency, and memory footprint limits. In this work, we present a novel MoESys that boosts efficiency in both large-scale training and inference. Specifically, in the training procedure, the proposed MoESys adopts an Elastic MoE training strategy with 2D prefetch and Fusion communication over Hierarchical storage, so as to enjoy efficient parallelisms. For scalable inference in a single node, especially when the model size is larger than GPU memory, MoESys builds the CPU-GPU memory jointly into a ring of sections to load the model, and executes the computation tasks across the memory sections in a round-robin manner for efficient inference. We carried out extensive experiments to evaluate MoESys, where MoESys successfully trains a Unified Feature Optimization (UFO) model with a Sparsely-Gated Mixture-of-Experts model of 12B parameters in 8 days on 48 A100 GPU cards. The comparison against the state-of-the-art shows that MoESys outperformed DeepSpeed with 33% higher throughput (tokens per second) in training and 13% higher throughput in inference in general. Particularly, under unbalanced MoE Tasks, e.g., UFO, MoESys achieved 64% higher throughput with 18% lower memory footprints.

Introduction

The paper introduces MoESys, a framework designed to improve the efficiency and scalability of distributed training and inference for Mixture-of-Experts (MoE) models. MoE models make it possible to train larger models under limited computational budgets by activating only a subset of parameters (experts) for each input. DeepSpeed has made strides in this area, but the authors argue that further improvements are possible, particularly in load balancing, communication and computation efficiency, and memory footprint limits.
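To make the sparse-activation idea concrete, here is a minimal sketch of top-1 gated MoE routing in PyTorch. The module name, layer sizes, and gating details are illustrative assumptions and do not reproduce the paper's implementation.

```python
# Minimal sketch of sparsely-gated MoE routing (top-1 gating).
# All names and sizes (TinyMoE, d_model, num_experts, ...) are illustrative
# assumptions, not the MoESys implementation.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=256, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)      # router that scores experts
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                                # x: [tokens, d_model]
        scores = torch.softmax(self.gate(x), dim=-1)     # [tokens, num_experts]
        top_w, top_idx = scores.max(dim=-1)              # pick one expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):        # only selected experts run
            mask = top_idx == e
            if mask.any():
                out[mask] = top_w[mask, None] * expert(x[mask])
        return out

tokens = torch.randn(8, 64)
print(TinyMoE()(tokens).shape)                           # torch.Size([8, 64])
```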

Enhancing MoE Training and Inference

MoESys addresses several system-level challenges in training and serving MoE models. For training, it adopts an Elastic MoE strategy that improves load balancing and overlaps communication with computation through 2D prefetch scheduling and fused communication over hierarchical storage, enabling efficient parallelism. For inference, particularly when a model exceeds GPU memory capacity, MoESys organizes CPU and GPU memory jointly into a ring of sections that holds the model and cycles computation across these sections in a round-robin manner, sidestepping the memory ceiling imposed by a single GPU.
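As a rough illustration of the round-robin idea (not the MoESys API; `run_inference`, `gpu_slots`, and the toy shards are hypothetical), the sketch below partitions a model into sections, keeps a small "GPU-resident" window of them, and cycles computation and refills through the ring:

```python
# Conceptual sketch of the ring-of-sections inference idea. Names such as
# run_inference and gpu_slots are hypothetical; a real system would overlap
# host-to-device copies with computation on actual parameter shards.
from collections import deque

def run_inference(sections, gpu_slots=2):
    """sections: list of parameter shards; gpu_slots: how many fit on the GPU."""
    ring = deque(range(len(sections)))                   # round-robin order over sections
    resident = [ring.popleft()                           # pretend these are "on the GPU"
                for _ in range(min(gpu_slots, len(sections)))]
    result = 0.0
    while resident:
        sec = resident.pop(0)                            # compute with the oldest resident section
        result += sum(sections[sec])                     # stand-in for the real expert computation
        if ring:                                         # refill the freed slot from the ring
            resident.append(ring.popleft())
    return result

shards = [[0.1] * 4, [0.2] * 4, [0.3] * 4, [0.4] * 4]    # toy "model" split into 4 sections
print(run_inference(shards))                             # ~4.0, sum over all sections
```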

Empirical Verification

Extensive experiments show that MoESys outperforms existing systems such as DeepSpeed. It trained an MoE-based Unified Feature Optimization (UFO) model with 12 billion parameters in 8 days on 48 A100 GPUs, while delivering roughly 33% higher training throughput and 13% higher inference throughput (tokens per second) than DeepSpeed. Notably, under unbalanced workloads, a common situation in multi-task learning, MoESys achieved 64% higher throughput with an 18% smaller memory footprint.

Future Perspectives

The paper's contributions to MoE training and inference represent a meaningful advance in machine learning infrastructure and point toward more efficient, resource-aware, and scalable MoE systems. The MoESys framework, which the authors plan to release publicly, makes training extremely large models more feasible while keeping energy efficiency and environmental impact in view. The work also opens the door to further optimizations that strengthen sparsely activated networks across a variety of machine learning tasks, pushing current models further in size, speed, and efficiency.

Authors (8)
  1. Liang Shen (26 papers)
  2. Hongxiang Hao (6 papers)
  3. HuaChao Wu (1 paper)
  4. Jiang Bian (229 papers)
  5. Haoyi Xiong (98 papers)
  6. Dianhai Yu (37 papers)
  7. Weibao Gong (5 papers)
  8. Lirong Dai (31 papers)
Citations (24)