Overview of MegaScale-MoE
MegaScale-MoE represents a significant advance in the efficient training of large-scale mixture-of-experts (MoE) models in production environments. The paper describes the system design choices made to overcome the inefficiencies typically encountered when training MoE models. As MoE models grow in scale and complexity, communication bottlenecks increasingly dominate training time. MegaScale-MoE addresses these challenges with three strategies: communication-efficient parallelism, overlapping communication with computation, and communication compression.
Quantitatively, MegaScale-MoE sustains a training throughput of 1.41 million tokens per second when training a 352-billion-parameter MoE model on 1,440 NVIDIA Hopper GPUs, a 1.88 times improvement over the state-of-the-art Megatron-LM framework. These results underscore MegaScale-MoE's ability to scale training efficiently even at very large parameter counts and GPU counts.
Communication-Efficient Parallelism
The paper introduces customized parallelism strategies that substantially reduce communication overhead, a key obstacle in MoE training. Specifically, MegaScale-MoE applies sequence parallelism (SP) to the attention modules and expert parallelism (EP) to the feed-forward networks, in place of the tensor parallelism (TP) used in prior systems. This tailored mapping reduces communication volume while preserving GEMM efficiency, because it aligns the partitioning with the natural architectural structure of MoE models. A simplified sketch of the resulting per-layer communication pattern follows below.
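The sketch below illustrates, with torch.distributed collectives, the per-layer communication pattern that SP-for-attention plus EP-for-FFN implies. It is an assumption-laden sketch, not MegaScale-MoE's implementation: the process-group setup, tensor shapes, gathered-K/V formulation of sequence-parallel attention, and balanced routing (equal all-to-all splits) are all assumptions made for illustration.

```python
import torch
import torch.distributed as dist

def sp_attention(q, k, v, sp_group):
    """q, k, v: [local_seq, heads, head_dim] shards of the full sequence."""
    world = dist.get_world_size(sp_group)
    # Attention weights are replicated under SP, so only activations move:
    # gather K/V so each rank's local queries attend over the full sequence.
    k_full = torch.empty((k.shape[0] * world, *k.shape[1:]),
                         dtype=k.dtype, device=k.device)
    v_full = torch.empty_like(k_full)
    dist.all_gather_into_tensor(k_full, k.contiguous(), group=sp_group)
    dist.all_gather_into_tensor(v_full, v.contiguous(), group=sp_group)
    scores = torch.einsum("qhd,khd->hqk", q, k_full) / (q.shape[-1] ** 0.5)
    return torch.einsum("hqk,khd->qhd", scores.softmax(dim=-1), v_full)

def ep_ffn(tokens, local_experts, ep_group):
    """tokens: [local_tokens, hidden], already permuted by destination expert rank."""
    # Two all-to-alls bracket the local expert computation: one to dispatch
    # tokens to the ranks hosting their routed experts, one to bring results back.
    dispatched = torch.empty_like(tokens)
    dist.all_to_all_single(dispatched, tokens, group=ep_group)   # dispatch
    out = local_experts(dispatched)                              # local experts only
    combined = torch.empty_like(out)
    dist.all_to_all_single(combined, out, group=ep_group)        # combine
    return combined
```

Keeping expert weights whole on their host ranks (rather than TP-sharding them) is what allows the FFN GEMMs to stay large while the only traffic is the routed activations.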
Compared with conventional TP-based designs, this layout cuts communication volume considerably. In particular, MegaScale-MoE's expert parallelism adapts its communication strategy to the number of experts, minimizing the latency of token dispatch and aggregation; a rough sense of the volumes involved is given below.
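For a sense of the quantities involved, the back-of-envelope helper below estimates the per-token bytes moved by EP dispatch and combine in one MoE layer. The hidden size, top-k value, and BF16 activations in the example are illustrative assumptions, not figures from the paper.

```python
def ep_bytes_per_token(hidden_size: int, top_k: int, bytes_per_elem: int = 2) -> int:
    """Per-token communication volume of one MoE layer's EP all-to-alls.

    Assumes each token is routed to top_k experts, sent once on dispatch and
    once on combine, with activations in BF16 (2 bytes per element).
    """
    return 2 * top_k * hidden_size * bytes_per_elem

# Example: hidden_size=8192, top_k=2 -> 65,536 bytes (~64 KiB) per token per layer.
print(ep_bytes_per_token(8192, 2))
```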
Communication-Computation Overlap
To hide the idle time caused by communication, MegaScale-MoE employs both inter- and intra-operator overlapping. Inter-operator overlap reorganizes execution into macro modules so that communication operations are hidden behind adjacent computation. Intra-operator overlap decomposes individual operators into tiles, allowing communication and computation kernels to run concurrently. With these scheduling strategies, MegaScale-MoE shows that communication overhead can be largely hidden or even eliminated, keeping GPUs productively occupied throughout training.
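The sketch below illustrates the intra-operator idea at the framework level rather than the kernel/tile level the paper describes: tokens are split into chunks, each chunk's dispatch all-to-all is launched asynchronously, and expert computation on one chunk overlaps with the communication of the others. The NCCL process group, chunk count, and balanced routing are assumptions.

```python
import torch
import torch.distributed as dist

def overlapped_ep_dispatch(tokens, local_experts, ep_group, num_chunks=4):
    """Overlap dispatch all-to-alls with expert computation, chunk by chunk."""
    chunks = tokens.chunk(num_chunks, dim=0)

    # Launch every chunk's all-to-all without blocking the host (async_op=True).
    buffers, handles = [], []
    for c in chunks:
        buf = torch.empty_like(c)
        handles.append(
            dist.all_to_all_single(buf, c.contiguous(), group=ep_group, async_op=True))
        buffers.append(buf)

    # Consume chunks in order: while chunk i's expert GEMMs run, the
    # all-to-alls for chunks i+1..N are still in flight on the network.
    outputs = []
    for buf, handle in zip(buffers, handles):
        handle.wait()  # returns once this chunk's tokens have arrived
        outputs.append(local_experts(buf))
    return torch.cat(outputs, dim=0)
```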
Communication Compression
MegaScale-MoE further reduces communication overhead through compression, most notably via mixed-precision formats. In BF16 training, gradients are synchronized at reduced precision without sacrificing convergence stability. For FP8 training, careful quantization strategies keep results aligned with BF16 baselines while cutting communication volume even further.
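A minimal sketch of the reduced-precision synchronization idea is shown below, assuming FP32 master gradients, an initialized process group, and a plain all-reduce path. Whether this mirrors MegaScale-MoE's exact synchronization pipeline is an assumption; the point is only that casting gradients to BF16 before the collective halves the bytes on the wire relative to FP32.

```python
import torch
import torch.distributed as dist

def sync_grads_bf16(params, group=None):
    """All-reduce gradients in BF16 to halve communication volume vs. FP32."""
    world = dist.get_world_size(group)
    for p in params:
        if p.grad is None:
            continue
        g16 = p.grad.to(torch.bfloat16)              # compress before communication
        dist.all_reduce(g16, group=group)            # sum across data-parallel ranks
        p.grad.copy_(g16.to(p.grad.dtype) / world)   # average, back to full precision
```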
Implications and Future Directions
MegaScale-MoE sets a notable precedent in addressing the scalability challenges intrinsic to MoE training. Practically, it delivers substantial resource and time savings for large-scale production environments whose computational demands keep growing. Conceptually, it deepens our understanding of distributed training architectures and where their hardware limits lie, and it points toward training even larger models, beyond the trillion-parameter scale, despite current constraints on GPU interconnect bandwidth.
As the operational experience shared in the paper suggests, MegaScale-MoE also paves the way for automatic scheduling of computation and communication, which would further improve the efficiency and adaptability of massive model training systems. Future work could integrate learning-based optimization into these overlap decisions, dynamically adjusting schedules based on system feedback as hardware evolves. Additionally, algorithm-system co-design could further improve hardware utilization, closing the gap between peak hardware capability and the utilization realized in LLM training.