MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
The paper "MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism" addresses critical challenges in the efficient large-scale serving of Mixture-of-Experts (MoE) models, focusing on enhancing GPU utilization and reducing operational costs. The MoE archutecture, which dynamically routes input tokens to a subset of feed-forward networks (FFNs), traditionally shifts from compute-intensive to memory-intensive during inference, thereby impacting resource efficiency.
Key Contributions
- Disaggregated Expert Parallelism: MegaScale-Infer disaggregates the attention and FFN (expert) modules within each model layer, deploying them on separate groups of GPUs. This separation allows each module to adopt its own parallelism and scaling strategy: attention nodes handle the memory-intensive, KV-cache-dominated computation, while expert nodes aggregate tokens from many attention nodes so that the FFN computation remains compute-intensive.
- Ping-Pong Pipeline Parallelism: The system splits each request batch into micro-batches that alternate between attention and FFN computation, so that while one micro-batch occupies the expert GPUs the other occupies the attention GPUs. This overlap hides the communication between the two module types and minimizes idle time, keeping GPU utilization high (a simplified sketch of this scheduling pattern follows this list).
- Custom M2N Communication Library: A high-performance communication library moves tokens between the disaggregated attention and expert nodes, eliminating unnecessary GPU-to-CPU data transfers and synchronization delays and thereby substantially reducing communication latency and per-transfer overhead.
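The following is a minimal sketch of the ping-pong scheduling idea for two micro-batches, written in plain Python with placeholder stage functions; the function names, timings, and micro-batch count are illustrative assumptions rather than the paper's implementation. While one micro-batch runs its FFN stage, the other runs attention, so the two GPU groups stay busy in alternation.

```python
# Ping-pong pipeline sketch (hypothetical; stage functions and timings are
# placeholders, not the paper's implementation). Two micro-batches leapfrog
# between attention and FFN so neither group of devices sits idle.
from concurrent.futures import ThreadPoolExecutor
import time

def attention_stage(micro_batch, layer):
    time.sleep(0.01)  # stand-in for memory-bound attention + KV-cache access
    return micro_batch

def ffn_stage(micro_batch, layer):
    time.sleep(0.01)  # stand-in for expert FFN GEMMs on the expert nodes
    return micro_batch

def ping_pong_forward(batch, num_layers=4):
    mid = len(batch) // 2
    micro = [batch[:mid], batch[mid:]]
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Prologue: micro-batch 0 runs attention for layer 0 alone.
        micro[0] = attention_stage(micro[0], 0)
        for layer in range(num_layers):
            # Micro-batch 0 runs FFN while micro-batch 1 runs attention (same layer).
            f0 = pool.submit(ffn_stage, micro[0], layer)
            a1 = pool.submit(attention_stage, micro[1], layer)
            micro[0], micro[1] = f0.result(), a1.result()
            if layer + 1 < num_layers:
                # Micro-batch 0 advances to the next layer's attention while
                # micro-batch 1 runs FFN for the current layer.
                a0 = pool.submit(attention_stage, micro[0], layer + 1)
                f1 = pool.submit(ffn_stage, micro[1], layer)
                micro[0], micro[1] = a0.result(), f1.result()
            else:
                # Epilogue: micro-batch 1 finishes its last FFN alone.
                micro[1] = ffn_stage(micro[1], layer)
    return micro[0] + micro[1]

print(ping_pong_forward(list(range(8))))
```

In the real deployment the two stages run on disjoint groups of GPUs connected by the M2N library, and micro-batch sizing is tuned so that the token exchange between stages stays hidden behind compute.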
Experimental Results
MegaScale-Infer demonstrates substantial performance improvements over state-of-the-art LLM serving systems, achieving up to 1.90× higher per-GPU throughput and up to 1.7× higher throughput per dollar in a heterogeneous deployment. The custom M2N communication library also delivers 4.2× higher throughput and 68.2% lower latency than existing communication libraries such as NCCL.
Discussion and Implications
MegaScale-Infer serves large-scale MoE models markedly more efficiently than existing systems. By optimizing the attention and FFN modules separately and exploiting heterogeneous deployment, it shows how architecture-specific optimizations translate into concrete cost-performance gains, offering both a reference point for research on efficient model serving and practical improvements for applications built on MoE architectures.
Future work may build on MegaScale-Infer's disaggregation and pipelining strategies to further refine resource allocation and serving efficiency. Continued advances in model parallelism and communication promise additional optimization headroom, offering useful guidance on scaling AI infrastructure under cost constraints.