
MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism (2504.02263v3)

Published 3 Apr 2025 in cs.DC and cs.LG

Abstract: Mixture-of-Experts (MoE) showcases tremendous potential to scale LLMs with enhanced performance and reduced computational complexity. However, its sparsely activated architecture shifts feed-forward networks (FFNs) from being compute-intensive to memory-intensive during inference, leading to substantially lower GPU utilization and increased operational costs. We present MegaScale-Infer, an efficient and cost-effective system for serving large-scale MoE models. MegaScale-Infer disaggregates attention and FFN modules within each model layer, enabling independent scaling, tailored parallelism strategies, and heterogeneous deployment for both modules. To fully exploit disaggregation in the presence of MoE's sparsity, MegaScale-Infer introduces ping-pong pipeline parallelism, which partitions a request batch into micro-batches and shuttles them between attention and FFNs for inference. Combined with distinct model parallelism for each module, MegaScale-Infer effectively hides communication overhead and maximizes GPU utilization. To adapt to disaggregated attention and FFN modules and minimize data transmission overhead (e.g., token dispatch), MegaScale-Infer provides a high-performance M2N communication library that eliminates unnecessary GPU-to-CPU data copies, group initialization overhead, and GPU synchronization. Experimental results indicate that MegaScale-Infer achieves up to 1.90x higher per-GPU throughput than state-of-the-art solutions.

Authors (20)
  1. Ruidong Zhu (2 papers)
  2. Ziheng Jiang (23 papers)
  3. Chao Jin (30 papers)
  4. Peng Wu (119 papers)
  5. Cesar A. Stuardo (1 paper)
  6. Dongyang Wang (27 papers)
  7. Xinlei Zhang (10 papers)
  8. Huaping Zhou (2 papers)
  9. Haoran Wei (55 papers)
  10. Yang Cheng (50 papers)
  11. Jianzhe Xiao (1 paper)
  12. Xinyi Zhang (88 papers)
  13. Lingjun Liu (13 papers)
  14. Haibin Lin (35 papers)
  15. Li-Wen Chang (8 papers)
  16. Jianxi Ye (6 papers)
  17. Xiao Yu (66 papers)
  18. Xuanzhe Liu (59 papers)
  19. Xin Jin (285 papers)
  20. Xin Liu (820 papers)

Summary

MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism

The paper "MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism" addresses critical challenges in serving Mixture-of-Experts (MoE) models efficiently at scale, focusing on raising GPU utilization and reducing operational costs. The MoE architecture dynamically routes each input token to a small subset of expert feed-forward networks (FFNs); because only a fraction of the experts is activated per token, FFN computation shifts from compute-intensive to memory-intensive during inference, which degrades resource efficiency.
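To make the sparsity argument concrete, the sketch below (not from the paper; a minimal illustration assuming a standard top-k softmax router) shows how a batch of tokens fans out across experts: each expert ends up processing only a small slice of the batch, so its GEMMs become thin and memory-bandwidth-bound rather than compute-bound.

```python
# Minimal top-k MoE routing sketch (illustrative, not the paper's implementation).
# With top-2 routing over 16 experts, each expert sees only ~1/8 of the tokens
# on average, so its FFN GEMMs operate on small matrices during decoding.
import torch

num_tokens, hidden, num_experts, top_k = 256, 4096, 16, 2

tokens = torch.randn(num_tokens, hidden)
router = torch.nn.Linear(hidden, num_experts, bias=False)

# Router scores -> pick the top-k experts per token.
scores = router(tokens).softmax(dim=-1)              # [tokens, experts]
topk_scores, topk_ids = scores.topk(top_k, dim=-1)   # [tokens, top_k]

# Count how many tokens each expert actually receives.
tokens_per_expert = torch.bincount(topk_ids.flatten(), minlength=num_experts)
print(tokens_per_expert.tolist())
# Average is num_tokens * top_k / num_experts = 32 tokens per expert:
# far smaller than the per-expert batch needed to keep an FFN GEMM compute-bound.
```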

Key Contributions

  1. Disaggregated Expert Parallelism: MegaScale-Infer disaggregates the attention and FFN modules within each model layer. This separation enables independent scaling, tailored parallelism strategies, and heterogeneous hardware deployment for each module, matching the memory-intensive attention computation and the FFN computation, which aggregating tokens from many attention nodes restores to being compute-intensive.
  2. Ping-Pong Pipeline Parallelism: The system partitions a request batch into micro-batches and shuttles them between the attention and FFN modules, so that one micro-batch's token dispatch, expert computation, and combine overlap with another micro-batch's attention computation. This keeps both modules busy, hides communication overhead, and maximizes GPU throughput (a toy schedule is sketched after this list).
  3. Custom M2N Communication Library: A high-performance M2N communication library routes tokens between the disaggregated modules while eliminating unnecessary GPU-to-CPU data copies, group initialization overhead, and GPU synchronization, significantly reducing communication latency and operational overhead.
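The toy simulation below (not from the paper; all stage costs are invented for illustration) sketches the ping-pong schedule: with two micro-batches alternating between an attention stage and an FFN stage, each module works on one micro-batch while the other micro-batch is in transit or being processed by the other module.

```python
# Toy timeline simulation of ping-pong pipeline parallelism (illustrative only;
# stage costs are arbitrary, not measured from MegaScale-Infer).
T_ATTN, T_COMM, T_FFN = 1.0, 0.3, 1.0   # assumed per-micro-batch costs (arbitrary units)
NUM_LAYERS, NUM_MICRO_BATCHES = 3, 2

attn_free = 0.0   # time at which the attention nodes are next idle
ffn_free = 0.0    # time at which the FFN nodes are next idle
ready = {mb: 0.0 for mb in range(NUM_MICRO_BATCHES)}  # when each micro-batch can start a layer

for layer in range(NUM_LAYERS):
    for mb in range(NUM_MICRO_BATCHES):
        # Attention starts when both this micro-batch and the attention nodes are free.
        attn_start = max(ready[mb], attn_free)
        attn_free = attn_start + T_ATTN
        # Tokens are dispatched (M2N) to the expert nodes, processed, and combined back (N2M).
        ffn_start = max(attn_free + T_COMM, ffn_free)
        ffn_free = ffn_start + T_FFN
        ready[mb] = ffn_free + T_COMM
        print(f"layer {layer} mb {mb}: attn {attn_start:.1f}-{attn_free:.1f}, "
              f"ffn {ffn_start:.1f}-{ffn_free:.1f}")

# The printed timeline shows attention and FFN nodes working on alternating
# micro-batches instead of idling while the other module runs.
```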

Experimental Results

MegaScale-Infer demonstrates substantial performance improvements over state-of-the-art LLM serving systems, achieving up to 1.90× higher per-GPU throughput and 1.7× higher throughput per dollar in a heterogeneous deployment. The custom M2N communication library further delivers 4.2× higher throughput and a 68.2% reduction in latency compared to existing communication libraries such as NCCL.

Discussion and Implications

The deployment of MegaScale-Infer highlights distinct advantages in serving large-scale MoE models more efficiently than existing methods. By separately optimizing for attention and FFN modules and utilizing heterogeneous deployment strategies, MegaScale-Infer demonstrates how architecture-specific optimizations can lead to significant cost-performance benefits. This contributes to the theoretical discourse on efficient deep learning model serving and presents practical improvements for AI applications reliant on MoE architectures.

Future developments in this area may leverage MegaScale-Infer's strategies to further refine resource allocation and model serving efficiency. Continuous advances in model parallelism and communication strategies promise even greater optimization potential, lending valuable insights into how AI infrastructure can scale effectively while managing cost constraints.