MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism
The paper "MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism" addresses critical challenges in the efficient large-scale serving of Mixture-of-Experts (MoE) models, focusing on enhancing GPU utilization and reducing operational costs. The MoE archutecture, which dynamically routes input tokens to a subset of feed-forward networks (FFNs), traditionally shifts from compute-intensive to memory-intensive during inference, thereby impacting resource efficiency.
Key Contributions
- Disaggregated Expert Parallelism: MegaScale-Infer disaggregates the attention and FFN (expert) modules within each model layer, deploying them on separate groups of GPUs. This separation allows each module to adopt its own parallelism and scaling strategy: attention nodes handle the memory-intensive, KV-cache-dominated computation, while expert nodes aggregate tokens from many attention nodes so that the FFN computation remains compute-intensive.
- Ping-Pong Pipeline Parallelism: The system splits each request batch into micro-batches that alternate between attention and FFN computation, so that while one micro-batch occupies the expert GPUs the other occupies the attention GPUs. This overlap hides the communication between the two module types and minimizes idle time, keeping GPU utilization high (a simplified sketch of this scheduling pattern follows this list).
- Custom M2N Communication Library: A high-performance communication library moves tokens between the disaggregated attention and expert nodes, eliminating unnecessary GPU-to-CPU data transfers and synchronization delays and thereby substantially reducing communication latency and per-transfer overhead.
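The following is a minimal sketch of the ping-pong scheduling idea for two micro-batches, written in plain Python with placeholder stage functions; the function names, timings, and micro-batch count are illustrative assumptions rather than the paper's implementation. While one micro-batch runs its FFN stage, the other runs attention, so the two GPU groups stay busy in alternation.

```python
# Ping-pong pipeline sketch (hypothetical; stage functions and timings are
# placeholders, not the paper's implementation). Two micro-batches leapfrog
# between attention and FFN so neither group of devices sits idle.
from concurrent.futures import ThreadPoolExecutor
import time

def attention_stage(micro_batch, layer):
    time.sleep(0.01)  # stand-in for memory-bound attention + KV-cache access
    return micro_batch

def ffn_stage(micro_batch, layer):
    time.sleep(0.01)  # stand-in for expert FFN GEMMs on the expert nodes
    return micro_batch

def ping_pong_forward(batch, num_layers=4):
    mid = len(batch) // 2
    micro = [batch[:mid], batch[mid:]]
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Prologue: micro-batch 0 runs attention for layer 0 alone.
        micro[0] = attention_stage(micro[0], 0)
        for layer in range(num_layers):
            # Micro-batch 0 runs FFN while micro-batch 1 runs attention (same layer).
            f0 = pool.submit(ffn_stage, micro[0], layer)
            a1 = pool.submit(attention_stage, micro[1], layer)
            micro[0], micro[1] = f0.result(), a1.result()
            if layer + 1 < num_layers:
                # Micro-batch 0 advances to the next layer's attention while
                # micro-batch 1 runs FFN for the current layer.
                a0 = pool.submit(attention_stage, micro[0], layer + 1)
                f1 = pool.submit(ffn_stage, micro[1], layer)
                micro[0], micro[1] = a0.result(), f1.result()
            else:
                # Epilogue: micro-batch 1 finishes its last FFN alone.
                micro[1] = ffn_stage(micro[1], layer)
    return micro[0] + micro[1]

print(ping_pong_forward(list(range(8))))
```

In the real deployment the two stages run on disjoint groups of GPUs connected by the M2N library, and micro-batch sizing is tuned so that the token exchange between stages stays hidden behind compute.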
Experimental Results
MegaScale-Infer demonstrates substantial performance improvements over state-of-the-art LLM serving systems, achieving up to 1.90× higher per-GPU throughput and up to 1.7× higher throughput per dollar in a heterogeneous deployment. The custom M2N communication library also delivers 4.2× higher throughput and 68.2% lower latency than existing communication libraries such as NCCL.
Discussion and Implications
MegaScale-Infer serves large-scale MoE models markedly more efficiently than existing systems. By optimizing the attention and FFN modules separately and exploiting heterogeneous deployment, it shows how architecture-specific optimizations translate into concrete cost-performance gains, offering both a reference point for research on efficient model serving and practical improvements for applications built on MoE architectures.
Future work may build on MegaScale-Infer's disaggregation and pipelining strategies to further refine resource allocation and serving efficiency. Continued advances in model parallelism and communication promise additional optimization headroom, offering useful guidance on scaling AI infrastructure under cost constraints.