Speculative MoE: Communication Efficient Parallel MoE Inference with Speculative Token and Expert Pre-scheduling (2503.04398v3)

Published 6 Mar 2025 in cs.LG, cs.AI, and cs.DC

Abstract: MoE (Mixture of Experts) prevails as a neural architecture that can scale modern transformer-based LLMs to unprecedented sizes. Nevertheless, large MoEs' heavy demands on computing power, memory capacity, and memory bandwidth make scalable serving a fundamental challenge, and efficient parallel inference has become a requisite for attaining adequate throughput under latency constraints. DeepSpeed-MoE, a state-of-the-art MoE inference framework, adopts a 3D-parallel paradigm comprising EP (Expert Parallelism), TP (Tensor Parallelism), and DP (Data Parallelism). However, our analysis shows that DeepSpeed-MoE's inference efficiency is largely bottlenecked by EP, which is implemented with costly all-to-all collectives that route token activations. Our work aims to boost DeepSpeed-MoE by strategically reducing EP's communication overhead with a technique named Speculative MoE. Speculative MoE has two speculative parallelization schemes, speculative token shuffling and speculative expert grouping, which predict outstanding tokens' expert routing paths and pre-schedule tokens and experts across devices to losslessly trim EP's communication volume. Besides DeepSpeed-MoE, we also build Speculative MoE into SGLang, a prevailing MoE inference engine. Experiments show that Speculative MoE can significantly boost state-of-the-art MoE inference frameworks on both fast homogeneous and slow heterogeneous interconnects.
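
The abstract does not spell out the mechanics, but the intuition behind speculative token shuffling can be illustrated with a toy sketch: guess each token's expert before the gate actually runs, pre-place the token on the device hosting that predicted expert, and let the real all-to-all move only the mispredicted tokens. Everything below (the expert-to-device layout, the 80% predictor accuracy, the problem sizes) is an illustrative assumption, not a detail taken from the paper.

```python
import numpy as np

# Toy model of expert-parallel (EP) routing: experts are sharded across
# devices, and tokens must reach the device hosting their assigned expert.
rng = np.random.default_rng(0)
NUM_TOKENS, NUM_EXPERTS, NUM_DEVICES = 1024, 16, 4
EXPERTS_PER_DEVICE = NUM_EXPERTS // NUM_DEVICES

def device_of(expert_ids: np.ndarray) -> np.ndarray:
    """Map expert ids to the EP device that hosts them (assumed contiguous layout)."""
    return expert_ids // EXPERTS_PER_DEVICE

# True routing decided by the gate at runtime (unknown before the MoE layer).
true_experts = rng.integers(0, NUM_EXPERTS, size=NUM_TOKENS)

# Hypothetical speculative predictor that agrees with the gate ~80% of the time.
predicted_experts = true_experts.copy()
miss = rng.random(NUM_TOKENS) > 0.8
predicted_experts[miss] = rng.integers(0, NUM_EXPERTS, size=int(miss.sum()))

# Baseline EP: tokens start spread uniformly across devices, so almost every
# token crosses the interconnect in the all-to-all.
initial_device = rng.integers(0, NUM_DEVICES, size=NUM_TOKENS)
baseline_moved = int(np.sum(initial_device != device_of(true_experts)))

# Speculative token shuffling: pre-place each token on the device of its
# *predicted* expert; only mispredicted tokens still move in the real all-to-all.
speculative_device = device_of(predicted_experts)
speculative_moved = int(np.sum(speculative_device != device_of(true_experts)))

print(f"baseline all-to-all volume:    {baseline_moved} / {NUM_TOKENS} tokens")
print(f"speculative all-to-all volume: {speculative_moved} / {NUM_TOKENS} tokens")
```

Speculative expert grouping would attack the same communication volume from the other side, co-locating experts that are expected to be activated together; the paper's actual prediction and scheduling algorithms are not reproduced in this sketch.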

Authors (7)
  1. Yan Li (505 papers)
  2. Pengfei Zheng (11 papers)
  3. Shuang Chen (46 papers)
  4. Zewei Xu (8 papers)
  5. Yunfei Du (7 papers)
  6. Zhengang Wang (1 paper)
  7. Yuanhao Lai (3 papers)
