
AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference (2408.10284v1)

Published 19 Aug 2024 in cs.LG

Abstract: Mixture-of-Experts (MoE) models are designed to enhance the efficiency of LLMs without a proportional increase in computational demand. However, their deployment on edge devices still faces significant challenges due to the high on-demand loading overhead of managing sparsely activated experts. This paper introduces AdapMoE, an algorithm-system co-design framework for efficient MoE inference. AdapMoE features adaptive expert gating and management to reduce the on-demand loading overhead. We observe heterogeneity in expert loading across layers and tokens, and based on this we propose a sensitivity-based strategy that dynamically adjusts the number of activated experts. We also integrate advanced prefetching and cache management techniques to further reduce loading latency. Through comprehensive evaluations on various platforms, we demonstrate that AdapMoE consistently outperforms existing techniques, reducing the average number of activated experts by 25% and achieving a 1.35x speedup without accuracy degradation. Code is available at: https://github.com/PKU-SEC-Lab/AdapMoE.
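The abstract does not spell out the sensitivity metric used for gating, so the following is only a minimal sketch of the general idea of per-token adaptive expert counts: instead of a fixed top-k, each token activates only as many experts as its router distribution warrants. The cumulative-probability threshold, the function name, and the parameters below are illustrative assumptions, not the paper's actual criterion or API (see the repository linked above for the authors' implementation).

```python
# Hedged sketch of adaptive expert gating (NOT the AdapMoE implementation):
# a token whose router distribution is "peaky" activates fewer experts,
# reducing how many experts must be loaded on demand.
import torch


def adaptive_expert_gating(router_logits: torch.Tensor,
                           threshold: float = 0.9,
                           max_experts: int = 4):
    """Select a variable number of experts per token.

    router_logits: [num_tokens, num_experts] raw gating scores.
    threshold:     cumulative gating probability to cover (hypothetical
                   stand-in for the paper's sensitivity-based criterion).
    max_experts:   upper bound on experts activated per token.
    Returns a list of (expert indices, renormalized weights) per token.
    """
    probs = torch.softmax(router_logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, dim=-1, descending=True)

    selections = []
    for tok_probs, tok_idx in zip(sorted_probs, sorted_idx):
        cum = torch.cumsum(tok_probs, dim=0)
        # Smallest k whose cumulative probability reaches the threshold.
        k = int((cum < threshold).sum().item()) + 1
        k = min(k, max_experts)
        experts = tok_idx[:k]
        weights = tok_probs[:k] / tok_probs[:k].sum()  # renormalize over chosen experts
        selections.append((experts, weights))
    return selections


# Example: 3 tokens routed over 8 experts.
logits = torch.randn(3, 8)
for experts, weights in adaptive_expert_gating(logits):
    print(experts.tolist(), [round(w, 3) for w in weights.tolist()])
```

Under this (assumed) cutoff rule, tokens whose routing mass concentrates on one or two experts load fewer expert weights, which is the kind of per-token/per-layer adaptivity the abstract describes; the paper additionally combines this with prefetching and cache management on the system side.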

Authors (6)
  1. Shuzhang Zhong (5 papers)
  2. Ling Liang (41 papers)
  3. Yuan Wang (251 papers)
  4. Runsheng Wang (49 papers)
  5. Ru Huang (52 papers)
  6. Meng Li (244 papers)
Citations (3)