A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts (2405.16646v3)

Published 26 May 2024 in cs.LG

Abstract: The sparsely gated mixture-of-experts (MoE) architecture sends different inputs to different subnetworks, i.e., experts, through trainable routers. MoE reduces the training computation significantly for large models, but its deployment can still be memory- or computation-expensive for some downstream tasks. Model pruning is a popular approach to reducing inference computation, but its application to the MoE architecture is largely unexplored. To the best of our knowledge, this paper provides the first provably efficient technique for pruning experts in fine-tuned MoE models. We theoretically prove that prioritizing the pruning of experts with a smaller change in the router's l2 norm from the pretrained model guarantees the preservation of test accuracy, while significantly reducing the model size and the computational requirements. Although our theoretical analysis is centered on binary classification tasks with a simplified MoE architecture, our expert pruning method is verified on large vision MoE models such as VMoE and E3MoE fine-tuned on benchmark datasets such as CIFAR10, CIFAR100, and ImageNet.
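
To make the pruning criterion described in the abstract concrete, the sketch below ranks experts by the l2 norm of the change in their router weights between the pretrained and fine-tuned models, pruning those that changed the least. The function names, the (num_experts, hidden_dim) layout of the router weight matrix, and the reading of "change of the router's l2 norm" as a per-expert norm of the weight difference are illustrative assumptions, not the paper's released implementation.

```python
import torch

def rank_experts_for_pruning(router_pretrained: torch.Tensor,
                             router_finetuned: torch.Tensor) -> torch.Tensor:
    """Rank experts by the l2 norm of their router-weight change during fine-tuning.

    Both tensors are assumed to have shape (num_experts, hidden_dim), one row per
    expert. Experts whose router weights moved the least come first, i.e., they are
    the first candidates for pruning under the criterion stated in the abstract.
    """
    # Per-expert l2 norm of the change from the pretrained router weights.
    change = torch.linalg.vector_norm(router_finetuned - router_pretrained, dim=1)
    return torch.argsort(change)  # ascending: smallest change first

def experts_to_keep(router_pretrained: torch.Tensor,
                    router_finetuned: torch.Tensor,
                    k: int) -> torch.Tensor:
    """Boolean mask over experts, retaining the k whose routers changed the most."""
    order = rank_experts_for_pruning(router_pretrained, router_finetuned)
    mask = torch.zeros(router_pretrained.shape[0], dtype=torch.bool)
    mask[order[-k:]] = True
    return mask
```

Under this sketch, pruning to a target budget amounts to computing the mask once after fine-tuning and dropping the experts (and their router rows) where the mask is False; no retraining step is assumed here.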

Authors (6)
  1. Mohammed Nowaz Rabbani Chowdhury (2 papers)
  2. Meng Wang (1063 papers)
  3. Kaoutar El Maghraoui (12 papers)
  4. Naigang Wang (15 papers)
  5. Pin-Yu Chen (311 papers)
  6. Christopher Carothers (1 paper)
Citations (3)