Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models (2402.14800v2)

Published 22 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: A pivotal advancement in the progress of LLMs is the emergence of the Mixture-of-Experts (MoE) LLMs. Compared to traditional LLMs, MoE LLMs can achieve higher performance with fewer parameters, but it is still hard to deploy them due to their immense parameter sizes. Different from previous weight pruning methods that rely on specifically designed hardware, this paper mainly aims to enhance the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques. Specifically, we propose, for the first time to our best knowledge, post-training approaches for task-agnostic and task-specific expert pruning and skipping of MoE LLMs, tailored to improve deployment efficiency while maintaining model performance across a wide range of tasks. Extensive experiments show that our proposed methods can simultaneously reduce model sizes and increase the inference speed, while maintaining satisfactory performance. Data and code will be available at https://github.com/Lucky-Lance/Expert_Sparsity.

Overview of Efficient Expert Pruning and Skipping in MoE LLMs

The paper "Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts LLMs" presents a novel approach to enhance the deployment efficiency of Mixture-of-Experts (MoE) LLMs. MoE LLMs have shown promise due to their ability to achieve high performance with fewer parameters compared to dense models. However, their substantial parameter sizes pose challenges for practical deployment.

Key Contributions

  1. Expert-Level Sparsification:
    • The paper introduces expert pruning and expert skipping strategies tailored to improve both deployment efficiency and inference speed without compromising model performance.
    • These methods are designed as post-training techniques, applicable in both task-agnostic and task-specific contexts.
  2. Post-Training Expert Pruning:
    • Unlike weight pruning methods that rely on specially designed hardware, this approach permanently removes less important experts from each MoE layer, reducing the total parameter count without retraining.
    • The paper proposes a layer-wise enumeration method that ranks candidate expert subsets by the reconstruction loss of the layer's output and keeps the subset that best preserves it, allowing a substantial fraction of parameters to be pruned while maintaining competitive performance.
    • For domain-specific tasks, calibration data are drawn from related datasets to guide the choice of which experts to prune.
  3. Dynamic Expert Skipping:
    • Beyond static pruning, a dynamic method is introduced that skips low-contribution experts on a per-token basis during inference, reducing computation on the fly.
    • This dynamic approach complements expert pruning, yielding a leaner and more efficient deployment pipeline; both the enumeration-based pruning and the skipping rule are sketched in the code below.
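
To make the two ideas above concrete, the following toy PyTorch sketch illustrates the per-layer step: an MoE layer with a handful of experts, an enumeration over expert subsets that keeps the subset minimizing the reconstruction loss against the full layer's output on calibration data, and a simplified dynamic-skipping rule that drops low-weight routed experts at inference. This is a minimal sketch under assumed names (MoELayer, prune_layer, skip_threshold) and a generic top-2 router, not the authors' released implementation; in the paper the enumeration is applied layer by layer as calibration activations propagate through the model, and the exact skipping criterion is likewise defined on the routing weights.

```python
# Illustrative sketch only: (a) layer-wise expert pruning by enumerating expert
# subsets and keeping the one with the lowest reconstruction loss on calibration
# data, and (b) dynamic skipping of low-weight routed experts at inference.
import itertools

import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    """Toy top-2 MoE feed-forward layer, written for clarity rather than speed."""

    def __init__(self, d_model=64, d_ff=128, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x, keep=None, skip_threshold=0.0):
        logits = self.router(x)                       # (tokens, num_experts)
        if keep is not None:                          # mask out pruned experts
            mask = torch.full_like(logits, float("-inf"))
            mask[:, keep] = 0.0
            logits = logits + mask
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)          # renormalize routed weights
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                    # per-token loop for clarity
            for k in range(self.top_k):
                w = weights[t, k]
                # Dynamic expert skipping (simplified criterion): drop non-top-1
                # experts whose gate weight falls below a threshold.
                if k > 0 and w < skip_threshold:
                    continue
                out[t] += w * self.experts[int(idx[t, k])](x[t])
        return out


@torch.no_grad()
def prune_layer(layer, calib_x, num_keep):
    """Enumerate expert subsets of size `num_keep` and return the one whose
    output best reconstructs the full layer's output on calibration tokens."""
    reference = layer(calib_x)                        # output with all experts
    best_subset, best_loss = None, float("inf")
    for subset in itertools.combinations(range(len(layer.experts)), num_keep):
        loss = F.mse_loss(layer(calib_x, keep=list(subset)), reference).item()
        if loss < best_loss:
            best_subset, best_loss = list(subset), loss
    return best_subset, best_loss


if __name__ == "__main__":
    torch.manual_seed(0)
    layer = MoELayer()
    calib = torch.randn(32, 64)                       # stand-in calibration tokens
    keep, loss = prune_layer(layer, calib, num_keep=6)
    print(f"kept experts {keep} with reconstruction loss {loss:.4f}")
    with torch.no_grad():
        # Pruning and skipping compose: route only among the kept experts and
        # additionally skip the second routed expert when its weight is small.
        _ = layer(calib, keep=keep, skip_threshold=0.1)
```

Because the enumeration is exhaustive over expert subsets, its cost grows combinatorially with the number of experts per layer, which remains manageable for layers with eight experts such as those in Mixtral 8x7B.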

Experimental Outcomes

  • Performance Metrics:
    • Experiments on MoE LLMs such as Mixtral 8x7B demonstrate significant reductions in memory consumption and clear gains in inference speed, especially when pruning and skipping are combined.
    • The pruned model shows only a minor performance drop (approximately 2.9 points in the task-agnostic setting with two experts pruned).
  • Generation Speed:
    • The combined pruning and skipping approach achieves approximately a 1.33× inference speedup over the unmodified model while running on fewer GPUs, which also reduces inter-GPU communication overhead.

Implications

The research indicates that targeted expert pruning and dynamic skipping enable efficient use of MoE LLMs, a prerequisite for practical deployment across diverse computational environments. By addressing both task-agnostic and task-specific pruning, the paper broadens the applicability of these models beyond general language tasks to domain-specific workloads such as mathematical reasoning.

Future Directions

The paper opens avenues for integrating expert sparsification with other model optimization techniques, such as weight pruning and quantization, potentially enhancing efficiency further across varying scales of LLM architectures.

This paper contributes substantially to the understanding and deployment of sparsely-gated neural networks, potentially informing future developments in both foundational models and task-specific implementations within AI.

Authors (8)
  1. Xudong Lu
  2. Qi Liu
  3. Yuhui Xu
  4. Aojun Zhou
  5. Siyuan Huang
  6. Bo Zhang
  7. Junchi Yan
  8. Hongsheng Li