Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts Large Language Models (2402.14800v2)

Published 22 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: A pivotal advancement in the progress of LLMs is the emergence of the Mixture-of-Experts (MoE) LLMs. Compared to traditional LLMs, MoE LLMs can achieve higher performance with fewer parameters, but it is still hard to deploy them due to their immense parameter sizes. Different from previous weight pruning methods that rely on specifically designed hardware, this paper mainly aims to enhance the deployment efficiency of MoE LLMs by introducing plug-and-play expert-level sparsification techniques. Specifically, we propose, for the first time to our best knowledge, post-training approaches for task-agnostic and task-specific expert pruning and skipping of MoE LLMs, tailored to improve deployment efficiency while maintaining model performance across a wide range of tasks. Extensive experiments show that our proposed methods can simultaneously reduce model sizes and increase the inference speed, while maintaining satisfactory performance. Data and code will be available at https://github.com/Lucky-Lance/Expert_Sparsity.

Overview of Efficient Expert Pruning and Skipping in MoE LLMs

The paper "Not All Experts are Equal: Efficient Expert Pruning and Skipping for Mixture-of-Experts LLMs" presents a novel approach to enhance the deployment efficiency of Mixture-of-Experts (MoE) LLMs. MoE LLMs have shown promise due to their ability to achieve high performance with fewer parameters compared to dense models. However, their substantial parameter sizes pose challenges for practical deployment.

Key Contributions

  1. Expert-Level Sparsification:
    • The paper introduces expert pruning and expert skipping strategies tailored to improve both deployment efficiency and inference speed without compromising model performance.
    • These methods are designed as post-training techniques, applicable in both task-agnostic and task-specific contexts.
  2. Post-Training Expert Pruning:
    • Unlike weight pruning methods that rely on specially designed hardware, this approach permanently removes less important experts from each MoE layer, reducing the total parameter count without retraining.
    • The paper proposes a layer-wise enumeration method that ranks candidate expert subsets by the reconstruction loss of the layer's output and keeps the subset that best preserves it, allowing a substantial fraction of parameters to be pruned while maintaining competitive performance.
    • For domain-specific tasks, calibration data are drawn from related datasets to guide the choice of which experts to prune.
  3. Dynamic Expert Skipping:
    • Beyond static pruning, a dynamic method is introduced that skips low-contribution experts on a per-token basis during inference, reducing computation on the fly.
    • This dynamic approach complements expert pruning, yielding a leaner and more efficient deployment pipeline; both the enumeration-based pruning and the skipping rule are sketched in the code below.
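
To make the two ideas above concrete, the following toy PyTorch sketch illustrates the per-layer step: an MoE layer with a handful of experts, an enumeration over expert subsets that keeps the subset minimizing the reconstruction loss against the full layer's output on calibration data, and a simplified dynamic-skipping rule that drops low-weight routed experts at inference. This is a minimal sketch under assumed names (MoELayer, prune_layer, skip_threshold) and a generic top-2 router, not the authors' released implementation; in the paper the enumeration is applied layer by layer as calibration activations propagate through the model, and the exact skipping criterion is likewise defined on the routing weights.

```python
# Illustrative sketch only: (a) layer-wise expert pruning by enumerating expert
# subsets and keeping the one with the lowest reconstruction loss on calibration
# data, and (b) dynamic skipping of low-weight routed experts at inference.
import itertools

import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    """Toy top-2 MoE feed-forward layer, written for clarity rather than speed."""

    def __init__(self, d_model=64, d_ff=128, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x, keep=None, skip_threshold=0.0):
        logits = self.router(x)                       # (tokens, num_experts)
        if keep is not None:                          # mask out pruned experts
            mask = torch.full_like(logits, float("-inf"))
            mask[:, keep] = 0.0
            logits = logits + mask
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)          # renormalize routed weights
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                    # per-token loop for clarity
            for k in range(self.top_k):
                w = weights[t, k]
                # Dynamic expert skipping (simplified criterion): drop non-top-1
                # experts whose gate weight falls below a threshold.
                if k > 0 and w < skip_threshold:
                    continue
                out[t] += w * self.experts[int(idx[t, k])](x[t])
        return out


@torch.no_grad()
def prune_layer(layer, calib_x, num_keep):
    """Enumerate expert subsets of size `num_keep` and return the one whose
    output best reconstructs the full layer's output on calibration tokens."""
    reference = layer(calib_x)                        # output with all experts
    best_subset, best_loss = None, float("inf")
    for subset in itertools.combinations(range(len(layer.experts)), num_keep):
        loss = F.mse_loss(layer(calib_x, keep=list(subset)), reference).item()
        if loss < best_loss:
            best_subset, best_loss = list(subset), loss
    return best_subset, best_loss


if __name__ == "__main__":
    torch.manual_seed(0)
    layer = MoELayer()
    calib = torch.randn(32, 64)                       # stand-in calibration tokens
    keep, loss = prune_layer(layer, calib, num_keep=6)
    print(f"kept experts {keep} with reconstruction loss {loss:.4f}")
    with torch.no_grad():
        # Pruning and skipping compose: route only among the kept experts and
        # additionally skip the second routed expert when its weight is small.
        _ = layer(calib, keep=keep, skip_threshold=0.1)
```

Because the enumeration is exhaustive over expert subsets, its cost grows combinatorially with the number of experts per layer, which remains manageable for layers with eight experts such as those in Mixtral 8x7B.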

Experimental Outcomes

  • Performance Metrics:
    • Experiments on MoE LLMs such as Mixtral 8x7B demonstrate significant reductions in memory consumption and clear gains in inference speed, especially when pruning and skipping are combined.
    • The pruned model shows only a minor performance drop (approximately 2.9 points in the task-agnostic setting with two experts pruned).
  • Generation Speed:
    • The combined pruning and skipping approach achieves approximately a 1.33× inference speedup over the unmodified model while running on fewer GPUs, which also reduces inter-GPU communication overhead.

Implications

The research indicates that targeted expert pruning and dynamic skipping enable efficient use of MoE LLMs, a prerequisite for practical deployment across diverse computational environments. By addressing both task-agnostic and task-specific pruning, the paper broadens the applicability of these models beyond general language tasks to domain-specific workloads such as mathematical reasoning.

Future Directions

The paper opens avenues for integrating expert sparsification with other model optimization techniques, such as weight pruning and quantization, potentially enhancing efficiency further across varying scales of LLM architectures.

This paper contributes substantially to the understanding and deployment of sparsely-gated neural networks, potentially informing future developments in both foundational models and task-specific implementations within AI.

Authors (8)
  1. Xudong Lu
  2. Qi Liu
  3. Yuhui Xu
  4. Aojun Zhou
  5. Siyuan Huang
  6. Bo Zhang
  7. Junchi Yan
  8. Hongsheng Li