Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts
The paper under review presents a novel methodology for pruning Mixture-of-Experts (MoE) models in a task-agnostic manner, addressing a critical challenge in the deployment of large language models (LLMs). The authors propose a method that improves parameter efficiency by grouping and pruning redundant experts within MoE layers. The approach is empirically validated on state-of-the-art models such as Mixtral-8x7B and Mixtral-8x22B, where it outperforms existing pruning techniques.
Introduction
LLMs have advanced substantially by scaling parameter counts through architectures such as the sparsely-activated MoE. Despite their strong performance, the large number of experts in MoE models incurs substantial memory costs, limiting their practicality in real-world deployments. This paper introduces a pruning method that does not rely on task-specific information and is therefore more versatile and broadly applicable.
Methodology
The core of the proposed method revolves around identifying and pruning redundant experts in a task-agnostic fashion. The approach comprises two main stages:
- Expert Similarity Estimation: Centered Kernel Alignment (CKA) is used to quantify the similarity between experts within the same MoE layer, capturing how similarly different experts respond to the same input data (a sketch of this computation follows the list).
- Pruning and Merging Experts: Similar experts are grouped into clusters via a graph partitioning algorithm, and each group is merged into a single expert together with the corresponding routing weights (see the merging sketch below).
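To make the similarity step concrete, below is a minimal sketch of linear CKA computed between the output activations of two experts on a shared batch of tokens. The expert modules, the calibration hidden states, and all variable names are illustrative assumptions rather than the authors' implementation.

```python
import torch

def linear_cka(X: torch.Tensor, Y: torch.Tensor) -> torch.Tensor:
    """Linear CKA between two activation matrices of shape (n_tokens, d).

    Both inputs are centered per feature; higher values mean the two
    experts respond more similarly to the same tokens.
    """
    X = X - X.mean(dim=0, keepdim=True)
    Y = Y - Y.mean(dim=0, keepdim=True)
    cross = (X.T @ Y).norm(p="fro") ** 2   # ||X^T Y||_F^2
    self_x = (X.T @ X).norm(p="fro")       # ||X^T X||_F
    self_y = (Y.T @ Y).norm(p="fro")       # ||Y^T Y||_F
    return cross / (self_x * self_y)

def expert_similarity_matrix(experts, hidden_states):
    """Pairwise CKA of all experts in one MoE layer on a small
    calibration batch of hidden states (hypothetical interface)."""
    outputs = [expert(hidden_states) for expert in experts]  # each: (n_tokens, d)
    n = len(outputs)
    sim = torch.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = linear_cka(outputs[i], outputs[j])
    return sim
```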
This two-step strategy retains as much of the original knowledge encoded in the experts as possible while reducing redundancy and memory usage.
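The grouping-and-merging step can be sketched in a similar spirit. In the snippet below, a simple greedy agglomerative grouping stands in for the paper's graph partitioning algorithm, and each group's experts are merged by parameter averaging while their routing weights are summed; both rules are illustrative assumptions, not the authors' exact procedure.

```python
import torch

def group_experts(sim: torch.Tensor, num_groups: int):
    """Greedily merge the two most similar groups of experts until only
    `num_groups` remain (a stand-in for graph partitioning)."""
    groups = [[i] for i in range(sim.shape[0])]
    while len(groups) > num_groups:
        best, pair = -1.0, (0, 1)
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                # Average similarity between the two groups' members.
                score = sim[groups[a]][:, groups[b]].mean().item()
                if score > best:
                    best, pair = score, (a, b)
        a, b = pair
        groups[a].extend(groups.pop(b))  # b > a, so index a stays valid
    return groups

def merge_group(expert_weights, router_weights, group):
    """Average the parameters of one group into a single expert and sum
    the corresponding rows of the router's gate matrix (assumed shapes:
    expert_weights is a list of state dicts, router_weights is
    (n_experts, d_model))."""
    merged_expert = {
        name: torch.stack([expert_weights[i][name] for i in group]).mean(0)
        for name in expert_weights[group[0]]
    }
    merged_router = router_weights[group].sum(0)
    return merged_expert, merged_router
```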
Experimental Results
The authors conducted extensive experiments to validate their method, measuring zero-shot performance on standard benchmarks such as MMLU, BoolQ, OpenBookQA, and RTE. The key results are summarized as follows:
- Mixtral-8x7B: The proposed methods outperform existing pruning strategies by an average margin of 1.5%, maintaining competitive performance despite the reduction in the number of experts.
- Mixtral-8x22B: The approach using surrogate weight representations achieves the best results, with only a 2.8% average performance drop compared to the full model.
Empirical Analysis
The paper also provides a detailed empirical analysis of expert behavior before and after pruning. By comparing how frequently tokens are routed to each expert, the authors show that their pruning method reduces expert redundancy while preserving the diversity of knowledge across the remaining experts; a rough illustration of this kind of measurement follows.
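As a simple illustration of such an analysis (not the authors' exact procedure), one can count how often the router selects each expert over a calibration set and compare the distributions before and after pruning. The router interface and the top-k value below are assumptions; k=2 mirrors Mixtral-style routing.

```python
import torch

def expert_visit_frequency(router_logits: torch.Tensor, top_k: int = 2) -> torch.Tensor:
    """Fraction of token slots routed to each expert in one MoE layer.

    `router_logits` has shape (n_tokens, n_experts); top-k routing with
    k=2 is assumed, as in Mixtral.
    """
    n_tokens, n_experts = router_logits.shape
    top_experts = router_logits.topk(top_k, dim=-1).indices       # (n_tokens, k)
    counts = torch.bincount(top_experts.flatten(), minlength=n_experts)
    return counts.float() / (n_tokens * top_k)

# Comparing these distributions before and after pruning (e.g. via their
# entropy) indicates whether the remaining experts are used more evenly.
```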
Implications and Future Work
The implications of this research are significant for the deployment of LLMs in resource-constrained environments. By efficiently pruning redundant experts, this method paves the way for more practical and scalable applications of LLMs without substantial performance degradation. Future research could explore adaptive pruning strategies that dynamically adjust the number of experts based on task requirements and computational constraints.
Conclusion
The proposed task-agnostic pruning method effectively addresses the challenge of memory consumption in sparse MoE architectures. By discovering and merging similar experts, this approach not only reduces memory usage but also maintains high performance across various tasks. This contribution is valuable for enhancing the practicality of deploying large-scale models in diverse settings.