Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts (2407.09590v3)

Published 12 Jul 2024 in cs.CL and cs.LG

Abstract: By increasing model parameters but activating them sparsely when performing a task, the use of Mixture-of-Experts (MoE) architecture significantly improves the performance of LLMs without increasing the inference cost. However, the memory consumption due to the growing number of experts presents a challenge to the deployment of these models in many real world settings. Our empirical study reveals that some experts encode redundant knowledge during pre-training. We thus propose a method of grouping and pruning similar experts to improve the model's parameter efficiency. We validate the effectiveness of our method by pruning three state-of-the-art MoE architectures, including Mixtral, Deepseek-MoE, and Qwen. The evaluation shows that our method outperforms other model pruning methods on a range of natural language tasks. We will release our code to facilitate future research.

Diversifying the Expert Knowledge for Task-Agnostic Pruning in Sparse Mixture-of-Experts

The paper under review presents a novel methodology for pruning Mixture-of-Experts (MoE) models in a task-agnostic manner, addressing a critical challenge in the deployment of LLMs. The authors propose a method to enhance parameter efficiency by grouping and pruning redundant experts within MoE layers. This approach is empirically validated on state-of-the-art models such as Mixtral-8x7B and Mixtral-8x22B, demonstrating superior performance over existing pruning techniques.

Introduction

LLMs have achieved significant advances by scaling parameters through architectures such as the sparsely activated MoE. Despite their strong performance, the large number of experts in MoE models incurs substantial memory costs, impeding their practicality in real-world applications. This paper introduces a pruning method that does not rely on task-specific information, making it more versatile and broadly applicable.

Methodology

The core of the proposed method revolves around identifying and pruning redundant experts in a task-agnostic fashion. The approach comprises two main stages:

  1. Expert Similarity Estimation:
    • Centered Kernel Alignment (CKA) is used to quantify the similarity between experts within the same MoE layer. This metric captures how similarly different experts respond to the same input data.
  2. Pruning and Merging Experts:
    • Similar experts are grouped into clusters with a graph partitioning algorithm; each group is then merged into a single expert, along with its corresponding routing weights.

This two-step strategy retains as much of the original knowledge encoded in the experts as possible while reducing redundancy and memory usage; a minimal sketch of both steps follows this paragraph.
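The sketch below illustrates the two stages under simplifying assumptions: linear CKA is computed on cached per-expert outputs over a shared calibration batch, grouping is approximated with agglomerative clustering on (1 − CKA) distances as a stand-in for the paper's graph-partitioning step, and merged experts are formed by uniformly averaging parameters and router rows. The function names and the uniform-averaging choice are illustrative, not the authors' released code.

```python
import numpy as np
import torch
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def linear_cka(x: torch.Tensor, y: torch.Tensor) -> float:
    """Linear CKA between two activation matrices of shape (num_tokens, hidden_dim)."""
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    cross = torch.linalg.norm(y.T @ x) ** 2  # ||Y^T X||_F^2
    return (cross / (torch.linalg.norm(x.T @ x) * torch.linalg.norm(y.T @ y))).item()


def expert_similarity(expert_outputs):
    """Pairwise CKA over a list of per-expert outputs on the same calibration tokens."""
    n = len(expert_outputs)
    sim = torch.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = linear_cka(expert_outputs[i], expert_outputs[j])
    return sim


def group_and_merge(sim, expert_state_dicts, router_weight, num_groups):
    """Cluster experts by similarity and average each cluster into one expert.

    expert_state_dicts: list of per-expert parameter dicts (same keys/shapes).
    router_weight: gating matrix of shape (num_experts, hidden_dim).
    """
    dist = np.clip((1.0 - sim).cpu().numpy(), 0.0, None)
    np.fill_diagonal(dist, 0.0)
    labels = fcluster(
        linkage(squareform(dist, checks=False), method="average"),
        t=num_groups,
        criterion="maxclust",
    )

    merged_experts, merged_router_rows = [], []
    for g in sorted(set(labels)):
        members = [i for i, lbl in enumerate(labels) if lbl == g]
        # Uniformly average the parameters of all experts in this group.
        merged_experts.append({
            k: torch.stack([expert_state_dicts[i][k] for i in members]).mean(dim=0)
            for k in expert_state_dicts[members[0]]
        })
        # Merge the corresponding router rows the same way.
        merged_router_rows.append(router_weight[members].mean(dim=0))
    return merged_experts, torch.stack(merged_router_rows)
```

In an actual MoE checkpoint, the merged state dicts and router rows would then replace the original experts and gating matrix of that layer, reducing the expert count from `num_experts` to `num_groups`.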

Experimental Results

The authors conducted extensive experiments to validate their method. The main evaluation metric was zero-shot performance on standard benchmarks such as MMLU, BoolQ, OpenBookQA, and RTE. The key results are summarized as follows:

  • Mixtral-8x7B: The proposed methods outperform existing pruning strategies by an average margin of 1.5%, maintaining competitive performance despite the reduction in the number of experts.
  • Mixtral-8x22B: The approach using surrogate weight representations achieves the best results, with only a 2.8% average performance drop compared to the full model.

Empirical Analysis

The paper also provides a detailed empirical analysis of expert behavior before and after pruning. By comparing how frequently tokens are routed to each expert, the authors illustrate that their pruning method reduces expert redundancy while preserving the diversity of task-specific knowledge.
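As a rough illustration of this kind of analysis, the sketch below counts how often each expert is selected by a Mixtral-style top-k router over a batch of tokens. The `router_logits` input and `top_k` default are assumptions, and this is a generic diagnostic rather than the paper's analysis code.

```python
import torch


def expert_visit_frequency(router_logits: torch.Tensor, top_k: int = 2) -> torch.Tensor:
    """Fraction of routing slots assigned to each expert.

    router_logits: gating scores of shape (num_tokens, num_experts),
    collected for one MoE layer over a calibration batch.
    """
    num_experts = router_logits.shape[-1]
    chosen = router_logits.topk(top_k, dim=-1).indices  # (num_tokens, top_k)
    counts = torch.bincount(chosen.flatten(), minlength=num_experts)
    return counts.float() / counts.sum()
```

Comparing these frequencies before and after pruning gives a quick view of whether the merged experts absorb the routing traffic of the experts they replaced.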

Implications and Future Work

The implications of this research are significant for the deployment of LLMs in resource-constrained environments. By efficiently pruning redundant experts, this method paves the way for more practical and scalable applications of LLMs without substantial performance degradation. Future research could explore adaptive pruning strategies that dynamically adjust the number of experts based on task requirements and computational constraints.

Conclusion

The proposed task-agnostic pruning method effectively addresses the challenge of memory consumption in sparse MoE architectures. By discovering and merging similar experts, this approach not only reduces memory usage but also maintains high performance across various tasks. This contribution is valuable for enhancing the practicality of deploying large-scale models in diverse settings.

Authors (5)
  1. Zeliang Zhang (34 papers)
  2. Xiaodong Liu (162 papers)
  3. Hao Cheng (190 papers)
  4. Chenliang Xu (114 papers)
  5. Jianfeng Gao (344 papers)
Citations (3)