STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning (2409.06211v1)

Published 10 Sep 2024 in cs.LG and cs.CL

Abstract: Mixture-of-experts (MoEs) have been adopted for reducing inference costs by sparsely activating experts in LLMs. Despite this reduction, the massive number of experts in MoEs still makes them expensive to serve. In this paper, we study how to address this, by pruning MoEs. Among pruning methodologies, unstructured pruning has been known to achieve the highest performance for a given pruning ratio, compared to structured pruning, since the latter imposes constraints on the sparsification structure. This is intuitive, as the solution space of unstructured pruning subsumes that of structured pruning. However, our counterintuitive finding reveals that expert pruning, a form of structured pruning, can actually precede unstructured pruning to outperform unstructured-only pruning. As existing expert pruning, requiring $O(\frac{k^n}{\sqrt{n}})$ forward passes for $n$ experts, cannot scale for recent MoEs, we propose a scalable alternative with $O(1)$ complexity, yet outperforming the more expensive methods. The key idea is leveraging a latent structure between experts, based on behavior similarity, such that the greedy decision of whether to prune closely captures the joint pruning effect. Ours is highly effective -- for Snowflake Arctic, a 480B-sized MoE with 128 experts, our method needs only one H100 and two hours to achieve nearly no loss in performance with 40% sparsity, even in generative tasks such as GSM8K, where state-of-the-art unstructured pruning fails to. The code will be made publicly available.

Summary

  • The paper introduces STUN, a method that combines structured expert pruning with subsequent unstructured pruning to efficiently reduce MoE model size.
  • It presents an O(1) scalable expert pruning technique that leverages latent structural relationships among experts to significantly lower computational costs.
  • Empirical results on a 480B-parameter MoE model show that STUN achieves 40% sparsity without compromising performance on challenging generative tasks.

An Overview of "STUN: Structured-Then-Unstructured Pruning for Scalable MoE Pruning"

In this paper, the authors present a novel approach, Structured-Then-Unstructured Pruning (STUN), to improve the efficiency of Mixture-of-Experts (MoE) models. The paper addresses the challenge of reducing the computational footprint and memory requirements of MoE-based LLMs while maintaining model performance.

Key Contributions

The authors identify a significant gap in existing methodologies for pruning MoEs. Unstructured pruning has traditionally been favored because it achieves higher performance than structured pruning at a given pruning ratio. However, the authors' counterintuitive insight is that preceding unstructured pruning with judiciously applied structured pruning, specifically expert pruning, can yield better overall outcomes.

  1. Structured-Then-Unstructured Pruning (STUN):
    • The proposed method first applies structured pruning to remove entire experts within the MoE architecture, followed by unstructured pruning techniques.
    • This approach aligns with the robustness attributes of MoEs, where the network is inherently designed to maintain performance even if some experts are excluded.
  2. Scalable Expert Pruning:
    • A critical innovation is the reduction of computational complexity in expert pruning. Existing methodologies require $O(\frac{k^n}{\sqrt{n}})$ forward passes to evaluate combinations of experts, making them impractical for large MoEs with many experts.
    • The paper introduces an $O(1)$-complexity method that capitalizes on latent structural relationships between experts, based on behavior similarity, ensuring scalable and efficient pruning (a minimal sketch of both stages follows this list).
  3. Empirical Validation:
    • The authors utilize Snowflake Arctic, a 480B-parameter MoE with 128 experts, showcasing that STUN can prune the network to achieve 40% sparsity in just two hours on a single H100 GPU.
    • Despite this level of pruning, the network's performance remains nearly uncompromised, even in challenging generative tasks such as the GSM8K benchmark.
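
To make the two-stage recipe concrete, below is a minimal sketch in Python. It assumes behavior similarity between experts can be approximated by cosine similarity of their flattened weights and uses plain magnitude pruning for the unstructured stage; the shapes, thresholds, and helper names (`expert_similarity`, `greedy_expert_prune`, `magnitude_prune`) are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def expert_similarity(experts):
    """Pairwise cosine similarity between flattened expert weights.

    A stand-in for the paper's behavior-similarity signal; any proxy that
    groups functionally redundant experts would fit here.
    """
    flat = np.stack([w.ravel() for w in experts])
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    return flat @ flat.T

def greedy_expert_prune(experts, keep_ratio=0.75):
    """Structured stage: greedily drop the most redundant expert, i.e. the one
    whose closest remaining neighbor is most similar under the proxy."""
    n = len(experts)
    sim = expert_similarity(experts)
    np.fill_diagonal(sim, -np.inf)           # ignore self-similarity
    keep = list(range(n))
    while len(keep) > int(keep_ratio * n):
        sub = sim[np.ix_(keep, keep)]
        redundancy = sub.max(axis=1)         # similarity to nearest kept expert
        keep.pop(int(np.argmax(redundancy)))
    return keep

def magnitude_prune(weight, sparsity=0.4):
    """Unstructured stage: zero out the smallest-magnitude weights."""
    threshold = np.quantile(np.abs(weight), sparsity)
    return np.where(np.abs(weight) < threshold, 0.0, weight)

# Toy MoE layer: 8 experts with random weights (hypothetical shapes).
rng = np.random.default_rng(0)
experts = [rng.normal(size=(64, 256)) for _ in range(8)]

kept = greedy_expert_prune(experts, keep_ratio=0.75)                 # structured first
pruned = [magnitude_prune(experts[i], sparsity=0.4) for i in kept]   # then unstructured
print("kept experts:", kept)
```

The point of the ordering is that the structured stage removes behaviorally redundant experts with a cheap greedy decision, and only the surviving experts are then sparsified weight by weight.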

Results and Findings

The results presented offer compelling support for the proposed STUN methodology:

  • STUN vs. Unstructured Pruning:
    • STUN significantly outperforms unstructured pruning alone across various models and tasks. For instance, at 40% sparsity, STUN retains the original GSM8K accuracy, whereas unstructured pruning methods show notable performance degradation.
    • The gains from STUN are consistent across different unstructured pruning methods, such as OWL and Wanda.
  • Expert Pruning Efficiency:
    • The authors demonstrate that their $O(1)$ expert pruning methodology outperforms more computationally expensive existing approaches. This efficiency comes from leveraging pretrained model weights to discern latent clustering structures among experts, achieving superior pruning outcomes without extensive computational overhead (the sketch after this list illustrates one way such a clustering can be read off the weights).
  • Scalability and Robustness:
    • The paper illustrates that STUN scales favorably as MoE models trend toward many small experts, a design that offers more pruning flexibility.
    • Furthermore, the paper establishes that MoEs inherently exhibit robustness to structured pruning and that the expert-pruned networks remain robust to subsequent unstructured pruning, further enhancing the efficacy of the STUN approach.
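
As an illustration of reading a latent clustering out of pretrained weights alone, the sketch below hierarchically clusters experts by the cosine distance between their router rows and keeps one representative per cluster. The router shape, the assumption that one router row corresponds to one expert, and the distance cut-off are hypothetical stand-ins rather than the paper's actual criterion.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Hypothetical router weight: one row per expert. Experts whose rows point in
# similar directions tend to be selected for similar tokens, which is one
# cheap way to expose a latent clustering without any forward passes.
rng = np.random.default_rng(1)
router = rng.normal(size=(128, 4096))        # (num_experts, hidden_dim)

# Agglomerative clustering on cosine distance between router rows.
dist = pdist(router, metric="cosine")
tree = linkage(dist, method="average")
clusters = fcluster(tree, t=0.7, criterion="distance")

# Keep one representative per cluster as a coarse structured-pruning decision.
keep = [np.flatnonzero(clusters == c)[0] for c in np.unique(clusters)]
print(f"{len(keep)} experts kept out of {router.shape[0]}")
```

Because everything here is computed from stored weights, no forward passes over expert combinations are needed, which is the spirit of the $O(1)$-in-forward-passes claim.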

Implications and Future Directions

The STUN framework has significant implications for the deployment and scalability of MoE-based LLMs. By demonstrating that a strategic combination of structured and unstructured pruning can maintain, and in some cases even enhance, performance, the research provides a foundational advancement for optimizing large-scale models.

Practical Implications:

  • The methodology facilitates the deployment of large MoE models in resource-constrained environments by substantially reducing the memory and computational requirements.
  • With the potential to maintain high performance with fewer computational resources, this approach can democratize access to advanced LLMs, enabling broader usage in various applications.

Theoretical Implications:

  • The work opens new avenues for exploring the latent structures within neural networks, particularly in the context of model pruning.
  • The robust performance of STUN suggests that exploring more hybrid pruning methodologies could yield further improvements, not only in MoEs but also in other neural network architectures.

Future Developments:

  • The correlation between pruning robustness and the kurtosis of weights introduces an intriguing dimension for future investigation. Further work could refine these theoretical underpinnings to optimize pruning strategies even further (a minimal kurtosis probe is sketched after this list).
  • Expanding the scope of STUN to non-MoE models and integrating the approach with emerging model architectures will be an interesting direction for future research.
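
As a small, self-contained probe of the kurtosis signal mentioned above, the snippet below computes the excess kurtosis of a layer's weight entries on two synthetic weight matrices. It is a diagnostic only; the direction and strength of its correlation with pruning robustness are questions for the paper and follow-up work, and the synthetic layers are purely illustrative.

```python
import numpy as np
from scipy.stats import kurtosis

def weight_kurtosis(weight):
    """Excess kurtosis of a weight matrix's entries. Higher values indicate
    heavier tails (more outlier weights), the statistic the summary relates
    to robustness under further unstructured pruning."""
    return kurtosis(weight.ravel(), fisher=True)

rng = np.random.default_rng(2)
gaussian_layer = rng.normal(size=(1024, 1024))           # light-tailed weights
heavy_layer = rng.standard_t(df=3, size=(1024, 1024))    # heavy-tailed weights

print(f"gaussian kurtosis:   {weight_kurtosis(gaussian_layer):.2f}")  # near 0
print(f"heavy-tail kurtosis: {weight_kurtosis(heavy_layer):.2f}")     # well above 0
```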

In conclusion, the STUN methodology offers a scalable, efficient, and effective alternative to traditional unstructured pruning techniques for MoE models, presenting a significant step forward in the optimization of large neural networks. The approach's robustness, scalability, and empirical success underscore its potential for broad application across various domains of AI research and deployment.