
MINI-LLM: Memory-Efficient Structured Pruning for Large Language Models (2407.11681v1)

Published 16 Jul 2024 in cs.CL

Abstract: As LLMs grow dramatically in size, there is an increasing trend in compressing and speeding up these models. Previous studies have highlighted the usefulness of gradients for importance scoring in neural network compression, especially in pruning medium-size networks. However, the substantial memory requirements involved in calculating gradients with backpropagation impede the utilization of gradients in guiding LLM pruning. As a result, most pruning strategies for LLMs rely on gradient-free criteria, such as weight magnitudes or a mix of magnitudes and activations. In this paper, we devise a hybrid pruning criterion, which appropriately integrates magnitude, activation, and gradient to capitalize on feature map sensitivity for pruning LLMs. To overcome memory requirement barriers, we estimate gradients using only forward passes. Based on this, we propose a Memory-effIcieNt structured prunIng procedure for LLMs (MINI-LLM) to remove non-critical channels and multi-head attention heads. Experimental results demonstrate the superior performance of MINI-LLM over existing gradient-free methods on three LLMs: LLaMA, BLOOM, and OPT across various downstream tasks (classification, multiple-choice, and generation), while MINI-LLM maintains a GPU memory footprint akin to gradient-free methods.

Overview of "MINI-LLM: Memory-Efficient Structured Pruning for LLMs"

This paper introduces MINI-LLM, a memory-efficient structured pruning method for LLMs. As LLMs continue to grow in size, compressing and accelerating them becomes necessary for practical deployment. The authors address this need by using gradients, estimated without backpropagation, to guide the pruning process.

Key Contributions

  1. Feature Map Sensitivity Criterion: The paper introduces a pruning criterion called the Feature Map Sensitivity (FMS) score, which blends weight magnitudes, activations, and gradients to measure how sensitive a feature map is to removing the weights that produce it. This gives a more complete estimate of weight importance than magnitude- or activation-only criteria (a hedged sketch follows this list).
  2. Gradient Estimation via Forward Passes: To avoid the memory cost of computing gradients with backpropagation, the authors estimate gradients using forward passes alone, drawing on Zeroth-Order (ZO) optimization techniques. This keeps the memory footprint close to that of gradient-free methods while still providing gradient information (see the same sketch below).
  3. Practical Implementation and Recovery: The MINI-LLM framework performs structured pruning of channels and attention heads and then uses Low-Rank Adaptation (LoRA) to recover model performance after pruning, keeping fine-tuning cheap in both memory and compute (a brief recovery sketch also follows).
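
To make these ideas concrete, below is a minimal PyTorch-style sketch of the two central ingredients: an SPSA-style zeroth-order gradient estimate obtained from two forward passes, and an importance score that blends weight magnitude, activation, and the estimated gradient. The function names, the per-channel reduction, and the exact way the signals are combined are illustrative assumptions, not the paper's verbatim formulation.

```python
import torch

@torch.no_grad()
def estimate_gradient_zo(loss_fn, weight, eps=1e-3):
    """Estimate dL/dW with two forward passes (SPSA-style), no backprop graph.

    loss_fn: zero-argument closure that runs a forward pass and returns a scalar loss.
    """
    z = torch.randn_like(weight)          # random perturbation direction
    weight.add_(eps * z)                  # evaluate at W + eps*z
    loss_plus = loss_fn()
    weight.sub_(2 * eps * z)              # evaluate at W - eps*z
    loss_minus = loss_fn()
    weight.add_(eps * z)                  # restore W
    # Directional finite difference projected back onto the perturbation.
    return (loss_plus - loss_minus) / (2 * eps) * z

@torch.no_grad()
def fms_score(weight, activation, grad_est):
    """Per-output-channel importance from magnitude, activation, and gradient.

    weight:     (out_features, in_features)
    activation: (batch, in_features) calibration inputs to this layer
    grad_est:   same shape as weight, e.g. from estimate_gradient_zo
    """
    feature_map = activation @ weight.t()                # (batch, out_features)
    saliency = (weight * grad_est).abs().sum(dim=1)      # first-order weight sensitivity
    return feature_map.abs().mean(dim=0) * saliency      # blend activation and gradient signals
```

Channels (or attention heads) whose scores fall below a threshold would then be removed, which is the structured-pruning step the framework performs; the scoring here is a sketch of the idea rather than the exact MINI-LLM criterion.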

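For the recovery stage, the following is a hedged sketch of how LoRA adapters could be attached to a pruned model with the Hugging Face peft library; the rank, target modules, and other hyperparameters are illustrative placeholders, not values reported in the paper.

```python
from peft import LoraConfig, get_peft_model

def add_lora_for_recovery(pruned_model):
    # Illustrative configuration; the paper's exact settings may differ.
    config = LoraConfig(
        r=8,                                   # low-rank adapter dimension
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],   # typical LLaMA attention projections
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )
    # Only the small adapter matrices are trained; the pruned base weights stay frozen,
    # which keeps recovery fine-tuning memory close to inference memory.
    return get_peft_model(pruned_model, config)
```
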
Experimental Validation

MINI-LLM is validated on three LLM families (LLaMA, BLOOM, and OPT), demonstrating its adaptability and robustness. The structured pruning approach is evaluated on several downstream tasks: text classification, multiple-choice question answering, and text generation. In these settings:

  • MINI-LLM surpasses existing gradient-free pruning techniques.
  • It often achieves competitive, if not superior, performance compared to gradient-based methods that involve backpropagation, while maintaining a lower GPU memory footprint.

Theoretical and Practical Implications

The Feature Map Sensitivity criterion extends existing pruning strategies by combining several informative signals (weight magnitude, activation, and gradient) rather than relying on any single one. This combination tailors pruning decisions to how LLM feature maps actually respond to the removal of channels and attention heads.

Practically, the ZO-based gradient estimation marks a significant advancement, enabling the deployment of gradient-informed pruning in resource-constrained environments. This could particularly impact deployments where GPU memory is a limiting factor, such as edge devices or scalable cloud implementations.

Future Research Directions

The approach described in this paper sets a precedent for future work on model compression. Several promising extensions and open questions arise, such as:

  • Could MINI-LLM's principles be extended to prune models dynamically for specific tasks at run time without noticeable performance degradation?
  • How can the gradient estimation approach be further optimized or adapted to other neural network optimizations beyond pruning?

In conclusion, the MINI-LLM framework contributes to the field of model compression by providing a practical, memory-efficient approach to structured pruning. The combination of a principled importance criterion and low memory cost makes this work a promising basis for further exploration across other facets of neural network optimization.

Authors (3)
  1. Hongrong Cheng (3 papers)
  2. Miao Zhang (146 papers)
  3. Javen Qinfeng Shi (34 papers)
Citations (1)