BlockPruner: Fine-grained Pruning for Large Language Models

Published 15 Jun 2024 in cs.CL | (2406.10594v4)

Abstract: With the rapid growth in the size and complexity of LLMs, the costs associated with their training and inference have escalated significantly. Research indicates that certain layers in LLMs harbor substantial redundancy, and pruning these layers has minimal impact on the overall performance. While various layer pruning methods have been developed based on this insight, they generally overlook the finer-grained redundancies within the layers themselves. In this paper, we delve deeper into the architecture of LLMs and demonstrate that finer-grained pruning can be achieved by targeting redundancies in multi-head attention (MHA) and multi-layer perceptron (MLP) blocks. We propose a novel, training-free structured pruning approach called BlockPruner. Unlike existing layer pruning methods, BlockPruner segments each Transformer layer into MHA and MLP blocks. It then assesses the importance of these blocks using perplexity measures and applies a heuristic search for iterative pruning. We applied BlockPruner to LLMs of various sizes and architectures and validated its performance across a wide range of downstream tasks. Experimental results show that BlockPruner achieves more granular and effective pruning compared to state-of-the-art baselines.

Summary

The paper introduces BlockPruner, a method that decomposes Transformer layers into MHA and MLP blocks to target finer-grained redundancies without significant performance loss.
The iterative search algorithm employs perplexity measures to accurately assess block importance, outperforming traditional layer-level pruning techniques.
Experimental results demonstrate that BlockPruner achieves superior compression on models like Llama2, Baichuan2, and Qwen1.5 while maintaining robust downstream performance.

BlockPruner: Fine-grained Pruning for LLMs

The paper "BlockPruner: Fine-grained Pruning for Large Language Models" by Longguang Zhong et al. presents a training-free approach to the structured pruning of LLMs. The proposed method, BlockPruner, improves model compression by targeting finer-grained redundancies within LLM layers, specifically the MHA and MLP blocks.

Introduction

The paper identifies that the increasing size and complexity of LLMs have resulted in substantial computational demands, challenging their deployment in resource-limited environments. Traditional model compression techniques, including knowledge distillation, quantization, and pruning, aim to reduce these demands without significantly impacting performance. BlockPruner diverges from traditional approaches by addressing redundancies at a finer level of granularity within layers.

Recent studies reveal that LLMs contain redundant layers, which can be pruned without severely affecting performance. Layer pruning methods built on this observation preserve the model's overall architecture but overlook redundancies within the MHA and MLP blocks inside each layer. BlockPruner therefore segments Transformer layers into MHA and MLP blocks, assesses the importance of each block using perplexity, and applies a heuristic search to prune blocks iteratively.

Figure 1: Block Influence (BI) scores (Men et al., 2024) for Llama2-7B indicate finer-grained redundancies at the block level than at the layer level.

Methodology

Minimal Residual Block

BlockPruner operates by decomposing each Transformer layer into two residual blocks: MHA and MLP (Figure 2). This decomposition allows for finer control over pruning decisions, targeting specific blocks rather than entire layers.

Figure 2: Illustration depicting that a Transformer layer can be subdivided into two residual blocks.
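The decomposition can be illustrated with a toy NumPy sketch. It is not the paper's implementation: `make_block` is a hypothetical stand-in for a real MHA or MLP sublayer, and normalization is omitted. The point it shows is that each sublayer sits on its own residual path, so a pruned block reduces to the identity and the two blocks of a layer can be dropped independently.

```python
import numpy as np

def make_block(rng, dim):
    """Return a toy sublayer f(x). Hypothetical stand-in for MHA or MLP."""
    w = rng.normal(scale=0.1, size=(dim, dim))
    return lambda x: np.tanh(x @ w)

def layer_forward(x, mha, mlp, keep_mha=True, keep_mlp=True):
    """One Transformer layer viewed as two residual blocks.

    A pruned block contributes nothing, so the input passes through unchanged.
    """
    if keep_mha:
        x = x + mha(x)  # residual block 1 (attention)
    if keep_mlp:
        x = x + mlp(x)  # residual block 2 (feed-forward)
    return x

rng = np.random.default_rng(0)
mha, mlp = make_block(rng, 8), make_block(rng, 8)
x = rng.normal(size=(4, 8))  # 4 tokens, hidden size 8

full = layer_forward(x, mha, mlp)
no_mha = layer_forward(x, mha, mlp, keep_mha=False)  # prune only the MHA block
identity = layer_forward(x, mha, mlp, keep_mha=False, keep_mlp=False)  # prune the whole layer
```

Layer pruning corresponds to the last call, where both flags are off together; block pruning also allows the middle case.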

Block Importance

The method scores each block by the perplexity of the model on a calibration set after that block has been removed: the smaller the resulting perplexity, the less important the block. Because perplexity is computed on the model's final output, it captures the cumulative effect of removing the block, whereas local metrics such as Block Influence or Relative Magnitude look only at the block's own input and output.
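As a rough illustration of a perplexity-based score, the sketch below computes perplexity as the exponential of the mean negative log-likelihood per token and then picks the block whose removal leaves perplexity lowest. The function names and the example block labels are assumptions made for illustration; obtaining the per-token log-probabilities from a real model is out of scope here.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood) over calibration tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def least_important(ppl_after_removal):
    """Return the block whose removal leaves perplexity lowest.

    ppl_after_removal maps a block name to the perplexity measured
    with that single block removed (hypothetical values below).
    """
    return min(ppl_after_removal, key=ppl_after_removal.get)

# A model assigning probability 0.25 to every token has perplexity 4.
uniform_ppl = perplexity([math.log(0.25)] * 4)

# Hypothetical scores: removing "mha_3" hurts least, so it is pruned first.
candidate = least_important({"mha_3": 6.1, "mlp_3": 9.4})
```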

Iterative Search for Block Pruning

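BlockPruner removes blocks one at a time. At each step it measures, on a calibration dataset, the perplexity that results from removing each remaining block, prunes the block with the lowest importance score, and repeats on the pruned model. This differs from static pruning approaches, which score all blocks once and remove the lowest-ranked ones in a single pass without accounting for how earlier removals change the importance of the remaining blocks.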

Figure 3: Overview of BlockPruner, illustrating the iterative pruning process.
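The iterative search can be sketched as a greedy loop. This is a minimal sketch under stated assumptions, not the authors' code: `iterative_prune` and `toy_ppl` are hypothetical names, and `toy_ppl` replaces a real perplexity evaluation on calibration data with a hand-written function in which the two MHA blocks are redundant with each other, so removing either is cheap but removing both is costly.

```python
def iterative_prune(blocks, ppl_fn, num_to_remove):
    """Greedy search: re-score every remaining block before each removal."""
    kept, removed = list(blocks), []
    for _ in range(num_to_remove):
        # Perplexity of the current pruned model with one more block taken out.
        scores = {b: ppl_fn([k for k in kept if k != b]) for b in kept}
        victim = min(scores, key=scores.get)  # lowest perplexity = least important
        kept.remove(victim)
        removed.append(victim)
    return kept, removed

def toy_ppl(kept):
    """Hypothetical stand-in for perplexity on a calibration set."""
    ppl = 5.0
    if "mlp_0" not in kept:
        ppl += 2.0
    if "mha_0" not in kept and "mha_1" not in kept:
        ppl += 10.0  # the two MHA blocks cover for each other
    elif "mha_0" not in kept:
        ppl += 0.1
    elif "mha_1" not in kept:
        ppl += 0.2
    if "mlp_1" not in kept:
        ppl += 1.0
    return ppl

blocks = ["mha_0", "mlp_0", "mha_1", "mlp_1"]
kept, removed = iterative_prune(blocks, toy_ppl, num_to_remove=2)

# One-shot baseline: score each block once, then drop the two lowest together.
once = {b: toy_ppl([k for k in blocks if k != b]) for b in blocks}
one_shot_removed = sorted(once, key=once.get)[:2]
one_shot_kept = [b for b in blocks if b not in one_shot_removed]
```

In this toy setting the one-shot ranking removes both MHA blocks, while the iterative loop notices after the first removal that the second MHA block has become important and removes an MLP block instead.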

Experimental Results

BlockPruner was applied to various LLMs, including Llama2, Baichuan2, and Qwen1.5, across several benchmarks. The results showed that BlockPruner achieves more granular pruning with less performance loss compared to other state-of-the-art methods.

Main Results

BlockPruner consistently outperformed baselines such as SliceGPT, LaCo, ShortGPT, and Relative Magnitude (RM), as shown in the paper's main results table. The approach remained robust across different pruning ratios, maintaining higher average scores on downstream tasks.

Figure 4: The impact of different block importance metrics on the pruning performance of BlockPruner.
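For context on the local metrics that perplexity is compared against, Block Influence is described in ShortGPT as one minus the average cosine similarity between a block's input and output hidden states: a block that barely changes its input scores close to zero. A minimal NumPy sketch follows; the function name, argument names, and the `eps` guard are assumptions for illustration.

```python
import numpy as np

def block_influence(h_in, h_out, eps=1e-8):
    """1 - mean cosine similarity between input and output hidden states.

    h_in, h_out: arrays of shape (tokens, dim) captured before and after a block.
    """
    dot = np.sum(h_in * h_out, axis=-1)
    norms = np.linalg.norm(h_in, axis=-1) * np.linalg.norm(h_out, axis=-1)
    return float(1.0 - np.mean(dot / (norms + eps)))

rng = np.random.default_rng(0)
h = rng.normal(size=(16, 32))
unchanged = block_influence(h, h)       # block acts as the identity
rescaled = block_influence(h, 3.0 * h)  # rescaling alone does not change direction
flipped = block_influence(h, -h)        # output points the opposite way
```

Because the score depends only on one block's own input and output, it is cheap to compute but does not reflect how a removal propagates to the model's final predictions.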

Ablation Study

An ablation study revealed that each component of the BlockPruner algorithm contributes significantly to its effectiveness. Removing the search procedure or substituting block pruning with layer pruning led to substantial performance declines.

Redundancies Between MHA and MLP

Experiments that pruned only MHA blocks or only MLP blocks showed that MHA blocks exhibit higher redundancy and are therefore pruned more often with minimal impact on performance. This supports BlockPruner's choice to target redundancy at the block level rather than treating each layer as a single unit.

Figure 5: The proportion of MHA blocks removed during the pruning process.

Impact of Dataset on Pruning

The choice of calibration dataset affects pruning efficacy. The Alpaca dataset, aligned with downstream tasks, consistently delivered better results than other datasets.

Conclusion

BlockPruner is a fine-grained pruning strategy that targets block-level redundancies in LLMs, compressing models with minimal performance degradation. By combining perplexity-based importance scores with an iterative search, it offers a training-free way to prepare LLMs for deployment in resource-constrained environments.

Limitations

Future work could explore alternative block importance metrics and search algorithms to refine the pruning process further. Extending the approach to even larger LLMs would broaden its applicability and is left to subsequent studies.
