
LLM-Pruner: On the Structural Pruning of Large Language Models (2305.11627v3)

Published 19 May 2023 in cs.CL

Abstract: LLMs have shown remarkable capabilities in language understanding and generation. However, such impressive capability typically comes with a substantial model size, which presents significant challenges in the deployment, inference, and training stages. With LLMs being general-purpose task solvers, we explore their compression in a task-agnostic manner, which aims to preserve the multi-task solving and language generation ability of the original LLM. One challenge to achieving this is the enormous size of the training corpus of an LLM, which makes both data transfer and model post-training over-burdensome. Thus, we tackle the compression of LLMs within the bound of two constraints: being task-agnostic and minimizing the reliance on the original training dataset. Our method, named LLM-Pruner, adopts structural pruning that selectively removes non-critical coupled structures based on gradient information, maximally preserving the majority of the LLM's functionality. The performance of pruned models can then be efficiently recovered through tuning techniques (LoRA) in merely 3 hours, requiring only 50K data. We validate LLM-Pruner on three LLMs, including LLaMA, Vicuna, and ChatGLM, and demonstrate that the compressed models still exhibit satisfactory capabilities in zero-shot classification and generation. The code is available at: https://github.com/horseee/LLM-Pruner

Structural Pruning of LLMs with LLM-Pruner

Introduction

LLMs have demonstrated significant abilities in various tasks related to language understanding and generation. Despite their impressive performance, LLMs pose substantial challenges in deployment and inference due to their extensive scale and computational demands. Most existing methods targeting LLM compression focus on task-specific scenarios, which demand heavy reliance on the training corpus and prolonged post-training. To address these issues, this paper introduces LLM-Pruner, a novel framework for the structural pruning of LLMs aimed at compressing the model while preserving its task-agnostic functionalities and minimizing dependency on the initial training dataset.

Methodology

LLM-Pruner is designed as a three-stage process, consisting of Dependency Discovery, Importance Estimation, and Model Recovery.

  1. Dependency Discovery: The method begins by identifying groups of mutually dependent structures within the LLM. A dependency graph is built by treating each neuron as a trigger: pruning one neuron propagates to every neuron that relies on it exclusively, and the iteration collects all affected neurons into a coupled group that must be removed together (a schematic traversal is sketched after this list). This automatic detection is crucial for keeping the model structurally consistent during pruning.
  2. Importance Estimation: Once the coupled structures are grouped, their importance is estimated from first-order gradient information together with an approximation of the second-order Hessian term. Per-element scores can be aggregated into a group score in several ways, such as summation or taking the maximum (see the scoring sketch below). These estimates allow the pruner to remove the groups expected to cause the least performance disruption.
  3. Model Recovery: To counter the performance drop after pruning, low-rank adaptation (LoRA) is employed for rapid recovery on a limited dataset. This significantly reduces training cost and time, allowing quick fine-tuning of the pruned model (an illustrative setup also follows below).
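
The following is a minimal, schematic sketch of the trigger-style dependency discovery described in step 1, assuming the network is represented as simple adjacency maps between prunable units; the data structures and names are illustrative and not taken from the released LLM-Pruner code.

```python
from collections import deque

def discover_group(seed, in_edges, out_edges):
    """Return the coupled group that must be pruned together with `seed`.

    `in_edges[n]` / `out_edges[n]` list the units feeding into / fed by unit n.
    Pruning a unit "triggers" any neighbour that depends on it exclusively,
    and the trigger propagates until no new units are affected.
    """
    group, queue = {seed}, deque([seed])
    while queue:
        unit = queue.popleft()
        # A downstream unit whose only input is `unit` becomes unreachable
        # once `unit` is pruned, so it joins the group and triggers further.
        for nxt in out_edges.get(unit, []):
            if len(in_edges.get(nxt, [])) == 1 and nxt not in group:
                group.add(nxt)
                queue.append(nxt)
        # Symmetrically, an upstream unit whose only consumer is `unit`
        # becomes useless after pruning and is removed as well.
        for prev in in_edges.get(unit, []):
            if len(out_edges.get(prev, [])) == 1 and prev not in group:
                group.add(prev)
                queue.append(prev)
    return group

# Example: an attention head whose output feeds a single projection column.
out_edges = {"attn_head_0": ["o_proj_col_0"]}
in_edges = {"o_proj_col_0": ["attn_head_0"]}
print(discover_group("attn_head_0", in_edges, out_edges))
# -> {'attn_head_0', 'o_proj_col_0'}
```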
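For step 2, the sketch below scores one coupled group with the first-order Taylor term |w · dL/dw| computed from a small calibration batch; this is a hedged approximation of the paper's gradient-based importance (the Hessian term is omitted here), and `group_params` is an assumed list of the group's weight tensors with gradients already populated.

```python
import torch

def group_importance(group_params, aggregate="sum"):
    """Estimate how much the loss would change if this coupled group were removed.

    group_params: iterable of torch.nn.Parameter belonging to one dependency
    group, after loss.backward() on a small calibration batch.
    """
    scores = []
    for p in group_params:
        if p.grad is None:
            continue
        # Element-wise first-order importance |w * dL/dw|, summed per tensor.
        scores.append((p * p.grad).abs().sum())
    if not scores:
        return torch.tensor(0.0)
    stacked = torch.stack(scores)
    # Aggregate tensor-level scores into a single group score.
    return stacked.sum() if aggregate == "sum" else stacked.max()
```

Groups with the lowest scores would then be the pruning candidates at a given target sparsity.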
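For step 3, a hedged sketch of the recovery stage using Hugging Face PEFT's LoRA: the checkpoint path, target module names, and hyperparameters are illustrative assumptions (LLaMA-style projection names), not the paper's exact configuration.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the pruned checkpoint (illustrative path; in practice this is the
# output of the pruning stage).
pruned_model = AutoModelForCausalLM.from_pretrained("path/to/pruned-checkpoint")

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed module names
    task_type="CAUSAL_LM",
)

# Attach LoRA adapters so only a small set of low-rank parameters is trained.
pruned_model = get_peft_model(pruned_model, lora_config)
pruned_model.print_trainable_parameters()
# ...then run a standard causal-LM fine-tuning loop (e.g. transformers.Trainer)
# on a small instruction corpus; the paper reports ~50K samples and ~3 hours.
```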

Experimental Validation

The efficacy of LLM-Pruner was validated on three LLMs: LLaMA-7B, Vicuna-7B, and ChatGLM-6B. The experiments aimed to evaluate the generation quality and zero-shot classification performance post-pruning.

Results Overview

The experimental results indicate that LLM-Pruner effectively reduces the number of parameters while preserving a large fraction of the original model's performance. For instance, with a 20% parameter reduction, the pruned LLaMA-7B retained 94.97% of its original performance after tuning, demonstrating the method's robustness. The pruned LLMs also performed competitively against uncompressed models of similar size, such as ChatGLM-6B.

Discussion

LLM-Pruner offers significant practical implications for deploying LLMs in resource-constrained environments. By reducing the model size and inference latency while preserving task versatility, it allows for more efficient utilization of computational resources. This is particularly beneficial in scenarios where access to the original training dataset is limited or proprietary, making extensive post-training impractical.

Theoretical Implications

From a theoretical standpoint, LLM-Pruner contributes to the body of work on model compression by introducing a method that emphasizes structural preservation. The dependency-based pruning ensures that critical interconnected components are removed as coherent units, avoiding the disjointed disruptions that could otherwise degrade the model's performance.

Future Developments

Future work could explore extending LLM-Pruner to achieve higher compression rates without significant performance loss. Additionally, enhancing the efficiency of importance estimation methods and refining dependency detection algorithms could further improve the robustness and applicability of the framework.

Conclusion

LLM-Pruner showcases an effective method for the structural pruning of LLMs, achieving substantial compression with minimal performance degradation. By addressing the challenges related to the size and computational demands of LLMs, it opens avenues for broader deployment and practical applications of these models in diverse environments. This approach represents a step forward in the ongoing efforts to make LLMs more accessible and efficient without compromising their core capabilities.

Authors (3)
  1. Xinyin Ma (26 papers)
  2. Gongfan Fang (33 papers)
  3. Xinchao Wang (203 papers)
Citations (268)