Structural Pruning of LLMs with LLM-Pruner
Introduction
LLMs have demonstrated impressive abilities across language understanding and generation tasks. Despite this performance, their sheer scale and computational demands make deployment and inference challenging. Most existing LLM compression methods target task-specific scenarios, relying heavily on the original training corpus and requiring prolonged post-training. To address these issues, this paper introduces LLM-Pruner, a novel framework for the structural pruning of LLMs that compresses the model while preserving its task-agnostic capabilities and minimizing dependence on the original training data.
Methodology
LLM-Pruner is designed as a three-stage process, consisting of Dependency Discovery, Importance Estimation, and Model Recovery.
- Dependency Discovery: The method starts by identifying groups of coupled structures within the LLM. A dependency graph is built by treating each neuron as a trigger and iteratively collecting every neuron affected by it, so that coupled neurons are grouped together. This automatic detection is crucial for maintaining the structural integrity of the model during pruning (a minimal sketch of the traversal appears after this list).
- Importance Estimation: Once the coupled structures are grouped, their importance is estimated from first-order gradient information together with an approximation of the second-order Hessian term. Per-element scores can be aggregated over a group in several ways, for example by summation or by taking the maximum. These estimates guide pruning toward the groups whose removal is expected to disturb performance the least (an illustrative scoring sketch also follows the list).
- Model Recovery: To counteract performance degradation after pruning, low-rank adaptation (LoRA) is used for rapid recovery on a limited dataset. Because only the low-rank adapters are trained, this step greatly reduces training cost and time, allowing the pruned model to be fine-tuned quickly (a minimal recovery sketch is included below).
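
The sketch below illustrates the dependency-propagation idea behind the Discovery stage: starting from a trigger unit, every structurally coupled unit is collected by graph traversal. The dependency table and the layer names (attn.q_proj and so on) are hypothetical placeholders; LLM-Pruner derives such dependencies automatically from the model's computation, whereas this example hard-codes a tiny graph for clarity.

```python
from collections import deque

# Hypothetical dependency table: each prunable unit (layer name, channel index)
# maps to the units that must be pruned together with it.
dependencies = {
    ("attn.q_proj", 0): [("attn.k_proj", 0), ("attn.v_proj", 0)],
    ("attn.k_proj", 0): [("attn.o_proj", 0)],
    ("attn.v_proj", 0): [("attn.o_proj", 0)],
    ("attn.o_proj", 0): [],
}

def collect_group(trigger, dependencies):
    """Breadth-first traversal gathering every unit coupled to `trigger`."""
    group, queue = {trigger}, deque([trigger])
    while queue:
        unit = queue.popleft()
        for dep in dependencies.get(unit, []):
            if dep not in group:
                group.add(dep)
                queue.append(dep)
    return group

# Triggering one attention channel pulls in every structurally coupled channel.
print(collect_group(("attn.q_proj", 0), dependencies))
```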
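
The next minimal PyTorch sketch shows one way to compute a first-order Taylor importance score (|gradient × weight| per output channel) and aggregate it over a coupled group by summation or by maximum. The toy two-layer block, the random calibration batch, and the scoring function are illustrative assumptions rather than the framework's actual implementation, which additionally approximates the second-order Hessian term.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy two-layer block standing in for a coupled group of projections.
layer_a = nn.Linear(16, 8)
layer_b = nn.Linear(8, 16)

# One calibration batch (random here; a small real-text batch in practice).
x = torch.randn(4, 16)
loss = layer_b(torch.relu(layer_a(x))).pow(2).mean()
loss.backward()

def channel_importance(linear):
    # First-order Taylor term |grad * weight|, summed over each output channel.
    return (linear.weight.grad * linear.weight).abs().sum(dim=1)

# The coupled group spans both layers: output channel i of layer_a is tied to
# input channel i of layer_b, so their scores are aggregated jointly.
imp_a = channel_importance(layer_a)                               # shape [8]
imp_b = (layer_b.weight.grad * layer_b.weight).abs().sum(dim=0)   # shape [8]

group_importance_sum = imp_a + imp_b                 # aggregation by summation
group_importance_max = torch.maximum(imp_a, imp_b)   # aggregation by maximum

# Prune the lowest-scoring coupled channels first.
print(group_importance_sum.argsort()[:2])
```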
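
Finally, a minimal sketch of the recovery idea: the pruned weights are frozen and only a low-rank update B·A is trained on a small dataset. The LoRALinear wrapper, rank, scaling, and training loop below are simplified assumptions for illustration; in practice a LoRA library would be applied to the pruned model's projection layers.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # the pruned weights stay fixed
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)

# Wrap a (pruned) projection and recover on a small calibration set.
layer = LoRALinear(nn.Linear(32, 32))
opt = torch.optim.AdamW([layer.lora_a, layer.lora_b], lr=1e-3)
x, target = torch.randn(8, 32), torch.randn(8, 32)
for _ in range(10):
    opt.zero_grad()
    loss = (layer(x) - target).pow(2).mean()
    loss.backward()
    opt.step()
```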
Experimental Validation
The efficacy of LLM-Pruner was validated on three LLMs: LLaMA-7B, Vicuna-7B, and ChatGLM-6B. The experiments aimed to evaluate the generation quality and zero-shot classification performance post-pruning.
Results Overview
The experimental results indicate that LLM-Pruner reduces the parameter count while retaining a high fraction of the original model's performance. For instance, with 20% of parameters removed, the pruned LLaMA-7B retained 94.97% of the original performance after tuning, demonstrating the method's robustness. The pruned models were also competitive with uncompressed models of similar size, such as ChatGLM-6B.
Discussion
LLM-Pruner has significant practical implications for deploying LLMs in resource-constrained environments. By reducing model size and inference latency while preserving task versatility, it allows computational resources to be used more efficiently. This is particularly beneficial when the original training dataset is limited or proprietary, making extensive post-training impractical.
Theoretical Implications
From a theoretical standpoint, LLM-Pruner contributes to the body of work on model compression by introducing a method that emphasizes structural preservation. Dependency-based pruning ensures that interconnected components are removed cohesively, preventing the inconsistent removal of coupled weights that would otherwise degrade the model's performance.
Future Developments
Future work could explore extending LLM-Pruner to achieve higher compression rates without significant performance loss. Additionally, enhancing the efficiency of importance estimation methods and refining dependency detection algorithms could further improve the robustness and applicability of the framework.
Conclusion
LLM-Pruner showcases an effective method for the structural pruning of LLMs, achieving substantial compression with minimal performance degradation. By addressing the challenges related to the size and computational demands of LLMs, it opens avenues for broader deployment and practical applications of these models in diverse environments. This approach represents a step forward in the ongoing efforts to make LLMs more accessible and efficient without compromising their core capabilities.