The LLM Surgeon (2312.17244v2)

Published 28 Dec 2023 in cs.LG and cs.CL

Abstract: State-of-the-art LLMs are becoming increasingly large in an effort to achieve the highest performance on large corpora of available textual data. However, the sheer size of the Transformer architectures makes it difficult to deploy models within computational, environmental or device-specific constraints. We explore data-driven compression of existing pretrained models as an alternative to training smaller models from scratch. To do so, we scale Kronecker-factored curvature approximations of the target loss landscape to LLMs. In doing so, we can compute both the dynamic allocation of structures that can be removed as well as updates of remaining weights that account for the removal. We provide a general framework for unstructured, semi-structured and structured pruning and improve upon weight updates to capture more correlations between weights, while remaining computationally efficient. Experimentally, our method can prune rows and columns from a range of OPT models and Llamav2-7B by 20%-30%, with a negligible loss in performance, and achieve state-of-the-art results in unstructured and semi-structured pruning of LLMs.


Summary

  • The paper introduces a novel compression framework, LLM Surgeon, that prunes LLMs to achieve up to 30% model size reduction without significant accuracy loss.
  • The approach employs structured, semi-structured, and unstructured pruning using block-diagonal Kronecker-factored curvature approximations for dynamic weight removal.
  • Empirical results confirm that iterative (multi-shot) pruning, with optional first-order corrections between shots, preserves downstream task performance.

Introduction

State-of-the-art LLMs keep growing in size in pursuit of better performance on large text corpora. However, their sheer size makes deployment difficult under computational, environmental, and device-specific constraints. This motivates compressing existing pretrained models as an alternative to training smaller models from scratch.

Model Compression Framework

This paper introduces LLM Surgeon, a general framework that supports unstructured, semi-structured, and structured pruning of LLMs. It scales block-diagonal Kronecker-factored curvature approximations of the target loss landscape to LLM size, which allows the method to compute both a dynamic allocation of the structures to remove and updates to the remaining weights that account for the removal. Rather than treating each weight independently, the approach prunes multiple weights simultaneously and updates the survivors collectively.
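
To give a rough sense of the curvature estimate involved, the sketch below shows a standard Kronecker-factored (K-FAC-style) approximation for a single linear layer: the layer's curvature block is approximated by the Kronecker product of an input second-moment matrix and an output-gradient second-moment matrix. This is a hypothetical illustration, not the authors' code; the function name, shapes, and calibration setup are invented for the example.

```python
import torch

def kfac_factors(acts, grads):
    """Hypothetical sketch: per-layer Kronecker factors of the curvature.

    acts:  (N, d_in)  layer inputs collected on calibration data
    grads: (N, d_out) gradients of the loss w.r.t. the layer's outputs

    The layer's curvature block is approximated as the Kronecker product
    A (x) G, which is far cheaper to store and invert than the full matrix.
    """
    n = acts.shape[0]
    A = acts.T @ acts / n     # (d_in,  d_in)  input second-moment matrix
    G = grads.T @ grads / n   # (d_out, d_out) output-gradient second-moment matrix
    return A, G

# Toy usage: a made-up 8 -> 4 linear layer with 256 calibration samples.
acts = torch.randn(256, 8)
grads = torch.randn(256, 4)
A, G = kfac_factors(acts, grads)
print(A.shape, G.shape)  # torch.Size([8, 8]) torch.Size([4, 4])
```

Because the two factors are only d_in x d_in and d_out x d_out, this block-diagonal, Kronecker-factored form is what makes curvature-aware pruning tractable at LLM scale.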

Pruning Approach

The method starts by estimating the curvature of the loss landscape from training data, which reflects how sensitive the model's performance is to changes in its parameters. Weight removal costs are computed for individual elements, and a global threshold is used to determine how much each layer is pruned. The process proceeds in 'shots,' or steps, where a portion of the model is pruned and the remaining weights are updated iteratively until a target size or sparsity level is reached. Between shots, first-order corrections are optionally used to further improve model accuracy after pruning.
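
To make the shot schedule concrete, here is a minimal single-tensor sketch of such a multi-shot loop, assuming a user-supplied `costs_fn` that returns per-weight removal costs (for instance, derived from a curvature estimate like the one above). It is a toy illustration rather than the paper's implementation: it uses a squared-magnitude cost as a stand-in, whereas the actual method applies the threshold globally across layers and applies curvature-aware updates to the surviving weights.

```python
import torch

def prune_in_shots(weight, costs_fn, target_sparsity=0.3, num_shots=5):
    """Toy multi-shot pruning loop (illustrative, not the authors' code).

    weight:   parameter tensor to sparsify
    costs_fn: callable (weight, mask) -> per-weight removal costs
    """
    mask = torch.ones_like(weight, dtype=torch.bool)
    for shot in range(1, num_shots + 1):
        # Cumulative sparsity target for this shot (linear schedule).
        frac = target_sparsity * shot / num_shots
        costs = costs_fn(weight, mask)
        costs[~mask] = float("-inf")   # already-removed weights stay removed
        k = max(1, int(frac * weight.numel()))
        # Global threshold: the k cheapest weights overall are removed.
        threshold = torch.kthvalue(costs.flatten(), k).values
        mask = costs > threshold
        weight = weight * mask         # zero out the pruned weights
        # (The full method would also update the surviving weights here,
        #  optionally with first-order corrections between shots.)
    return weight, mask

# Toy usage with a squared-magnitude cost standing in for the curvature-based cost.
w = torch.randn(64, 64)
pruned, mask = prune_in_shots(w, lambda W, m: W.pow(2), target_sparsity=0.3)
print(f"fraction removed: {1 - mask.float().mean().item():.2f}")  # ~0.30
```

Spreading the removal over several shots, with weight updates in between, is what lets the method reach a given sparsity with less damage than pruning everything in a single step.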

Empirical Evaluation

LLM Surgeon demonstrates, for the first time, successful structured pruning of LLMs, reducing model size by up to 30% with minimal impact on performance. The method also achieves state-of-the-art results for unstructured and semi-structured pruning. Beyond test performance, the compressed models retain their effectiveness on downstream tasks, demonstrating the method's practicality. The framework also allows a flexible trade-off between the extra computation spent during compression and the accuracy of the final compressed model, making it a valuable approach for deploying LLMs in constrained environments.