The LLM Surgeon (2312.17244v2)

Published 28 Dec 2023 in cs.LG and cs.CL

Abstract: State-of-the-art LLMs are becoming increasingly large in an effort to achieve the highest performance on large corpora of available textual data. However, the sheer size of the Transformer architectures makes it difficult to deploy models within computational, environmental or device-specific constraints. We explore data-driven compression of existing pretrained models as an alternative to training smaller models from scratch. To do so, we scale Kronecker-factored curvature approximations of the target loss landscape to LLMs. In doing so, we can compute both the dynamic allocation of structures that can be removed as well as updates of remaining weights that account for the removal. We provide a general framework for unstructured, semi-structured and structured pruning and improve upon weight updates to capture more correlations between weights, while remaining computationally efficient. Experimentally, our method can prune rows and columns from a range of OPT models and Llamav2-7B by 20%-30%, with a negligible loss in performance, and achieve state-of-the-art results in unstructured and semi-structured pruning of LLMs.


Summary

  • The paper introduces a novel compression framework, LLM Surgeon, that prunes LLMs to achieve up to 30% model size reduction without significant accuracy loss.
  • The approach employs structured, semi-structured, and unstructured pruning using block-diagonal Kronecker-factored curvature approximations for dynamic weight removal.
  • Empirical results confirm that iterative (multi-shot) pruning, with optional first-order corrections between shots, preserves downstream task performance.

Introduction

State-of-the-art LLMs keep growing in size in pursuit of better performance on large text corpora. However, their sheer size makes deployment difficult under computational, environmental, and device-specific constraints. This motivates compressing existing pretrained models as an alternative to training smaller models from scratch.

Model Compression Framework

This paper introduces LLM Surgeon, a general framework that supports unstructured, semi-structured, and structured pruning of LLMs. It scales block-diagonal Kronecker-factored curvature approximations of the target loss landscape to LLM size, which allows the method to compute both a dynamic allocation of the structures to remove and updates to the remaining weights that account for the removal. Rather than treating each weight independently, the approach prunes multiple weights simultaneously and updates the survivors collectively.
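
To give a rough sense of the curvature estimate involved, the sketch below shows a standard Kronecker-factored (K-FAC-style) approximation for a single linear layer: the layer's curvature block is approximated by the Kronecker product of an input second-moment matrix and an output-gradient second-moment matrix. This is a hypothetical illustration, not the authors' code; the function name, shapes, and calibration setup are invented for the example.

```python
import torch

def kfac_factors(acts, grads):
    """Hypothetical sketch: per-layer Kronecker factors of the curvature.

    acts:  (N, d_in)  layer inputs collected on calibration data
    grads: (N, d_out) gradients of the loss w.r.t. the layer's outputs

    The layer's curvature block is approximated as the Kronecker product
    A (x) G, which is far cheaper to store and invert than the full matrix.
    """
    n = acts.shape[0]
    A = acts.T @ acts / n     # (d_in,  d_in)  input second-moment matrix
    G = grads.T @ grads / n   # (d_out, d_out) output-gradient second-moment matrix
    return A, G

# Toy usage: a made-up 8 -> 4 linear layer with 256 calibration samples.
acts = torch.randn(256, 8)
grads = torch.randn(256, 4)
A, G = kfac_factors(acts, grads)
print(A.shape, G.shape)  # torch.Size([8, 8]) torch.Size([4, 4])
```

Because the two factors are only d_in x d_in and d_out x d_out, this block-diagonal, Kronecker-factored form is what makes curvature-aware pruning tractable at LLM scale.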

Pruning Approach

The method starts by estimating the curvature of the loss landscape from training data, which reflects how sensitive the model's performance is to changes in its parameters. Weight removal costs are computed for individual elements, and a global threshold is used to determine how much each layer is pruned. The process proceeds in 'shots,' or steps, where a portion of the model is pruned and the remaining weights are updated iteratively until a target size or sparsity level is reached. Between shots, first-order corrections are optionally used to further improve model accuracy after pruning.
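
To make the shot schedule concrete, here is a minimal single-tensor sketch of such a multi-shot loop, assuming a user-supplied `costs_fn` that returns per-weight removal costs (for instance, derived from a curvature estimate like the one above). It is a toy illustration rather than the paper's implementation: it uses a squared-magnitude cost as a stand-in, whereas the actual method applies the threshold globally across layers and applies curvature-aware updates to the surviving weights.

```python
import torch

def prune_in_shots(weight, costs_fn, target_sparsity=0.3, num_shots=5):
    """Toy multi-shot pruning loop (illustrative, not the authors' code).

    weight:   parameter tensor to sparsify
    costs_fn: callable (weight, mask) -> per-weight removal costs
    """
    mask = torch.ones_like(weight, dtype=torch.bool)
    for shot in range(1, num_shots + 1):
        # Cumulative sparsity target for this shot (linear schedule).
        frac = target_sparsity * shot / num_shots
        costs = costs_fn(weight, mask)
        costs[~mask] = float("-inf")   # already-removed weights stay removed
        k = max(1, int(frac * weight.numel()))
        # Global threshold: the k cheapest weights overall are removed.
        threshold = torch.kthvalue(costs.flatten(), k).values
        mask = costs > threshold
        weight = weight * mask         # zero out the pruned weights
        # (The full method would also update the surviving weights here,
        #  optionally with first-order corrections between shots.)
    return weight, mask

# Toy usage with a squared-magnitude cost standing in for the curvature-based cost.
w = torch.randn(64, 64)
pruned, mask = prune_in_shots(w, lambda W, m: W.pow(2), target_sparsity=0.3)
print(f"fraction removed: {1 - mask.float().mean().item():.2f}")  # ~0.30
```

Spreading the removal over several shots, with weight updates in between, is what lets the method reach a given sparsity with less damage than pruning everything in a single step.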

Empirical Evaluation

LLM Surgeon demonstrates, for the first time, successful structured pruning of LLMs, reducing model size by up to 30% with minimal impact on performance. The method also achieves state-of-the-art results for unstructured and semi-structured pruning. Beyond test performance, the compressed models retain their effectiveness on downstream tasks, demonstrating the method's practicality. The framework also allows a flexible trade-off between the extra computation spent during compression and the accuracy of the final compressed model, making it a valuable approach for deploying LLMs in constrained environments.