
Fast and Effective Weight Update for Pruned Large Language Models (2401.02938v2)

Published 1 Jan 2024 in cs.CL and cs.LG

Abstract: Pruning LLMs is a challenging task due to their enormous size. The primary difficulty is fine-tuning the model after pruning, which is needed to recover the lost performance caused by dropping weights. Recent approaches have either ignored fine-tuning entirely, focusing on efficient pruning criteria, or attempted layer-wise weight updates, preserving the behavior of each layer. However, even layer-wise weight updates can be costly for LLMs, and previous works have resorted to various approximations. In our paper, we propose a fast and effective weight update algorithm for pruned layers based on the Alternating Direction Method of Multipliers (ADMM). We further extend it with a simple gradual pruning mask selection and achieve state-of-the-art pruning performance across a wide range of LLMs. Code is available at https://github.com/fmfi-compbio/admm-pruning.

Introduction

The ongoing development of LLMs has led to remarkable advances in a wide variety of language tasks. Nevertheless, deploying these models poses significant challenges, mainly due to their size, which results in substantial memory and computational resource requirements. Previous attempts to tackle these issues include parameter quantization and pruning, but the latter has gained less traction, primarily because of the difficulty of fine-tuning pruned networks. Existing solutions have either skipped fine-tuning entirely or relied on layer-wise weight updates, which, although intended to be efficient, remain costly for LLMs and often resort to approximations.

Optimizing Pruning via Alternating Direction Method of Multipliers (ADMM)

In this context, the paper introduces an efficient algorithm for updating the weights of pruned LLMs based on the Alternating Direction Method of Multipliers (ADMM), a well-established optimization technique. Coupled with a straightforward iterative pruning mask selection, this algorithm avoids the approximations required by earlier layer-wise approaches and achieves state-of-the-art pruning performance while preserving the model's original functionality. The ADMM-based update requires only a single matrix inversion and a few simple iterations, yielding optimal weight updates for a given pruning mask.
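
As a rough illustration of this kind of layer-wise ADMM update (a minimal sketch, not the paper's exact implementation; the NumPy formulation, the penalty parameter `rho`, and the iteration count are illustrative assumptions), the update alternates a ridge-like least-squares step, a projection onto the pruning mask, and a dual update, reusing a single precomputed inverse:

```python
import numpy as np

def admm_weight_update(W0, X, mask, rho=1.0, n_iters=20):
    """Illustrative ADMM-style update of a pruned layer's weights.

    Approximately minimizes ||X @ W - X @ W0||_F^2 subject to W being zero
    outside `mask`. W0: (d_in, d_out) dense weights, X: (n, d_in) calibration
    inputs, mask: boolean (d_in, d_out), True = weight kept.
    """
    H = X.T @ X                                   # layer-wise Hessian proxy
    # Single matrix inversion, reused across all iterations.
    H_inv = np.linalg.inv(H + rho * np.eye(H.shape[0]))
    target = H @ W0                               # X^T X W0, fixed throughout
    Z = W0 * mask                                 # masked copy (auxiliary primal variable)
    U = np.zeros_like(W0)                         # scaled dual variable
    for _ in range(n_iters):
        W = H_inv @ (target + rho * (Z - U))      # least-squares step
        Z = (W + U) * mask                        # projection onto the mask
        U = U + W - Z                             # dual update
    return Z                                      # sparse weights respecting the mask
```

Because the inverse is computed once and reused, each iteration reduces to a couple of matrix multiplications, which is what makes this style of update cheap compared to repeated exact solves.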

Pruning Mask Selection and Weight Update

To prune effectively, the paper also examines how to select the pruning mask. Drawing on recent literature on mask selection, the authors use a norm-based rule to determine the significance of weights and their eligibility for removal. A preconditioning step scales the weight matrices and calibration inputs so that the resulting pruning decisions are equivalent to those of the Wanda algorithm. Their iterative pruning follows a sparsity schedule that spreads mask selection across several steps, gradually increasing sparsity while interleaving each step with the ADMM weight update.
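
To make the mask-selection side concrete, the sketch below combines a Wanda-style importance score (|weight| scaled by the L2 norm of the corresponding input activation) with a simple gradual sparsity schedule; the cubic ramp, the step count, and the reuse of `admm_weight_update` from the previous sketch are assumptions for illustration rather than the paper's exact recipe:

```python
import numpy as np

def wanda_scores(W, X):
    """Wanda-style importance: |W_ij| scaled by the L2 norm of input feature i."""
    act_norm = np.linalg.norm(X, axis=0)          # (d_in,)
    return np.abs(W) * act_norm[:, None]          # (d_in, d_out)

def gradual_prune(W0, X, final_sparsity=0.5, n_steps=4, **admm_kwargs):
    """Prune a layer over several steps, re-running the weight update each time.

    Assumes `admm_weight_update` from the previous sketch is in scope.
    """
    W = W0.copy()
    for step in range(1, n_steps + 1):
        # Cubic schedule: sparsity ramps up smoothly toward the final target.
        sparsity = final_sparsity * (1 - (1 - step / n_steps) ** 3)
        scores = wanda_scores(W, X)
        k = int(sparsity * W.size)                # roughly k weights to drop (ties kept)
        threshold = np.partition(scores, k, axis=None)[k]
        mask = scores >= threshold                # keep the highest-scoring weights
        W = admm_weight_update(W0, X, mask, **admm_kwargs)
    return W, mask
```

Scoring the current (already partially pruned) weights keeps previously removed connections at zero importance, so the mask only grows between steps while the weight update always reconstructs from the original dense weights.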

Experimental Validation and Conclusions

Through extensive experimentation, the paper validates the proposed algorithm against alternatives such as SparseGPT and AdaPrune, demonstrating faster convergence and higher-quality weight updates. The method's efficacy is illustrated on LLMs across a range of pruning sparsities, where it outperforms existing approaches, particularly in iterative pruning setups. The authors acknowledge the limitations of their work, including that the weight update does not exploit sparsity to accelerate computation, and leave potential improvements, such as nonuniform sparsity or more nuanced mask selection algorithms, for future work.

The research concludes by presenting the ADMM-based weight update as a sound and practical way to improve the scalability and deployment feasibility of LLMs. The contribution is notable not only for advancing pruning performance but also for setting a benchmark for future research in this area.

References (34)
  1. Intriguing properties of quantization at scale. arXiv preprint arXiv:2305.19268, 2023.
  2. Benzing, F. Gradient descent on neurons and its link to approximate second-order optimization. In International Conference on Machine Learning, pp. 1817–1853. PMLR, 2022.
  3. What is the state of neural network pruning? Proceedings of Machine Learning and Systems, 2:129–146, 2020.
  4. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends® in Machine Learning, 3(1):1–122, 2011.
  5. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
  6. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.
  7. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.
  8. The case for 4-bit precision: k-bit inference scaling laws. In International Conference on Machine Learning, pp. 7750–7774. PMLR, 2023.
  9. LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.
  10. Optimal brain compression: A framework for accurate post-training quantization and pruning. Advances in Neural Information Processing Systems, 35:4475–4488, 2022.
  11. SparseGPT: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, pp. 10323–10337. PMLR, 2023.
  12. A framework for few-shot language model evaluation. Version v0.0.1, September 2021.
  13. Model compression with adversarial robustness: A unified optimization framework. Advances in Neural Information Processing Systems, 32, 2019.
  14. Learning both weights and connections for efficient neural network. Advances in Neural Information Processing Systems, 28, 2015.
  15. Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks, pp. 293–299. IEEE, 1993.
  16. Accelerated sparse neural training: A provable and efficient method to find N:M transposable masks. Advances in Neural Information Processing Systems, 34:21099–21111, 2021.
  17. Optimal brain damage. Advances in Neural Information Processing Systems, 2, 1989.
  18. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270, 2018.
  19. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
  20. Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
  21. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  22. WinoGrande: An adversarial Winograd Schema Challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
  23. A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023.
  24. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  25. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  26. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
  27. Flash-LLM: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity. arXiv preprint arXiv:2309.10285, 2023.
  28. SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp. 38087–38099. PMLR, 2023.
  29. Adversarial robustness vs. model compression, or both? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 111–120, 2019.
  30. Outlier weighed layerwise sparsity (OWL): A missing secret sauce for pruning LLMs to high sparsity. arXiv preprint arXiv:2310.05175, 2023.
  31. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
  32. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
  33. A systematic DNN weight pruning framework using alternating direction method of multipliers. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 184–199, 2018.
  34. To prune, or not to prune: Exploring the efficacy of pruning for model compression. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Workshop Track Proceedings. OpenReview.net, 2018. URL https://openreview.net/forum?id=Sy1iIDkPM.
Authors (1)
  1. Vladimír Boža