BESA: Pruning Large Language Models with Blockwise Parameter-Efficient Sparsity Allocation (2402.16880v2)
Abstract: LLMs have demonstrated outstanding performance in various tasks, such as text summarization and question answering. While their performance is impressive, the computational footprint due to their vast number of parameters can be prohibitive. Existing solutions such as SparseGPT and Wanda attempt to alleviate this issue through weight pruning. However, their layer-wise approach results in significant perturbation to the model's output and requires meticulous hyperparameter tuning, such as the pruning rate, which can adversely affect overall model performance. To address this, this paper introduces a novel LLM pruning technique dubbed blockwise parameter-efficient sparsity allocation (BESA), which applies a blockwise reconstruction loss. In contrast to typical layer-wise pruning techniques, BESA is characterized by two distinctive attributes: i) it targets the overall pruning error with respect to individual transformer blocks, and ii) it allocates layer-specific sparsity in a differentiable manner, both of which ensure reduced performance degradation after pruning. Our experiments show that BESA achieves state-of-the-art performance, efficiently pruning LLMs such as LLaMA1 and LLaMA2 with 7B to 70B parameters on a single A100 GPU in just five hours. Code is available at https://github.com/OpenGVLab/LLMPrune-BESA.
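The abstract describes two ideas: measuring pruning error on a whole transformer block's output rather than per layer, and learning each layer's sparsity ratio with gradients. The following PyTorch sketch is not the authors' released code; the magnitude-based importance metric, the straight-through mask trick, the budget penalty, and all names and hyperparameters are assumptions chosen only to make the two ideas concrete.

```python
# Minimal sketch (assumptions throughout): blockwise reconstruction loss plus
# differentiable per-layer sparsity allocation via learnable logits.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy stand-in for one transformer block: two frozen dense linear layers.
dense = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
for p in dense.parameters():
    p.requires_grad_(False)

# One learnable logit per prunable layer -> differentiable sparsity allocation.
logits = nn.Parameter(torch.zeros(2))
target_sparsity = 0.5                      # desired average sparsity for the block
opt = torch.optim.Adam([logits], lr=1e-2)

def masked_forward(x, sparsities):
    """Run the block with magnitude-based masks at the given per-layer sparsities.
    The hard top-k mask is not differentiable, so a straight-through-style term
    (mask + s - s.detach()) lets gradients reach the sparsity parameters."""
    h, layer_idx = x, 0
    for m in dense:
        if isinstance(m, nn.Linear):
            s = sparsities[layer_idx]
            w = m.weight
            k = max(int(w.numel() * float(1.0 - s)), 1)   # number of weights to keep
            thresh = torch.topk(w.abs().flatten(), k).values.min()
            mask = (w.abs() >= thresh).float() + s - s.detach()
            h = F.linear(h, w * mask, m.bias)
            layer_idx += 1
        else:
            h = m(h)
    return h

calib = torch.randn(128, 64)               # stand-in for calibration activations
with torch.no_grad():
    dense_out = dense(calib)               # reference output of the unpruned block

for step in range(200):
    sparsities = torch.sigmoid(logits)     # per-layer sparsity in (0, 1)
    recon = F.mse_loss(masked_forward(calib, sparsities), dense_out)
    budget = (sparsities.mean() - target_sparsity) ** 2   # meet the overall budget
    loss = recon + 10.0 * budget
    opt.zero_grad()
    loss.backward()
    opt.step()

print("learned per-layer sparsity:", torch.sigmoid(logits).detach().tolist())
```

The point of the sketch is only the structure of the objective: the reconstruction loss is taken over the whole block's output while the sparsity ratio of each layer inside the block is a learned, differentiable quantity. The paper's actual importance metric, mask parameterization, and sparsity regularization are more elaborate than shown here.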
- Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
- PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 7432–7439, 2020.
- A unified framework for soft threshold pruning. arXiv preprint arXiv:2302.13019, 2023.
- BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.
- Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
- LLM.int8(): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.
- Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning, pp. 2943–2952. PMLR, 2020.
- SparseGPT: Massive language models can be accurately pruned in one-shot. arXiv preprint arXiv:2301.00774, 2023.
- GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.
- A framework for few-shot language model evaluation, September 2021. URL https://doi.org/10.5281/zenodo.5371628.
- Second order derivatives for network pruning: Optimal brain surgeon. Advances in neural information processing systems, 5, 1992.
- Optimal brain surgeon and general network pruning. In IEEE international conference on neural networks, pp. 293–299. IEEE, 1993.
- Convolution-weight-distribution assumption: Rethinking the criteria of channel pruning. arXiv preprint arXiv:2004.11627, 2020.
- Operation-aware soft channel pruning using differentiable masks. In International Conference on Machine Learning, pp. 5122–5131. PMLR, 2020.
- SqueezeLLM: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629, 2023.
- Soft threshold weight reparameterization for learnable sparsity. In International Conference on Machine Learning, pp. 5544–5555. PMLR, 2020.
- AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.
- LLM-Pruner: On the structural pruning of large language models. arXiv preprint arXiv:2305.11627, 2023.
- The penn treebank: Annotating predicate argument structure. In Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994, 1994.
- Stephen Merity. The wikitext long term dependency language modeling dataset. Salesforce Metamind, 9, 2016.
- The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.
- Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
- WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
- OmniQuant: Omnidirectionally calibrated quantization for large language models. arXiv preprint arXiv:2308.13137, 2023.
- A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023.
- InternLM Team. InternLM: A multilingual language model with progressively enhanced capabilities, 2023.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
- Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, Online, October 2020. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/2020.emnlp-demos.6.
- WizardLM: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023.
- ViTCoD: Vision transformer acceleration via dedicated algorithm and accelerator co-design. In The 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA-29), 2023.
- HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
- GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
- OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022a.
- Learning best combination for efficient N:M sparsity. Advances in Neural Information Processing Systems, 35:941–953, 2022b.
- Lottery jackpots exist in pre-trained models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
- Peng Xu
- Wenqi Shao
- Mengzhao Chen
- Shitao Tang
- Kaipeng Zhang
- Peng Gao
- Fengwei An
- Yu Qiao
- Ping Luo