PERP: Rethinking the Prune-Retrain Paradigm in the Era of LLMs (2312.15230v2)
Abstract: Neural networks can be efficiently compressed through pruning, significantly reducing storage and computational demands while maintaining predictive performance. Simple yet effective methods like Iterative Magnitude Pruning (IMP; Han et al., 2015) remove less important parameters and require a costly retraining procedure to recover performance after pruning. However, with the rise of LLMs, full retraining has become infeasible due to memory and compute constraints. In this study, we challenge the practice of retraining all parameters by demonstrating that updating only a small subset of highly expressive parameters is often sufficient to recover or even improve performance compared to full retraining. Surprisingly, retraining as little as 0.27%-0.35% of the parameters of GPT architectures achieves performance comparable to One Shot IMP across various sparsity levels. Our approach, Parameter-Efficient Retraining after Pruning (PERP), drastically reduces compute and memory demands, enabling pruning and retraining of models with up to 30 billion parameters on a single NVIDIA A100 GPU within minutes. Although magnitude pruning is widely considered unsuitable for pruning LLMs, our findings show that PERP positions it as a strong contender against state-of-the-art retraining-free approaches such as Wanda (Sun et al., 2023) and SparseGPT (Frantar & Alistarh, 2023), opening up a promising alternative to avoiding retraining entirely.
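To make the prune-then-partially-retrain recipe concrete, here is a minimal sketch in PyTorch. It assumes a Hugging Face-style causal LM, one-shot global magnitude pruning, and that the retrained "highly expressive" subset consists of biases and LayerNorm parameters; these are illustrative assumptions for the sketch, not the paper's exact recipe.

```python
# Minimal sketch of prune-then-partially-retrain, assuming PyTorch and a
# Hugging Face-style causal LM. The parameter subset (biases + LayerNorm) and
# the use of global one-shot magnitude pruning are illustrative assumptions.
import torch
import torch.nn as nn


def magnitude_prune_(model: nn.Module, sparsity: float) -> None:
    """One-shot global magnitude pruning: zero out the smallest-magnitude weights."""
    weight_params = [p for p in model.parameters() if p.dim() > 1]
    magnitudes = torch.cat([p.detach().abs().flatten().float() for p in weight_params])
    k = max(1, int(sparsity * magnitudes.numel()))
    threshold = torch.kthvalue(magnitudes, k).values
    for p in weight_params:
        mask = p.detach().abs() > threshold
        p.data.mul_(mask.to(p.dtype))


def freeze_all_but_expressive_(model: nn.Module) -> list[nn.Parameter]:
    """Freeze the pruned weight matrices; keep only biases and LayerNorm params trainable.

    Because the sparse weight matrices stay frozen, their zeroed entries remain
    zero during retraining and no pruning masks need to be maintained.
    """
    trainable_ids = set()
    for module in model.modules():
        if isinstance(module, nn.LayerNorm):
            trainable_ids.update(id(p) for p in module.parameters())
    for name, p in model.named_parameters():
        if name.endswith("bias"):
            trainable_ids.add(id(p))
    for p in model.parameters():
        p.requires_grad_(id(p) in trainable_ids)
    return [p for p in model.parameters() if p.requires_grad]


def retrain(model: nn.Module, dataloader, steps: int = 100, lr: float = 1e-4) -> None:
    """Brief retraining of the small trainable subset after pruning."""
    params = freeze_all_but_expressive_(model)
    optimizer = torch.optim.AdamW(params, lr=lr)
    model.train()
    for _, batch in zip(range(steps), dataloader):
        # Assumes batches are dicts with "input_ids" and a causal-LM loss head.
        loss = model(input_ids=batch["input_ids"], labels=batch["input_ids"]).loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

Because all weight matrices are frozen, gradients and optimizer states are only kept for the tiny trainable subset, which is what makes retraining a pruned multi-billion-parameter model feasible on a single GPU in this setup.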
- Intrinsic dimensionality explains the effectiveness of language model fine-tuning. December 2020.
- Layer normalization. July 2016.
- Sparse networks from scratch: Faster training without losing performance. arXiv preprint arXiv:1907.04840, July 2019.
- LLM.int8(): 8-bit matrix multiplication for transformers at scale. August 2022.
- Global sparse momentum SGD for pruning very deep neural networks. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/f34185c4ca5d58e781d4f14173d41e5d-Paper.pdf.
- Rigging the lottery: Making all tickets winners. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 2943–2952. PMLR, 13–18 Jul 2020. URL https://proceedings.mlr.press/v119/evci20a.html.
- Head2Toe: Utilizing intermediate representations for better transfer learning. In Proceedings of the 39th International Conference on Machine Learning (ICML 2022), 2022.
- The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations, 2018.
- Training BatchNorm and only BatchNorm: On the expressive power of random features in CNNs. February 2020.
- SparseGPT: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, pp. 10323–10337. PMLR, 2023.
- The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574, 2019.
- The expressive power of tuning only the normalization layers. February 2023.
- Learning both weights and connections for efficient neural networks. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/ae0eb3eed39d2bcef4622b2499a05fe6-Paper.pdf.
- Second order derivatives for network pruning: Optimal brain surgeon. In Hanson, S., Cowan, J., and Giles, C. (eds.), Advances in Neural Information Processing Systems, volume 5. Morgan-Kaufmann, 1993. URL https://proceedings.neurips.cc/paper/1992/file/303ed4c69846ab36c2904d3ba8573050-Paper.pdf.
- Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015.
- SparseAdapter: An easy approach for improving the parameter-efficiency of adapters. October 2022.
- Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks. arXiv preprint arXiv:2102.00554, January 2021.
- Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pp. 2790–2799. PMLR, 2019.
- LoRA: Low-rank adaptation of large language models. June 2021.
- Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Bach, F. R. and Blei, D. M. (eds.), Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, volume 37 of JMLR Workshop and Conference Proceedings, pp. 448–456. JMLR.org, 2015. URL http://proceedings.mlr.press/v37/ioffe15.html.
- Compressing LLMs: The truth is rarely pure and never simple. October 2023a.
- Instant soup: Cheap pruning ensembles in a single pass can draw lottery tickets from large models. In International Conference on Machine Learning, pp. 14691–14701. PMLR, 2023b.
- Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- ImageNet classification with deep convolutional neural networks. In Pereira, F., Burges, C., Bottou, L., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.
- Fine-tuning can distort pretrained features and underperform out-of-distribution. February 2022a.
- How to fine-tune vision models with SGD. November 2022b.
- A fast post-training pruning framework for transformers. March 2022.
- Network pruning that matters: A case study on retraining variants. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Cb54AMqHQFP.
- Optimal brain damage. In Touretzky, D. S. (ed.), Advances in Neural Information Processing Systems 2, [NIPS Conference, Denver, Colorado, USA, November 27-30, 1989], pp. 598–605. Morgan Kaufmann, 1989. URL http://papers.nips.cc/paper/250-optimal-brain-damage.
- Layer-adaptive sparsity for the magnitude-based pruning. In International Conference on Learning Representations, October 2020.
- EagleEye: Fast sub-net evaluation for efficient neural network pruning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pp. 639–654. Springer, 2020.
- Parameter-efficient sparsity for large language models fine-tuning. May 2022.
- Scaling down to scale up: A guide to parameter-efficient fine-tuning. March 2023a.
- Stack more layers differently: High-rank training through low-rank updates. July 2023b.
- Dynamic model pruning with feedback. In International Conference on Learning Representations, 2020.
- TransTailor: Pruning the pre-trained model for improved transfer learning. March 2021.
- Dynamic sparse training: Find efficient sparse network from scratch with trainable masked layers. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SJlbGJrtDB.
- Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
- Pointer sentinel mixture models. September 2016.
- Accelerating sparse deep neural networks. April 2021.
- Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9(1), June 2018. doi: 10.1038/s41467-018-04316-3.
- Pruning convolutional neural networks for resource efficient inference. November 2016.
- K for the price of 1: Parameter-efficient multi-task and transfer learning. October 2018.
- Deep neural network training with Frank-Wolfe. arXiv preprint arXiv:2010.07243, 2020.
- Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
- Comparing rewinding and fine-tuning in neural network pruning. In International Conference on Learning Representations, 2020.
- ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015. doi: 10.1007/s11263-015-0816-y.
- A simple and effective pruning approach for large language models. June 2023.
- Pruning neural networks without any data by iteratively conserving synaptic flow. Advances in Neural Information Processing Systems 2020, June 2020.
- Efficient fine-tuning of BERT models on the edge. May 2022. doi: 10.1109/ISCAS48785.2022.9937567.
- How does calibration data affect the post-training pruning and quantization of large language models? November 2023.
- Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, Online, October 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.6. URL https://aclanthology.org/2020.emnlp-demos.6.
- How much pre-training is enough to discover a good subnetwork? July 2021.
- Discovering neural wirings. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/d010396ca8abf6ead8cacc2c2f2f26c7-Paper.pdf.
- Pruning by explaining: A novel criterion for deep neural network pruning. December 2019.
- Outlier weighed layerwise sparsity (OWL): A missing secret sauce for pruning LLMs to high sparsity. October 2023.
- Drawing early-bird tickets: Toward more efficient training of deep networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BJxsrgStvr.
- BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. June 2021.
- Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, November 2016.
- Pruning meets low-rank parameter-efficient fine-tuning. May 2023a.
- OPT: Open pre-trained transformer language models. May 2022.
- Dynamic sparse no training: Training-free fine-tuning for sparse LLMs. October 2023b.
- Efficient lottery ticket finding: Less data is more. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 12380–12390. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/zhang21c.html.
- To prune, or not to prune: Exploring the efficacy of pruning for model compression. arXiv preprint arXiv:1710.01878, October 2017.
- Compression-aware training of neural networks using Frank-Wolfe. arXiv preprint arXiv:2205.11921, 2022.
- How I Learned To Stop Worrying And Love Retraining. In International Conference on Learning Representations, 2023a. URL https://openreview.net/forum?id=_nF5imFKQI.
- Sparse model soups: A recipe for improved pruning via model averaging. June 2023b.
- Max Zimmer
- Megi Andoni
- Christoph Spiegel
- Sebastian Pokutta