
Pruning Large Language Models with Semi-Structural Adaptive Sparse Training (2407.20584v3)

Published 30 Jul 2024 in cs.CL and cs.AI

Abstract: The remarkable success of LLMs relies heavily on their substantial scale, which poses significant challenges during model deployment in terms of latency and memory consumption. Recently, numerous studies have attempted to compress LLMs using one-shot pruning methods. However, these methods often suffer from considerable performance degradation on complex language understanding tasks, raising concerns about the feasibility of pruning in LLMs. To address this issue, we propose Adaptive Sparse Trainer (AST), a novel and efficient retraining framework tailored for semi-structured sparse models. AST enables models to learn optimal masks during the weight update process without incurring additional computational overhead. Furthermore, we demonstrate that incorporating knowledge distillation significantly improves retraining efficiency and enhances model performance under fixed computational constraints. Additionally, a supplementary set of well-initialized parameters is integrated to further augment the model's efficacy. AST achieves state-of-the-art performance with minimal training cost. When applied to the LLaMA2-7B model, AST reduces the perplexity and zero-shot accuracy gap between dense and 2:4 semi-structured sparse models to 0.6 and 1.16%, respectively, utilizing less than 0.4% of the pretraining tokens and GPU hours. Our work demonstrates the feasibility of deploying semi-structured sparse LLMs and offers a promising alternative for achieving highly compressed models when combined with existing quantization techniques.
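The abstract compresses several mechanisms into a few sentences; the sketch below makes them concrete. It is not the authors' released code: `mask_2_4`, `SparseLinear`, and `distillation_loss` are illustrative names I introduce here, and the straight-through estimator is one standard way to let pruned weights keep receiving gradients so the 2:4 mask can adapt during the weight updates, in the spirit of what the abstract describes. PyTorch is assumed.

```python
import torch
import torch.nn.functional as F

def mask_2_4(weight: torch.Tensor) -> torch.Tensor:
    """2:4 semi-structured mask: keep the two largest-magnitude entries
    in each contiguous group of four of the flattened weight. Assumes
    weight.numel() is divisible by 4 (true for typical transformer shapes)."""
    groups = weight.reshape(-1, 4)
    keep = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups)
    mask.scatter_(-1, keep, 1.0)
    return mask.reshape(weight.shape)

class SparseLinear(torch.nn.Linear):
    """Linear layer that recomputes its 2:4 mask from the current weights
    on every forward pass, so the mask can change as training proceeds."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mask = mask_2_4(self.weight)
        # Straight-through estimator: the forward pass uses the masked
        # weights, but the backward pass treats masking as identity, so
        # pruned weights still receive gradients and can re-enter the mask.
        w = self.weight + (self.weight * mask - self.weight).detach()
        return F.linear(x, w, self.bias)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Standard KL-divergence distillation from a dense teacher to the
    sparse student, scaled by T^2 to keep gradient magnitudes stable."""
    log_p = F.log_softmax(student_logits / temperature, dim=-1)
    q = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(log_p, q, reduction="batchmean") * temperature ** 2
```

In a retraining loop along these lines, the dense teacher would be the original model (e.g., LLaMA2-7B) and the student a copy whose linear layers are replaced by `SparseLinear`, with the total loss combining the usual language-modeling cross-entropy and `distillation_loss`. The paper's "supplementary set of well-initialized parameters" is a separate component not reproduced in this sketch.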

Authors (5)
  1. Weiyu Huang
  2. Guohao Jian
  3. Yuezhou Hu
  4. Jun Zhu
  5. Jianfei Chen