
SparseLLM: Towards Global Pruning for Pre-trained Language Models (2402.17946v4)

Published 28 Feb 2024 in cs.CL

Abstract: The transformative impact of LLMs like LLaMA and GPT on natural language processing is countered by their prohibitive computational demands. Pruning has emerged as a pivotal compression strategy, introducing sparsity to enhance both memory and computational efficiency. Yet, traditional global pruning is impractical for LLMs due to scalability issues, while local pruning, despite its efficiency, leads to suboptimal solutions. Addressing these challenges, we propose SparseLLM, a novel framework that redefines the global pruning process into manageable, coordinated subproblems, allowing for resource-efficient optimization with global optimality. SparseLLM's approach, which conceptualizes LLMs as a chain of modular functions and leverages auxiliary variables for problem decomposition, not only facilitates a pragmatic application on LLMs but also demonstrates significant performance improvements, particularly in high-sparsity regimes where it surpasses current state-of-the-art methods.
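As a rough illustration of the auxiliary-variable decomposition described above (a minimal sketch, not the authors' algorithm), the snippet below treats a model as a chain of linear layers and alternates between (a) a local least-squares-plus-pruning subproblem per layer and (b) a pass that refreshes the auxiliary activations with the already-pruned layers so the subproblems remain coordinated. All names (`prune_with_auxiliaries`, `prune_smallest`) and the magnitude-based pruner are illustrative placeholders under these assumptions.

```python
import torch
import torch.nn as nn

def prune_smallest(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the smallest-magnitude entries (simple stand-in pruner)."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight
    thresh = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > thresh)

@torch.no_grad()
def prune_with_auxiliaries(layers, calib_x, sparsity=0.7, rounds=3):
    """Alternate between (a) solving each layer's local subproblem against its
    auxiliary input/output pair and pruning it, and (b) propagating activations
    through the pruned layers so the subproblems stay globally coordinated."""
    # Dense targets z[l]: activations of the original (unpruned) chain.
    z = [calib_x]
    for layer in layers:
        z.append(layer(z[-1]))

    a = [t.clone() for t in z]  # auxiliary inputs actually seen by pruned layers
    for _ in range(rounds):
        for l, layer in enumerate(layers):
            # (a) Local subproblem: find W with a[l] @ W.T ≈ z[l+1], then prune it.
            target = z[l + 1] - (layer.bias if layer.bias is not None else 0.0)
            sol = torch.linalg.lstsq(a[l], target).solution  # shape: (in, out)
            layer.weight.copy_(prune_smallest(sol.T, sparsity))
            # (b) The next layer's auxiliary input is this pruned layer's output.
            a[l + 1] = layer(a[l])
    return layers

# Tiny usage example on random calibration data (purely illustrative).
layers = [nn.Linear(64, 64) for _ in range(3)]
calib_x = torch.randn(256, 64)
prune_with_auxiliaries(layers, calib_x, sparsity=0.7)
```

The point of the sketch is the memory/coordination trade-off the abstract claims: each subproblem touches only one layer's weights and activations, yet the refreshed auxiliaries carry the effect of upstream pruning downstream, which purely local (layer-in-isolation) pruning does not.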

Authors (5)
  1. Guangji Bai
  2. Yijiang Li
  3. Chen Ling
  4. Kibaek Kim
  5. Liang Zhao
Citations (3)