Fluctuation-based Adaptive Structured Pruning for Large Language Models (2312.11983v1)

Published 19 Dec 2023 in cs.CL and cs.AI

Abstract: Network pruning is a promising way to address the huge computing resource demands of deploying and running inference with LLMs. Being retraining-free is important for LLM pruning methods. However, almost all existing retraining-free pruning approaches for LLMs focus on unstructured pruning, which requires specific hardware support for acceleration. In this paper, we propose a novel retraining-free structured pruning framework for LLMs, named FLAP (FLuctuation-based Adaptive Structured Pruning). It is hardware-friendly, effectively reducing storage and enhancing inference speed. For effective structured pruning of LLMs, we highlight three critical elements that demand the utmost attention: formulating structured importance metrics, adaptively searching for the global compressed model structure, and implementing compensation mechanisms to mitigate performance loss. First, FLAP determines whether the output feature map is easily recoverable when a column of weights is removed, based on a fluctuation pruning metric. Then it standardizes the importance scores to adaptively determine the global compressed model structure. Finally, FLAP adds bias terms to recover the output feature maps using the baseline values. We thoroughly evaluate our approach on a variety of language benchmarks. Without any retraining, our method significantly outperforms state-of-the-art methods, including LLM-Pruner and the extension of Wanda to structured pruning. The code is released at https://github.com/CASIA-IVA-Lab/FLAP.
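The abstract names three per-layer ingredients: a fluctuation metric (how much an input channel varies around its mean on calibration data), standardized importance scores used to pick the compressed structure, and a bias term that absorbs the baseline contribution of the pruned channels. The sketch below illustrates these steps for a single nn.Linear layer in PyTorch; it is a hedged illustration under our own assumptions (the function name, the per-layer top-k selection in place of the paper's adaptive global search, and the standardization constant are ours), not the released FLAP implementation, which is available at the linked repository.

```python
import torch

def flap_prune_linear(linear: torch.nn.Linear, calib_inputs: torch.Tensor, keep_ratio: float):
    """Illustrative per-layer sketch (not the authors' code): prune input channels
    (weight columns) of one linear layer and compensate its bias.

    calib_inputs: (num_tokens, in_features) activations gathered on calibration data.
    """
    W = linear.weight.data                        # (out_features, in_features)
    mean = calib_inputs.mean(dim=0)               # baseline value of each input channel
    var = calib_inputs.var(dim=0, unbiased=True)  # fluctuation of each input channel

    # Importance of channel j: channel variance times the squared norm of the matching
    # weight column; channels that barely fluctuate around their baseline are easy to
    # compensate and therefore score low.
    importance = var * W.pow(2).sum(dim=0)

    # Standardize scores so they are comparable across layers; here we simply keep the
    # top-k channels of this layer instead of the paper's adaptive global search.
    z = (importance - importance.mean()) / (importance.std() + 1e-8)
    k = max(1, int(keep_ratio * W.shape[1]))
    keep = torch.topk(z, k).indices.sort().values
    mask = torch.zeros(W.shape[1], dtype=torch.bool, device=W.device)
    mask[keep] = True
    drop = (~mask).nonzero(as_tuple=True)[0]

    # Bias compensation: pruned channels are replaced by their baseline (mean) value,
    # folded into the bias so the expected output feature map is preserved.
    bias = linear.bias.data if linear.bias is not None else torch.zeros(
        W.shape[0], dtype=W.dtype, device=W.device)
    bias = bias + W[:, drop] @ mean[drop]

    # Build the physically smaller layer (whole columns removed: structured, hardware-friendly).
    pruned = torch.nn.Linear(k, W.shape[0], bias=True)
    pruned.weight.data = W[:, keep].clone()
    pruned.bias.data = bias
    return pruned, keep
```

In a full model, the calibration activations would be collected with forward hooks over a small calibration set, the retained channel indices would also shrink the matching rows of the paired projection, and the per-layer keep ratio would be derived from the standardized scores across all layers rather than fixed in advance.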

References (41)
  1. GPT4All: Training an Assistant-style Chatbot with Large Scale Data Distillation from GPT-3.5-Turbo. https://github.com/nomic-ai/gpt4all. Accessed: 2023-08-09.
  2. Structured pruning of deep convolutional neural networks. ACM Journal on Emerging Technologies in Computing Systems (JETC), 13(3): 1–18.
  3. PIQA: Reasoning about Physical Commonsense in Natural Language. In Thirty-Fourth AAAI Conference on Artificial Intelligence.
  4. GPT Takes the Bar Exam. arXiv:2212.14402.
  5. Language models are few-shot learners. arXiv:2005.14165.
  6. Sparks of Artificial General Intelligence: Early experiments with GPT-4. arXiv:2303.12712.
  7. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2924–2936. Minneapolis, Minnesota: Association for Computational Linguistics.
  8. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457v1.
  9. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. In Advances in Neural Information Processing Systems.
  10. SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. arXiv:2306.03078.
  11. SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. arXiv:2301.00774.
  12. GPTQ: Accurate Post-training Compression for Generative Pretrained Transformers. In International Conference on Learning Representations.
  13. A framework for few-shot language model evaluation. Version v0.0.1.
  14. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. In International Conference on Learning Representations.
  15. Learning both weights and connections for efficient neural networks. In Advances in Neural Information Processing Systems.
  16. Optimal brain surgeon and general network pruning. In IEEE International Conference on Neural Networks.
  17. Structured Pruning for Deep Convolutional Neural Networks: A survey. arXiv:2303.00566.
  18. Distilling the knowledge in a neural network. arXiv:1503.02531.
  19. LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685.
  20. Optimal brain damage. In Advances in Neural Information Processing Systems.
  21. LLM-Pruner: On the Structural Pruning of Large Language Models. arXiv:2305.11627.
  22. Pointer Sentinel Mixture Models. arXiv:1609.07843.
  23. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. In EMNLP.
  24. Accelerating sparse deep neural networks. arXiv:2104.08378.
  25. Pruning Convolutional Neural Networks for Resource Efficient Inference. In International Conference on Learning Representations.
  26. Richards, T. B. 2023. Auto-GPT: An experimental open-source attempt to make GPT-4 fully autonomous. https://github.com/Significant-Gravitas/Auto-GPT. Accessed: 2023-08-09.
  27. WinoGrande: An Adversarial Winograd Schema Challenge at Scale. arXiv:1907.10641.
  28. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv:2211.05100.
  29. UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers. arXiv:2301.13741.
  30. A Simple and Effective Pruning Approach for Large Language Models. arXiv:2306.11695.
  31. DropNet: Reducing Neural Network Complexity via Iterative Pruning. In International Conference on Machine Learning, 9356–9366. PMLR.
  32. Stanford Alpaca: An Instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca. Accessed: 2023-08-09.
  33. LLaMA: Open and Efficient Foundation Language Models. arXiv:2302.13971.
  34. Emergent Abilities of Large Language Models. In Transactions on Machine Learning Research.
  35. Welford, B. 1962. Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3): 419–420.
  36. Structured Pruning Learns Compact and Accurate Models. In Association for Computational Linguistics (ACL).
  37. SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. In International Conference on Machine Learning.
  38. Prune once for all: Sparse pre-trained language models. arXiv:2111.05754.
  39. HellaSwag: Can a Machine Really Finish Your Sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
  40. OPT: Open pre-trained transformer language models. arXiv:2205.01068.
  41. Learning N:M Fine-grained Structured Sparse Neural Networks from Scratch. arXiv:2102.04010.
Authors (5)
  1. Yongqi An (5 papers)
  2. Xu Zhao (64 papers)
  3. Tao Yu (282 papers)
  4. Ming Tang (199 papers)
  5. Jinqiao Wang (76 papers)
Citations (19)
