
Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models (2310.05015v2)

Published 8 Oct 2023 in cs.AI

Abstract: Despite the remarkable success of LLMs, the massive size poses significant deployment challenges, particularly on resource-constrained hardware. While existing LLM compression methods focus on quantization, pruning remains relatively unexplored due to the high cost of training-based approaches and data collection challenges. One-shot pruning methods, although cost-effective and data-free, have become dominant in LLM pruning, but lead to performance decline under the structured pruning setting. In this work, we introduce a new paradigm for structurally pruning LLMs, called Compresso. Our approach, through the collaboration of the proposed resource-efficient pruning algorithm and the LLM itself, learns optimal pruning decisions during the training process. Compresso addresses the challenges of expensive training costs and data collection by incorporating Low-Rank Adaptation (LoRA) into the $L_0$ regularization during the instruction tuning process. Then, we further augment the pruning algorithm by introducing a collaborative prompt that fosters collaboration between the LLM and the pruning algorithm, significantly boosting the overall performance. To this end, Compresso prunes LLaMA-7B to 5.4B, maintaining original performance and even surpassing LLaMA-7B in reading comprehension by 2.62%. Extensive experiments demonstrate that Compresso significantly outperforms one-shot pruning baselines across various sparsity ratios, achieving up to 2.21%, 11.43%, 7.04%, and 4.81% higher scores on the commonsense reasoning, reading comprehension, MMLU, and BBH benchmarks, respectively.
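
As a rough illustration of the mechanism the abstract describes, the sketch below combines its two ingredients: differentiable gates trained under an $L_0$ sparsity penalty (using the standard hard-concrete relaxation commonly used to make $L_0$ regularization trainable) and a LoRA adapter so the frozen base weights can be updated cheaply while the gates learn which structured units to drop. This is not the authors' implementation: the `HardConcreteGate` and `LoRALinear` classes, the layer sizes, the penalty weight, and the collaborative-prompt wording are all illustrative assumptions.

```python
# Minimal sketch (assumptions noted in comments), not the Compresso codebase.
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class HardConcreteGate(nn.Module):
    """L0-style stochastic gate (hard-concrete relaxation) over n prunable units."""

    def __init__(self, n, beta=2 / 3, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(n))  # learnable gate logits
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self):
        if self.training:
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        else:
            s = torch.sigmoid(self.log_alpha)
        # Stretch to (gamma, zeta) and clip to [0, 1] so exact zeros (pruned units) are reachable.
        return (s * (self.zeta - self.gamma) + self.gamma).clamp(0.0, 1.0)

    def expected_l0(self):
        # Expected number of open gates; penalizing this term drives structured sparsity.
        return torch.sigmoid(self.log_alpha - self.beta * math.log(-self.gamma / self.zeta)).sum()


class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank (LoRA) update."""

    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # base weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + F.linear(F.linear(x, self.A), self.B) * self.scale


# Hypothetical collaborative prompt prepended to instruction-tuning inputs so the
# model is told it is being pruned (the wording is an assumption, not the paper's).
COLLAB_PROMPT = ("Attention, LLM! You are undergoing structured pruning. "
                 "Focus on preserving your most important capabilities.")

# Toy joint step: mask attention heads with the gates, adapt a projection with LoRA,
# and add a sparsity penalty on the expected L0 norm of the gates.
d_model, n_heads = 64, 8
gate = HardConcreteGate(n_heads)
proj = LoRALinear(nn.Linear(d_model, d_model))
x = torch.randn(4, n_heads, d_model // n_heads)     # (batch, heads, head_dim)
z = gate()                                          # one mask value per head
h = (x * z[None, :, None]).reshape(4, d_model)      # prune whole heads, not single weights
loss = proj(h).pow(2).mean() + 1e-2 * gate.expected_l0()
loss.backward()                                     # gates and LoRA adapters train jointly
```

In the actual method such gates would be attached to structured units of LLaMA-7B during instruction tuning rather than to a toy projection; the single head-level gate above only illustrates the mechanics of learning pruning decisions jointly with LoRA under an $L_0$ penalty.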

Authors (4)
  1. Song Guo (138 papers)
  2. Jiahang Xu (14 papers)
  3. Li Lyna Zhang (20 papers)
  4. Mao Yang (62 papers)
Citations (10)