
Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization (2405.10616v1)

Published 17 May 2024 in cs.CL and cs.LG

Abstract: In recent years, large language models (LLMs) have driven advances in natural language processing. Still, their growing scale has increased the computational burden, necessitating a balance between efficiency and performance. Low-rank compression, a promising technique, reduces non-essential parameters by decomposing weight matrices into products of two low-rank matrices. Yet, its application in LLMs has not been extensively studied. The key to low-rank compression lies in low-rank factorization and the allocation of low-rank dimensions. To address the challenges of low-rank compression in LLMs, we conduct empirical research on the low-rank characteristics of large models. We propose a low-rank compression method suitable for LLMs. This approach involves precise estimation of feature distributions through pooled covariance matrices and a Bayesian optimization strategy for allocating low-rank dimensions. Experiments on the LLaMA-2 models demonstrate that our method outperforms existing strong structured pruning and low-rank compression techniques in maintaining model performance at the same compression ratio.
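The abstract's core operation is factorizing each weight matrix with respect to the statistics of its input features: a pooled covariance matrix estimated from calibration activations defines the metric in which the low-rank approximation error is minimized, and Bayesian optimization then allocates a rank to each matrix. The sketch below illustrates only the feature-aware factorization step for a single linear layer; it is a minimal NumPy illustration under stated assumptions (an uncentered pooled second-moment estimate, a fixed target rank, illustrative function names), not the paper's exact procedure.

```python
import numpy as np

def pooled_covariance(activation_batches):
    """Accumulate a pooled (uncentered) covariance of layer inputs
    over calibration batches; each X has shape (batch, d)."""
    d = activation_batches[0].shape[1]
    cov, n = np.zeros((d, d)), 0
    for X in activation_batches:
        cov += X.T @ X
        n += X.shape[0]
    return cov / max(n, 1)

def feature_aware_factorize(W, cov, rank, eps=1e-6):
    """Factorize W (out_dim, d) into A @ B with A (out_dim, r) and B (r, d),
    minimizing the approximation error in the metric induced by the
    input-feature covariance (a data-aware truncated SVD)."""
    # Symmetric square root of the covariance and its inverse ("whitening").
    eigval, eigvec = np.linalg.eigh(cov + eps * np.eye(cov.shape[0]))
    eigval = np.clip(eigval, eps, None)
    sqrt_cov = (eigvec * np.sqrt(eigval)) @ eigvec.T
    inv_sqrt_cov = (eigvec / np.sqrt(eigval)) @ eigvec.T
    # Truncated SVD in the whitened feature space, then map back.
    U, S, Vt = np.linalg.svd(W @ sqrt_cov, full_matrices=False)
    A = U[:, :rank] * S[:rank]        # (out_dim, r)
    B = Vt[:rank] @ inv_sqrt_cov      # (r, d)
    return A, B

# Illustrative usage with random data: the compressed layer computes
# A @ (B @ x) instead of W @ x, shrinking parameters from out_dim * d
# to r * (out_dim + d) when r is small enough.
rng = np.random.default_rng(0)
W = rng.standard_normal((768, 768))
calib = [rng.standard_normal((32, 768)) for _ in range(8)]
A, B = feature_aware_factorize(W, pooled_covariance(calib), rank=128)
```

In the paper's setting, the per-matrix rank passed to such a factorization is not fixed by hand but chosen by the Bayesian optimization strategy described in the abstract.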

Authors (7)
  1. Yixin Ji (13 papers)
  2. Yang Xiang (187 papers)
  3. Juntao Li (89 papers)
  4. Wei Chen (1288 papers)
  5. Zhongyi Liu (19 papers)
  6. Kehai Chen (59 papers)
  7. Min Zhang (630 papers)
Citations (7)