Feature-based Low-Rank Compression of Large Language Models via Bayesian Optimization (2405.10616v1)
Abstract: In recent years, large language models (LLMs) have driven advances in natural language processing. Still, their growing scale has increased the computational burden, necessitating a balance between efficiency and performance. Low-rank compression, a promising technique, reduces non-essential parameters by decomposing weight matrices into products of two low-rank matrices. Yet its application to LLMs has not been extensively studied. The key to low-rank compression lies in low-rank factorization and low-rank dimension allocation. To address the challenges of low-rank compression in LLMs, we conduct empirical research on the low-rank characteristics of large models. We propose a low-rank compression method suitable for LLMs. This approach involves precise estimation of feature distributions through pooled covariance matrices and a Bayesian optimization strategy for allocating low-rank dimensions. Experiments on the LLaMA-2 models demonstrate that our method outperforms existing strong structured pruning and low-rank compression techniques in maintaining model performance at the same compression ratio.
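The abstract describes compressing each weight matrix into a product of two low-rank factors chosen with respect to feature (activation) statistics. The sketch below is a minimal, generic illustration of that idea, not the paper's exact procedure: it assumes a single linear layer, a fixed target rank r given in advance, and a plain sample covariance of calibration outputs in place of the paper's pooled covariance estimation and Bayesian-optimized rank allocation. The function name `feature_lowrank_factors` and all shapes are illustrative assumptions.

```python
# Minimal sketch of feature-based low-rank compression for one linear layer.
# Assumed (not from the paper): weight W of shape (d_out, d_in), calibration
# inputs X of shape (n, d_in), and a fixed target rank r for this layer.
import numpy as np

def feature_lowrank_factors(W: np.ndarray, X: np.ndarray, r: int):
    """Approximate W with two low-rank factors using output-feature statistics.

    Returns (A, B) with A of shape (d_out, r) and B of shape (r, d_in), so that
    W ~ A @ B and the layer X @ W.T can be replaced by (X @ B.T) @ A.T.
    """
    Y = X @ W.T                              # output features, shape (n, d_out)
    Y = Y - Y.mean(axis=0, keepdims=True)    # center before covariance estimation
    cov = (Y.T @ Y) / max(len(Y) - 1, 1)     # sample covariance of output features
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    U = eigvecs[:, -r:]                      # top-r principal directions, (d_out, r)
    A = U                                    # first low-rank factor
    B = U.T @ W                              # second low-rank factor, (r, d_in)
    return A, B

# Usage example: compress a 512x2048 layer to rank 64 with 1024 calibration samples.
rng = np.random.default_rng(0)
W = rng.standard_normal((512, 2048)).astype(np.float32)
X = rng.standard_normal((1024, 2048)).astype(np.float32)
A, B = feature_lowrank_factors(W, X, r=64)
rel_err = np.linalg.norm(X @ W.T - (X @ B.T) @ A.T) / np.linalg.norm(X @ W.T)
print(f"relative reconstruction error on calibration features: {rel_err:.3f}")
```

Replacing the dense layer with the two factors reduces the parameter count from d_out x d_in to r x (d_out + d_in), which is where the compression comes from; per the abstract, the paper's contribution lies in how the feature covariance is estimated (pooled covariance matrices) and how r is allocated across layers (Bayesian optimization), neither of which is reproduced here.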
Authors: Yixin Ji, Yang Xiang, Juntao Li, Wei Chen, Zhongyi Liu, Kehai Chen, Min Zhang