Basis Selection: Low-Rank Decomposition of Pretrained Large Language Models for Target Applications (2405.15877v1)
Abstract: LLMs significantly enhance the performance of various applications, but they are computationally intensive and energy-demanding. This makes it challenging to deploy them on devices with limited resources, such as personal computers and mobile/wearable devices, and results in substantial inference costs in resource-rich environments like cloud servers. To extend the use of LLMs, we introduce a low-rank decomposition approach that effectively compresses these models, tailored to the requirements of specific applications. We observe that LLMs pretrained on general datasets contain many redundant components not needed for particular applications. Our method focuses on identifying and removing these redundant parts, retaining only the elements necessary for the target applications. Specifically, we represent the weight matrices of LLMs as a linear combination of base components. We then prune the irrelevant bases and enhance the model with new bases beneficial for specific applications. Deep compression experiments on the Llama 2-7B and -13B models, conducted on target applications including mathematical reasoning and code generation, show that our method significantly reduces model size while maintaining accuracy comparable to state-of-the-art low-rank compression techniques.
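To make the "linear combination of bases" idea concrete, below is a minimal sketch, not the paper's exact algorithm: it treats the SVD components of a weight matrix as the bases, scores each basis by its output energy on target-application calibration data, and keeps only the top-scoring bases to form a smaller factorized layer. The abstract's step of adding new application-specific bases is omitted here. The function and variable names (`score_bases`, `compress_layer`, `calib_inputs`, `keep_ratio`) are illustrative assumptions, not from the paper.

```python
import numpy as np


def score_bases(W, calib_inputs):
    """Score each SVD basis of W by its output energy on calibration inputs."""
    U, S, Vh = np.linalg.svd(W, full_matrices=False)
    # Per-basis contribution to the layer output: W @ x = sum_i s_i * u_i * (v_i^T x).
    coeffs = S[:, None] * (Vh @ calib_inputs.T)      # shape: (rank, n_samples)
    scores = np.linalg.norm(coeffs, axis=1)          # larger = more relevant to the task
    return U, S, Vh, scores


def compress_layer(W, calib_inputs, keep_ratio=0.25):
    """Return low-rank factors (A, B) with W ~= A @ B, keeping only task-relevant bases."""
    U, S, Vh, scores = score_bases(W, calib_inputs)
    k = max(1, int(keep_ratio * len(S)))
    keep = np.argsort(scores)[::-1][:k]              # indices of retained bases
    A = U[:, keep] * S[keep]                         # (out_dim, k)
    B = Vh[keep, :]                                  # (k, in_dim)
    return A, B


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((512, 512))              # stand-in for a pretrained weight matrix
    calib_inputs = rng.standard_normal((128, 512))   # stand-in for target-application activations
    A, B = compress_layer(W, calib_inputs, keep_ratio=0.25)
    err = np.linalg.norm(W @ calib_inputs.T - A @ (B @ calib_inputs.T))
    print(A.shape, B.shape, f"reconstruction error on calibration data: {err:.2f}")
```

Storing `A` and `B` instead of `W` reduces the parameter count from `out_dim * in_dim` to `(out_dim + in_dim) * k`, which is where the compression comes from in any such basis-selection scheme.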
- Online Embedding Compression for Text Classification Using Low Rank Matrix Factorization. In AAAI, 2019.
- Program Synthesis with Large Language Models. arXiv preprint arXiv:2108.07732, 2021.
- Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374, 2021.
- GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking. In NeurIPS, 2018.
- DRONE: Data-Aware Low-Rank Compression for Large NLP Models. In NeurIPS, 2021.
- Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168, 2021.
- Exploiting Linear Structure within Convolutional Networks for Efficient Evaluation. In NeurIPS, 2014.
- Information Flow Routes: Automatically Interpreting Language Models at Scale. arXiv preprint arXiv:2403.00824, 2024.
- Matrix Computations. Johns Hopkins University Press, Baltimore and London, 1996.
- Measuring Mathematical Problem Solving with the MATH Dataset. In NeurIPS, 2021.
- Language Model Compression with Weighted Low-Rank Factorization. In ICLR, 2022.
- LoRA: Low-Rank Adaptation of Large Language Models. In ICLR, 2022.
- Speeding Up Convolutional Neural Networks with Low Rank Expansions. arXiv preprint arXiv:1405.3866, 2014.
- Compressing Pre-trained Language Models by Matrix Decomposition. In AACL-IJCNLP, 2020.
- Semi-Orthogonal Low-Rank Matrix Factorization for Deep Neural Networks. In INTERSPEECH, 2018.
- The Truth is in There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction. In ICLR, 2024.
- Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288, 2023.
- Restructuring of Deep Neural Network Acoustic Models with Singular Value Decomposition. In INTERSPEECH, 2013.
- Compressing Transformers: Features Are Low-Rank, But Weights Are Not! In AAAI, 2023.
- Accelerating Very Deep Convolutional Networks for Classification and Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10):1943–1955, 2015.