SVD-LLM: Truncation-aware Singular Value Decomposition for Large Language Model Compression (2403.07378v5)

Published 12 Mar 2024 in cs.CL and cs.LG

Abstract: The advancements in LLMs have been hindered by their substantial sizes, which necessitate LLM compression methods for practical deployment. Singular Value Decomposition (SVD) offers a promising solution for LLM compression. However, state-of-the-art SVD-based LLM compression methods have two key limitations: truncating smaller singular values may lead to higher compression loss, and the compressed weights are not updated after SVD truncation. In this work, we propose SVD-LLM, an SVD-based post-training LLM compression method that addresses the limitations of existing methods. SVD-LLM incorporates a truncation-aware data whitening technique to ensure a direct mapping between singular values and compression loss. Moreover, SVD-LLM adopts a parameter update with sequential low-rank approximation to compensate for the accuracy degradation after SVD compression. We evaluate SVD-LLM on 10 datasets and seven models from three different LLM families at three different scales. Our results demonstrate the superiority of SVD-LLM over state-of-the-art methods, especially at high model compression ratios. Our code is available at https://github.com/AIoT-MLSys-Lab/SVD-LLM

References (31)
  1. A large annotated corpus for learning natural language inference. In Màrquez, L., Callison-Burch, C., and Su, J. (eds.), Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp.  632–642, Lisbon, Portugal, September 2015. Association for Computational Linguistics. doi: 10.18653/v1/D15-1075. URL https://aclanthology.org/D15-1075.
  2. Language models are few-shot learners, 2020.
  3. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
  4. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457v1, 2018.
  5. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), 2005. URL https://aclanthology.org/I05-5002.
  6. SparseGPT: Massive language models can be accurately pruned in one-shot. arXiv preprint arXiv:2301.00774, 2023.
  7. GPTQ: Accurate post-training compression for generative pretrained transformers. arXiv preprint arXiv:2210.17323, 2022.
  8. A generalization of the Eckart-Young-Mirsky matrix approximation theorem. Linear Algebra and its Applications, 88-89:317–327, 1987. ISSN 0024-3795. doi: https://doi.org/10.1016/0024-3795(87)90114-5. URL https://www.sciencedirect.com/science/article/pii/0024379587901145.
  9. A survey of generative ai applications, 2023.
  10. Knowledge distillation of large language models, 2023.
  11. Distilling step-by-step! outperforming larger language models with less training data and smaller model sizes. arXiv preprint arXiv:2305.02301, 2023.
  12. Language model compression with weighted low-rank factorization, 2022.
  13. LoRA: Low-rank adaptation of large language models, 2021.
  14. Mistral 7B, 2023.
  15. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv, 2023.
  16. LLM-Pruner: On the structural pruning of large language models. In Advances in Neural Information Processing Systems, 2023.
  17. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993. URL https://aclanthology.org/J93-2004.
  18. Pointer sentinel mixture models, 2016.
  19. Meyer, C. D. Matrix analysis and applied linear algebra, volume 188. SIAM, 2023.
  20. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21(1), January 2020. ISSN 1532-4435.
  21. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, pp.  1631–1642. ACL, 2013.
  22. LLaMA: Open and efficient foundation language models, 2023a.
  23. Llama 2: Open foundation and fine-tuned chat models, 2023b.
  24. Efficient large language models: A survey, 2023.
  25. Iot in the era of generative ai: Vision and challenges, 2024.
  26. Neural network acceptability judgments. Trans. Assoc. Comput. Linguistics, 7:625–641, 2019.
  27. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL-HLT, pp.  1112–1122. Association for Computational Linguistics, 2018.
  28. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning, 2023.
  29. ASVD: Activation-aware singular value decomposition for compressing large language models, 2023.
  30. A survey of large language models, 2023.
  31. A survey on model compression for large language models, 2023.

Summary

  • The paper presents SVD-LLM, which leverages truncation-aware SVD with data whitening to accurately discard singular values while managing compression loss.
  • It updates the left singular vectors layer-wise post-truncation to maintain performance even with high compression ratios.
  • Experiments show SVD-LLM reduces perplexity by up to 99% relative to vanilla SVD and compresses models up to ten times faster than prior SVD-based methods.

SVD-LLM: Truncation-aware Singular Value Decomposition for LLM Compression

The paper addresses the pressing need for efficient compression techniques tailored for LLMs, which, despite their impressive capabilities, present significant deployment challenges due to their size and computational demands. Singular Value Decomposition (SVD)-based methods emerge as a promising approach to these challenges, offering lightweight alternatives to more resource-intensive methodologies like quantization or pruning.

Methodology and Contributions

The authors propose a novel compression method termed SVD-LLM, which focuses on resolving existing limitations in SVD-based LLM compression methods, particularly ASVD and FWSVD. These existing methods either inadequately address the relationship between singular value magnitudes and compression loss or neglect the importance of parameter updates post-truncation. The significant innovations presented in SVD-LLM include:

  1. Truncation-Aware Data Whitening:
    • This step whitens the input activations using the Cholesky factor of their Gram matrix, making the whitened input channels orthogonal. With orthogonal inputs, it becomes straightforward to determine which singular values can be discarded with minimal impact on the layer output, since each singular value corresponds directly to a quantifiable portion of the compression loss.
  2. Layer-Wise Closed-Form Model Parameter Update:
    • After truncating singular values, the authors update only the left singular vectors in closed form, so the result both respects the low-rank approximation of the original matrix and compensates for the accuracy drop typically seen at high compression ratios. Working layer by layer keeps the update computationally manageable and avoids whole-model fine-tuning. A minimal sketch of both steps follows this list.
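
A minimal NumPy sketch of both steps on a single linear layer is given below, under stated assumptions: W is the layer weight (out_features x in_features), X is a matrix of calibration activations (in_features x tokens), and the names whiten_truncate and refit_left_factor are illustrative rather than the authors' API. The refit step is a generic closed-form least-squares update of the left factor, a stand-in for the paper's layer-wise parameter update rather than a verbatim reproduction.

```python
import numpy as np

def whiten_truncate(W, X, k, eps=1e-6):
    """Truncation-aware whitening sketch (illustrative, not the official code).

    W : (out, in) weight matrix of one linear layer
    X : (in, tokens) calibration activations fed to that layer
    k : target rank after truncation
    Returns factors A (out, k) and B (k, in) with W approximately A @ B.
    """
    gram = X @ X.T + eps * np.eye(X.shape[0])   # input Gram matrix, jittered for stability
    S = np.linalg.cholesky(gram)                # lower-triangular whitening factor
    U, sigma, Vt = np.linalg.svd(W @ S, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], sigma[:k], Vt[:k, :]
    A = U_k * s_k                               # absorb the kept singular values
    B = np.linalg.solve(S.T, Vt_k.T).T          # equals Vt_k @ inv(S): undo the whitening
    return A, B

def refit_left_factor(W, X, B):
    """Closed-form least-squares refit of the left factor after truncation
    (a stand-in for the paper's layer-wise update): choose A minimizing
    ||W @ X - A @ (B @ X)||_F."""
    Z = B @ X                                   # (k, tokens) projected calibration inputs
    return (W @ X) @ Z.T @ np.linalg.pinv(Z @ Z.T)

# Toy usage: compress a random 64 x 64 layer to rank 16 with 256 calibration tokens.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
X = rng.standard_normal((64, 256))
A, B = whiten_truncate(W, X, k=16)
A = refit_left_factor(W, X, B)
err = np.linalg.norm(W @ X - A @ B @ X) / np.linalg.norm(W @ X)
print(f"relative output error at rank 16: {err:.3f}")
```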

Experimental Evaluation

SVD-LLM's effectiveness is demonstrated extensively on a series of benchmarks spanning models from the LLaMA, OPT, and Mistral families. The experiments cover compression ratios ranging from 20% to 60%, with SVD-LLM consistently surpassing the baseline methods (SVD, FWSVD, ASVD). The highlights include the following; a back-of-the-envelope view of what these compression ratios imply for per-layer rank appears after the list:

  • Perplexity Reduction: Up to a 99% reduction in perplexity relative to vanilla SVD at higher compression ratios, indicating substantial performance retention.
  • Computational Efficiency: The compression process is roughly an order of magnitude faster than prior approaches such as ASVD; in particular, SVD-LLM compresses LLaMA-7B in about 15 minutes versus ASVD's 5.5 hours.
  • Scalability: SVD-LLM shows prowess not only on smaller 7B models but also on larger scales, including 13B, 30B, and 65B variants, demonstrating broad applicability across model sizes.
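
To make these compression ratios concrete, the following back-of-the-envelope calculation (illustrative only, not taken from the paper) shows how a target ratio maps to a per-layer rank: a rank-k factorization of an m x n matrix stores k(m+n) parameters instead of mn, so removing a fraction r of the parameters requires k <= (1 - r) * mn / (m + n).

```python
def rank_for_compression_ratio(m: int, n: int, ratio: float) -> int:
    """Largest rank k whose factors (k*(m+n) parameters) keep at most a
    (1 - ratio) fraction of the original m*n parameters. Illustrative only."""
    k = int((1.0 - ratio) * m * n) // (m + n)
    return max(k, 1)

# e.g. a 4096 x 4096 projection at a 20% compression ratio keeps rank ~1638,
# while a 60% ratio drops it to rank ~819.
print(rank_for_compression_ratio(4096, 4096, 0.20))  # 1638
print(rank_for_compression_ratio(4096, 4096, 0.60))  # 819
```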

Implications and Future Directions

The practical implications of SVD-LLM are clear: it offers a scalable, efficient solution for LLM compression that makes it significantly easier to deploy these models in resource-constrained environments. Such compression can democratize access to AI capabilities by enabling LLM deployment on edge devices or personal computers without prohibitive computational costs.

From a theoretical perspective, SVD-LLM provides insights into how structured matrix decompositions can be leveraged beyond traditional linear algebra applications, opening new avenues in model optimization and compression techniques.

Looking ahead, SVD-LLM could serve as a foundation for refining additional model compression techniques, possibly combining with existing quantization or pruning strategies to push performance and efficiency further. Investigating such hybrid methodologies is fertile ground for future research, especially in contexts where both speed and precision are paramount.

Overall, SVD-LLM makes significant strides in LLM compression, setting a new direction for research and application in this area.
