Compressing Large Language Models using Low Rank and Low Precision Decomposition (2405.18886v2)

Published 29 May 2024 in cs.LG, cs.AI, math.OC, and stat.ML

Abstract: The prohibitive sizes of LLMs today make it difficult to deploy them on memory-constrained edge devices. This work introduces $\rm CALDERA$ -- a new post-training LLM compression algorithm that harnesses the inherent low-rank structure of a weight matrix $\mathbf{W}$ by approximating it via a low-rank, low-precision decomposition as $\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$. Here, $\mathbf{L}$ and $\mathbf{R}$ are low-rank factors, and the entries of $\mathbf{Q}$, $\mathbf{L}$ and $\mathbf{R}$ are quantized. The model is compressed by substituting each layer with its $\mathbf{Q} + \mathbf{L}\mathbf{R}$ decomposition, and the zero-shot performance of the compressed model is evaluated. Additionally, $\mathbf{L}$ and $\mathbf{R}$ are readily amenable to low-rank adaptation, which further enhances the zero-shot performance. $\rm CALDERA$ obtains this decomposition by formulating it as an optimization problem $\min_{\mathbf{Q},\mathbf{L},\mathbf{R}}\lVert(\mathbf{Q} + \mathbf{L}\mathbf{R} - \mathbf{W})\mathbf{X}^\top\rVert_{\rm F}^2$, where $\mathbf{X}$ is the calibration data, and $\mathbf{Q}, \mathbf{L}, \mathbf{R}$ are constrained to be representable using low-precision formats. Theoretical upper bounds on the approximation error of $\rm CALDERA$ are established using a rank-constrained regression framework, and the tradeoff between compression ratio and model performance is studied by analyzing the impact of target rank and quantization bit budget. Results illustrate that compressing Llama-2 7B/13B/70B and Llama-3 8B models using $\rm CALDERA$ outperforms existing post-training LLM compression techniques in the regime of less than 2.5 bits per parameter. The implementation is available at: https://github.com/pilancilab/caldera.
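
To make the $\mathbf{W} \approx \mathbf{Q} + \mathbf{L}\mathbf{R}$ idea concrete, here is a minimal NumPy sketch of an alternating scheme that updates a quantized backbone $\mathbf{Q}$ and quantized low-rank factors $\mathbf{L}, \mathbf{R}$. This is not the authors' implementation: the names `caldera_like_decomposition` and `uniform_quantize` are made up for illustration, it uses plain round-to-nearest quantization and an unweighted truncated SVD, and it drops the calibration matrix $\mathbf{X}$ (i.e., it minimizes $\lVert\mathbf{Q} + \mathbf{L}\mathbf{R} - \mathbf{W}\rVert_{\rm F}$ rather than the data-weighted objective in the paper).

```python
import numpy as np

def uniform_quantize(mat, bits):
    """Round-to-nearest uniform quantizer (a stand-in for CALDERA's quantizers)."""
    levels = 2 ** bits
    lo, hi = mat.min(), mat.max()
    scale = (hi - lo) / (levels - 1) if hi > lo else 1.0
    return np.round((mat - lo) / scale) * scale + lo

def caldera_like_decomposition(W, rank, q_bits=2, factor_bits=4, iters=10):
    """Toy alternating scheme: W ~= Q + L @ R with all three factors quantized."""
    Q = np.zeros_like(W)
    L = np.zeros((W.shape[0], rank))
    R = np.zeros((rank, W.shape[1]))
    for _ in range(iters):
        # Update Q: quantize the residual left after the low-rank part.
        Q = uniform_quantize(W - L @ R, q_bits)
        # Update L, R: best rank-r fit to the remaining residual via truncated SVD,
        # then quantize the factors to a higher bit budget than Q.
        U, s, Vt = np.linalg.svd(W - Q, full_matrices=False)
        L = uniform_quantize(U[:, :rank] * s[:rank], factor_bits)
        R = uniform_quantize(Vt[:rank], factor_bits)
    return Q, L, R

# Example: approximate a random 256x512 "weight matrix" at target rank 64.
W = np.random.randn(256, 512)
Q, L, R = caldera_like_decomposition(W, rank=64)
rel_err = np.linalg.norm(Q + L @ R - W) / np.linalg.norm(W)
print(f"relative Frobenius error: {rel_err:.3f}")
```

In the paper, the analogous updates are weighted by the calibration data and use substantially more careful quantizers, which is what drives the reported sub-2.5-bit-per-parameter results.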
