LORD: Low Rank Decomposition Of Monolingual Code LLMs For One-Shot Compression (2309.14021v1)
Abstract: Low-rank decomposition of a matrix - splitting a large matrix into a product of two smaller matrices - offers a means of compression that reduces the parameters of a model without sparsification, and hence delivers more speedup on modern hardware. Moreover, unlike quantization, the compressed linear layers remain fully differentiable and all of their parameters trainable, while still being able to leverage the existing highly efficient kernels over floating-point matrices. We study the potential to compress LLMs for monolingual code generation via Low Rank Decomposition (LoRD) and observe that the ranks of the linear layers in these models can be reduced by up to 39.58% with less than a 1% increase in perplexity. We then use LoRD to compress StarCoder 16B to 13.2B parameters with no drop, and to 12.3B with a minimal drop, in HumanEval Pass@1 score, in less than 10 minutes on a single A100. The compressed models speed up inference by up to 22.35% with just a single line of code changed over Hugging Face's implementation with the PyTorch backend. LoRD models remain compatible with state-of-the-art near-lossless quantization methods such as SpQR, which allows leveraging further compression gains from quantization. Lastly, QLoRA over a LoRD model further reduces memory requirements by as much as 21.2% over vanilla QLoRA while offering similar gains from parameter-efficient fine-tuning. Our work shows LoRD as a promising new paradigm for LLM compression.
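The core operation described in the abstract is factorizing each linear layer's weight matrix into two low-rank factors. Below is a minimal PyTorch sketch of that idea, assuming a plain truncated SVD of each `nn.Linear` weight; the names `LowRankLinear` and `compress_linears`, the rank fraction, and the way singular values are split between the two factors are illustrative assumptions, not the authors' exact LoRD procedure.

```python
# A minimal sketch of LoRD-style low-rank decomposition of linear layers in
# PyTorch. The class/function names, the rank fraction, and the use of a plain
# truncated SVD here are illustrative assumptions, not the paper's exact recipe.
import torch
import torch.nn as nn


class LowRankLinear(nn.Module):
    """Approximates a (d_out x d_in) weight by the product of two rank-r factors."""

    def __init__(self, linear: nn.Linear, r: int):
        super().__init__()
        dtype = linear.weight.dtype
        W = linear.weight.data.float()            # (d_out, d_in); do the SVD in fp32
        U, S, Vh = torch.linalg.svd(W, full_matrices=False)
        # Split the top-r singular values evenly between the two factors,
        # so W ~= up.weight @ down.weight. Both factors stay fully trainable.
        self.down = nn.Linear(linear.in_features, r, bias=False)
        self.up = nn.Linear(r, linear.out_features, bias=linear.bias is not None)
        self.down.weight.data = (torch.diag(S[:r].sqrt()) @ Vh[:r]).to(dtype)
        self.up.weight.data = (U[:, :r] @ torch.diag(S[:r].sqrt())).to(dtype)
        if linear.bias is not None:
            self.up.bias.data = linear.bias.data.clone()

    def forward(self, x):
        return self.up(self.down(x))


def compress_linears(module: nn.Module, rank_fraction: float = 0.6) -> None:
    """Recursively replace every nn.Linear with its rank-reduced factorization."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            r = max(1, int(rank_fraction * min(child.in_features, child.out_features)))
            setattr(module, name, LowRankLinear(child, r))
        else:
            compress_linears(child, rank_fraction)
```

The factorization only saves parameters when r < (d_in * d_out) / (d_in + d_out), and both factors remain ordinary floating-point linear layers, which is why they stay trainable and compatible with standard kernels and with downstream quantization or QLoRA-style fine-tuning.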
- Gkd: Generalized knowledge distillation for auto-regressive sequence models, 2023.
- Anton Bacaj. code-eval. https://github.com/abacaj/code-eval, July 2023.
- Adaptive input representations for neural language modeling. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=ByxZX20qFQ.
- Compressing pre-trained language models by matrix decomposition. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pp. 884–889, Suzhou, China, December 2020. Association for Computational Linguistics. URL https://aclanthology.org/2020.aacl-main.88.
- Project Bigcode. The stack smol, 2022. URL https://huggingface.co/datasets/bigcode/the-stack-smol.
- An updated set of basic linear algebra subprograms (blas). ACM Transactions on Mathematical Software, 28(2):135–151, 2002.
- Team Cerebras. Creating sparse gpt-3 models with iterative pruning, 11 2022. URL https://www.cerebras.net/blog/creating-sparse-gpt-3-models-with-iterative-pruning.
- Sahil Chaudhary. Code instructions dataset. https://huggingface.co/datasets/sahil2801/code_instructions_120k, Jun 2023.
- Quip: 2-bit quantization of large language models with guarantees, 2023.
- Evaluating large language models trained on code, 2021a.
- Drone: Data-aware low-rank compression for large nlp models. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp. 29321–29334. Curran Associates, Inc., 2021b. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/f56de5ef149cf0aedcc8f4797031e229-Paper.pdf.
- Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
- Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning, 2023.
- Flashattention: Fast and memory-efficient exact attention with IO-awareness. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=H4DqfPSibmx.
- The case for 4-bit precision: k-bit inference scaling laws, 2022.
- GPT3.int8(): 8-bit matrix multiplication for transformers at scale. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=dXiGWqBoxaD.
- Qlora: Efficient finetuning of quantized llms, 2023a.
- Spqr: A sparse-quantized representation for near-lossless llm weight compression, 2023b.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://aclanthology.org/N19-1423.
- Kronecker decomposition for GPT compression. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 219–226, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-short.24. URL https://aclanthology.org/2022.acl-short.24.
- Rank diminishing in deep neural networks. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=tIqzLFf3kk.
- Sparsegpt: Massive language models can be accurately pruned in one-shot, 2023.
- OPTQ: Accurate quantization for generative pre-trained transformers. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=tcbBPnfwxS.
- Knowledge distillation of large language models, 2023.
- Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016. doi: 10.1109/CVPR.2016.90.
- Distilling the knowledge in a neural network, 2015.
- Language model compression with weighted low-rank factorization. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=uPv9Y3gmAI5.
- LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
- Numerical optimizations for weighted low-rank estimation on language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 1404–1416, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.91. URL https://aclanthology.org/2022.emnlp-main.91.
- Impossible distillation: from low-quality model to high-quality dataset & model for summarization and paraphrasing, 2023.
- Squeezellm: Dense-and-sparse quantization, 2023a.
- Finequant: Unlocking efficiency with fine-grained weight-only quantization for llms, 2023b.
- The stack: 3 tb of permissively licensed source code, 2022.
- Albert: A lite bert for self-supervised learning of language representations. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=H1eA7AEtvS.
- Owq: Lessons learned from activation outliers for weight quantization in large language models, 2023.
- Norm tweaking: High-performance low-bit quantization of large language models, 2023a.
- Losparse: Structured compression of large language models based on low-rank and sparse approximation, 2023b.
- Train large, then compress: Rethinking model size for efficient training and inference of transformers. In Proceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org, 2020.
- Pruning and quantization for deep neural network acceleration: A survey. Neurocomputing, 461:370–403, 2021. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2021.07.045. URL https://www.sciencedirect.com/science/article/pii/S0925231221010894.
- Awq: Activation-aware weight quantization for llm compression and acceleration, 2023.
- Llm-pruner: On the structural pruning of large language models, 2023.
- NVIDIA Corporation. Compute unified device architecture (cuda). Website, 2007. URL https://developer.nvidia.com/cuda-toolkit. Accessed: 2023-09-17.
- Lut-gemm: Quantized matrix multiplication based on luts for efficient inference in large-scale generative language models, 2022.
- PyTorch: An Imperative Style, High-Performance Deep Learning Library. Curran Associates Inc., Red Hook, NY, USA, 2019.
- The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only, 2023.
- The impact of ai on developer productivity: Evidence from github copilot, 2023.
- Self-attention does not need O(n^2) memory, 2021.
- Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.
- Low-rank prune-and-factorize for language model compression, 2023.
- Code llama: Open foundation models for code, 2023.
- Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter, 2019.
- Omniquant: Omnidirectionally calibrated quantization for large language models, 2023.
- Noam Shazeer. Fast transformer decoding: One write-head is all you need, 2019.
- Noam Shazeer. Glu variants improve transformer, 2020.
- Pangu-coder2: Boosting large language models for code with ranking feedback, 2023.
- A simple and effective pruning approach for large language models, 2023a.
- Principle-driven self-alignment of language models from scratch with minimal human supervision, 2023b.
- KroneckerBERT: Significant compression of pre-trained language models through kronecker decomposition and knowledge distillation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2116–2127, Seattle, United States, July 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.naacl-main.154. URL https://aclanthology.org/2022.naacl-main.154.
- Alpaca: A strong, replicable instruction-following model. CRFM Stanford, March 2023. URL https://crfm.stanford.edu/2023/03/13/alpaca.html.
- Llama: Open and efficient foundation language models, 2023.
- A survey on large language model based autonomous agents, 2023a.
- How far can camels go? exploring the state of instruction tuning on open resources, 2023b.
- Outlier suppression: Pushing the limit of low-bit transformer language models. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=yW5zeRSFdZ.
- Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45, Online, October 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-demos.6. URL https://aclanthology.org/2020.emnlp-demos.6.
- Zeroquant-fp: A leap forward in llms post-training w4a8 quantization using floating-point formats, 2023.
- SmoothQuant: Accurate and efficient post-training quantization for large language models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 38087–38099. PMLR, 23–29 Jul 2023. URL https://proceedings.mlr.press/v202/xiao23c.html.
- Zeroquant-v2: Exploring post-training quantization in llms from comprehensive study to low rank compensation, 2023.
- Compressing transformers: Features are low-rank, but weights are not! Proceedings of the AAAI Conference on Artificial Intelligence, 37(9):11007–11015, Jun. 2023. doi: 10.1609/aaai.v37i9.26304. URL https://ojs.aaai.org/index.php/AAAI/article/view/26304.
- Rptq: Reorder-based post-training quantization for large language models, 2023.
- Lora-fa: Memory-efficient low-rank adaptation for large language models fine-tuning, 2023.
- A survey on model compression for large language models, 2023.
Authors: Ayush Kaushal, Tejas Vaidhya, Irina Rish