
LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models (2310.08659v4)

Published 12 Oct 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Quantization is an indispensable technique for serving LLMs and has recently found its way into LoRA fine-tuning. In this work we focus on the scenario where quantization and LoRA fine-tuning are applied together on a pre-trained model. In such cases it is common to observe a consistent gap in the performance on downstream tasks between full fine-tuning and quantization plus LoRA fine-tuning approach. In response, we propose LoftQ (LoRA-Fine-Tuning-aware Quantization), a novel quantization framework that simultaneously quantizes an LLM and finds a proper low-rank initialization for LoRA fine-tuning. Such an initialization alleviates the discrepancy between the quantized and full-precision model and significantly improves generalization in downstream tasks. We evaluate our method on natural language understanding, question answering, summarization, and natural language generation tasks. Experiments show that our method is highly effective and outperforms existing quantization methods, especially in the challenging 2-bit and 2/4-bit mixed precision regimes. The code is available on https://github.com/yxli2123/LoftQ.

Introduction to LoftQ

Quantization is a vital step in deploying LLMs, compressing them so they fit resource-constrained environments while preserving as much performance as possible. This paper addresses the challenge of combining quantization with Low-Rank Adaptation (LoRA) fine-tuning and introduces LoftQ, a quantization framework designed specifically for models that will subsequently be fine-tuned with LoRA.

Problem with Current Quantization Practices

Quantization significantly reduces the size of LLMs by converting high-precision weights into compact low-bit formats. When paired with LoRA fine-tuning, however, the conventional pipeline simply quantizes the pre-trained weights and attaches adapters whose initial contribution is zero, ignoring the error that quantization introduces. Fine-tuning therefore starts from a point that already deviates from the original model, which leads to performance gaps on downstream tasks; a sketch of this setup follows below. Previous methods such as QLoRA struggle particularly under stringent conditions like the 2-bit regime, where model performance declines noticeably.
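For concreteness, here is a minimal sketch of that standard pipeline. It is not the authors' code: the names (`uniform_quantize`, `W`, `A`, `B`) are illustrative, a toy simulated uniform quantizer stands in for QLoRA's NF4/NF2 data types, and a random matrix stands in for a real pre-trained weight.

```python
import torch

def uniform_quantize(w: torch.Tensor, bits: int = 2) -> torch.Tensor:
    # Toy simulated symmetric uniform quantizer (a stand-in for QLoRA's NF4/NF2 formats):
    # snap each weight to one of 2**bits evenly spaced levels, then map back to floats.
    qmax = 2 ** (bits - 1) - 1              # 1 for 2-bit, 7 for 4-bit
    scale = w.abs().max() / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

# QLoRA-style setup: the frozen backbone is the quantized weight Q, and the
# LoRA adapters contribute nothing at step 0 (A small random, B zero).
torch.manual_seed(0)
d, rank = 512, 16
W = torch.randn(d, d)                       # toy stand-in for a pre-trained weight matrix
Q = uniform_quantize(W, bits=2)             # frozen 2-bit backbone
A = torch.randn(d, rank) * 0.01             # LoRA factor A (small random init)
B = torch.zeros(rank, d)                    # LoRA factor B (zero init), so A @ B == 0

# Fine-tuning therefore starts from Q rather than W; this discrepancy is never corrected.
print(f"||W - (Q + A @ B)||_F = {torch.linalg.norm(W - (Q + A @ B)).item():.3f}")
```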

Introducing LoftQ: A New Approach

LoftQ tackles these low-precision challenges by integrating low-rank approximation into the quantization process itself: it jointly optimizes the quantized weights and the low-rank adapter initialization so that, together, they approximate the original pre-trained weights. This narrows the gap between the quantized starting point and the full-precision model and yields better generalization in downstream applications.
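Concretely, this joint refinement can be viewed as alternating minimization of the Frobenius error between W and Q + AB: quantize the part of W not explained by the current low-rank factors, then take a rank-r SVD of what the new quantized weight misses. The sketch below illustrates that idea under the same toy assumptions as the previous snippet (it reuses the illustrative `uniform_quantize` and `W` defined there); the released implementation at https://github.com/yxli2123/LoftQ applies this per weight matrix with the paper's actual low-bit quantizers.

```python
def loftq_init(W: torch.Tensor, rank: int = 16, bits: int = 2, steps: int = 5):
    """Alternating minimization of ||W - Q - A @ B||_F (a sketch of LoftQ's core idea).

    Returns a quantized backbone Q plus low-rank factors (A, B) to use as the
    LoRA initialization, replacing the usual zero-contribution adapters.
    """
    A = torch.zeros(W.shape[0], rank)
    B = torch.zeros(rank, W.shape[1])
    for _ in range(steps):
        # 1) Quantize the part of W not yet captured by the low-rank factors.
        Q = uniform_quantize(W - A @ B, bits=bits)
        # 2) The best rank-r approximation of the leftover residual (via SVD)
        #    becomes the new adapter initialization.
        U, S, Vh = torch.linalg.svd(W - Q, full_matrices=False)
        A = U[:, :rank] * S[:rank].sqrt()
        B = S[:rank].sqrt().unsqueeze(1) * Vh[:rank, :]
    return Q, A, B

Q, A, B = loftq_init(W, rank=16, bits=2, steps=5)
print(f"LoftQ init: ||W - (Q + A @ B)||_F = {torch.linalg.norm(W - (Q + A @ B)).item():.3f}")
```

Compared with the QLoRA-style setup above, the resulting Q + A @ B starts much closer to W, which is exactly the discrepancy LoftQ is designed to remove before fine-tuning begins.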

Empirical Validation and Results

To substantiate the efficacy of LoftQ, the researchers conducted extensive experiments across a diverse range of language tasks, including natural language understanding, question answering, summarization, and natural language generation. LoftQ consistently surpassed existing quantized fine-tuning methods, with the largest gains in the 2-bit and mixed 2/4-bit precision regimes, demonstrating its robustness for task-specific adaptation at the low-bit end of the spectrum. The results are particularly promising in that the quantized models sometimes matched or even exceeded full-precision baselines.

In benchmarks on models such as DeBERTaV3-base, BART-large, and the LLAMA-2 series, LoftQ converged to acceptable performance levels in low-bit settings where its counterpart, QLoRA, could not.

Conclusion and Implications

LoftQ offers a compelling solution to a difficult problem: quantizing LLMs for resource-constrained deployment without notable losses in performance. Its ability to support fine-tuning in low-bit regimes without severe degradation sets a new standard for LLM quantization frameworks. As demand for deploying LLMs across diverse computational environments continues to rise, LoftQ could play a crucial role in democratizing access to advanced LLMs.

Authors (7)
  1. Yixiao Li
  2. Yifan Yu
  3. Chen Liang
  4. Pengcheng He
  5. Nikos Karampatziakis
  6. Weizhu Chen
  7. Tuo Zhao