ModuLoRA: Finetuning 2-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers (2309.16119v2)

Published 28 Sep 2023 in cs.LG and cs.AI

Abstract: We propose a memory-efficient finetuning algorithm for LLMs that supports finetuning LLMs with 65B parameters in 2/3/4-bit precision on as little as one 24GB GPU. Our method, modular low-rank adaptation (ModuLoRA), integrates any user-specified weight quantizer with finetuning via low-rank adapters (LoRAs). Our approach relies on a simple quantization-agnostic backward pass that adaptively materializes low-precision LLM weights from a custom black-box quantization module. This approach enables finetuning 2-bit and 3-bit LLMs for the first time -- leveraging state-of-the-art 2-bit QuIP# quantization and 3-bit OPTQ quantization -- outperforming finetuning that relies on less sophisticated 4-bit and 8-bit methods. In our experiments, ModuLoRA attains competitive performance on text classification, natural language inference, and instruction following tasks using significantly less memory than existing approaches, and we also surpass the state-of-the-art ROUGE score on a popular summarization task. We release ModuLoRA together with a series of low-precision models as part of LLMTools, a user-friendly library for quantizing, running, and finetuning LLMs on consumer GPUs.

References (43)
  1. Input-tuning: Adapting unfamiliar inputs to frozen pretrained models. arXiv preprint arXiv:2203.03131, 2022.
  2. Language models are few-shot learners. In Conference on Neural Information Processing Systems, 2020.
  3. Sahil Chaudhary. Code alpaca: An instruction-following llama model for code generation. https://github.com/sahil280114/codealpaca, 2023.
  4. Quip: 2-bit quantization of large language models with guarantees, 2023.
  5. Instructeval: Towards holistic evaluation of instruction-tuned large language models. arXiv preprint arXiv:2306.04757, 2023.
  6. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  7. Llm.int8(): 8-bit matrix multiplication for transformers at scale. In Conference on Neural Information Processing Systems, 2022.
  8. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.
  9. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  10. Hawq: Hessian aware quantization of neural networks with mixed-precision. In International Conference on Computer Vision, 2019.
  11. Hawq-v2: Hessian aware trace-weighted quantization of neural networks. In Conference on Neural Information Processing Systems, 2020.
  12. Optq: Accurate quantization for generative pre-trained transformers. In International Conference on Learning Representations, 2023.
  13. SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pp.  70–79, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-5409. URL https://aclanthology.org/D19-5409.
  14. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pp.  2790–2799. PMLR, 2019.
  15. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
  16. Accurate post training quantization with small calibration sets. In International Conference on Machine Learning. PMLR, 2021.
  17. Compacter: Efficient low-rank hypercomplex adapter layers. Advances in Neural Information Processing Systems, 34:1022–1035, 2021.
  18. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  3045–3059, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.243. URL https://aclanthology.org/2021.emnlp-main.243.
  19. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.  4582–4597, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.353. URL https://aclanthology.org/2021.acl-long.353.
  20. Brecq: Pushing the limit of post-training quantization by block reconstruction. In International Conference on Learning Representations, 2021.
  21. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965, 2022a.
  22. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023a.
  23. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.  61–68, 2022b.
  24. Gpt understands, too. AI Open, 2023b.
  25. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
  26. Up or down? adaptive rounding for post-training quantization. In International Conference on Machine Learning, pp.  7197–7206. PMLR, 2020.
  27. Lut-gemm: Quantized matrix multiplication based on luts for efficient inference in large-scale generative language models. arXiv preprint arXiv:2206.09557, 2023.
  28. Learning how to ask: Querying lms with mixtures of soft prompts. arXiv preprint arXiv:2104.06599, 2021.
  29. Exploring the limits of transfer learning with a unified text-to-text transformer, 2020.
  30. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp.  2001–2010, 2017.
  31. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2023.
  32. Training neural networks with fixed sparse masks. Advances in Neural Information Processing Systems, 34:24193–24205, 2021.
  33. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
  34. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  35. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  36. Quip#: Quip with lattice codebooks. https://cornell-relaxml.github.io/quip-sharp/, 2023.
  37. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp.  1112–1122. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/N18-1101.
  38. Smoothquant: Accurate and efficient post-training quantization for large language models. arXiv preprint arXiv:2211.10438, 2023.
  39. Hawq-v3: Dyadic neural network quantization. In International Conference on Machine Learning. PMLR, 2021.
  40. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. In Conference on Neural Information Processing Systems, 2022.
  41. Rptq: Reorder-based post-training quantization for large language models. arXiv preprint arXiv:2304.01089, 2023.
  42. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199, 2021.
  43. Opt: Open pre-trained transformer language models, 2022.
Authors (5)
  1. Junjie Yin (17 papers)
  2. Jiahao Dong (11 papers)
  3. Yingheng Wang (16 papers)
  4. Christopher De Sa (77 papers)
  5. Volodymyr Kuleshov (45 papers)
Citations (4)

Summary

A Formal Overview of ModuLoRA: Fine-tuning 2-Bit LLMs with Modular Quantizers

The paper "ModuLoRA: Finetuning 2-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers" presents an advanced algorithm for fine-tuning LLMs using consumer-grade hardware. The proposed approach, ModuLoRA, introduces a methodology that enables the fine-tuning of LLMs with 65 billion parameters at reduced precision (2/3/4-bits). This is achieved with minimal computational resources, exemplified by the use of a single 24GB GPU.

Methodological Innovation

ModuLoRA builds on the Low-Rank Adaptation (LoRA) technique by integrating it with modular weight quantizers. The key innovation lies in decoupling the quantization method from the adaptation process: ModuLoRA treats the quantizer as a black box, allowing any user-specified quantizer to be plugged in. This provides significant flexibility and extends LoRA fine-tuning to lower precision formats without compromising computational stability or efficiency.
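
To make the decoupling concrete, the sketch below (in PyTorch) shows the kind of minimal black-box quantizer interface that this modular design relies on. The class and method names are illustrative assumptions rather than the paper's actual API, and the naive round-to-nearest quantizer merely stands in for sophisticated quantizers such as OPTQ or QuIP#.

```python
# Minimal sketch of a modular, black-box quantizer interface (illustrative
# assumptions, not the paper's API). The toy quantizer stands in for OPTQ/QuIP#.
from abc import ABC, abstractmethod
import torch

class BlackBoxQuantizer(ABC):
    """All a ModuLoRA-style finetuner needs from a quantizer is quantize/dequantize."""

    @abstractmethod
    def quantize(self, weight: torch.Tensor):
        """Return an opaque low-precision representation of `weight`."""

    @abstractmethod
    def dequantize(self, packed) -> torch.Tensor:
        """Materialize a floating-point view of the packed weights."""

class NaiveUniformQuantizer(BlackBoxQuantizer):
    """Toy b-bit round-to-nearest quantizer, used only to make the sketch runnable."""

    def __init__(self, bits: int = 3):
        self.qmin, self.qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

    def quantize(self, weight: torch.Tensor):
        scale = weight.abs().max() / self.qmax        # one scale per matrix, for brevity
        codes = torch.clamp(torch.round(weight / scale), self.qmin, self.qmax)
        return codes.to(torch.int8), scale

    def dequantize(self, packed) -> torch.Tensor:
        codes, scale = packed
        return codes.float() * scale
```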

Technical Framework

  • Quantization-Agnostic Backward Pass: The method employs a quantization-agnostic backward pass that adaptively materializes low-precision weights from a black-box quantization module, which lets ModuLoRA plug in advanced algorithms such as QuIP# for 2-bit precision and OPTQ for 3-bit precision (a minimal sketch follows this list).
  • Efficiency on Consumer Hardware: Demonstrating the capability to finetune a 65B model on a single 24GB GPU, the paper highlights the potential for broader accessibility of LLM fine-tuning. This is crucial for democratizing model improvements and deploying solutions in environments where high-end hardware is unavailable.
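
The quantization-agnostic backward pass can be sketched as a custom autograd function that re-materializes the dequantized weight in both the forward and backward passes, so a full-precision copy is never kept resident; only the low-rank adapters receive gradients. This continues the illustrative quantizer interface above and is an assumption-laden sketch, not the paper's implementation.

```python
# Sketch of a quantization-agnostic forward/backward pass (builds on the
# BlackBoxQuantizer sketch above; illustrative, not the paper's implementation).
import torch

class QuantizedMatmul(torch.autograd.Function):
    """Computes y = x @ W^T while W is stored only in packed low-precision form."""

    @staticmethod
    def forward(ctx, x, packed_weight, quantizer):
        w = quantizer.dequantize(packed_weight)            # temporary fp copy
        ctx.packed_weight, ctx.quantizer = packed_weight, quantizer
        return x @ w.t()                                   # fp copy freed after forward

    @staticmethod
    def backward(ctx, grad_out):
        w = ctx.quantizer.dequantize(ctx.packed_weight)    # re-materialize on demand
        return grad_out @ w, None, None                    # gradient w.r.t. x only; W stays frozen

class ModuLoRAStyleLinear(torch.nn.Module):
    """Frozen quantized base weight plus trainable low-rank adapters A and B."""

    def __init__(self, in_features, out_features, packed_weight, quantizer,
                 rank: int = 8, alpha: int = 16):
        super().__init__()
        self.packed_weight, self.quantizer = packed_weight, quantizer
        self.lora_A = torch.nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = torch.nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        base = QuantizedMatmul.apply(x, self.packed_weight, self.quantizer)
        lora = (x @ self.lora_A.t()) @ self.lora_B.t()
        return base + self.scaling * lora
```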

Experimental Insights

ModuLoRA shows promising results across several benchmark tasks, exhibiting effective performance in:

  • Text Classification: Models fine-tuned with ModuLoRA achieve competitive accuracy, closely aligning with higher precision baselines while using significantly less memory.
  • Natural Language Inference: The approach surpasses existing 4-bit and 8-bit methods, maintaining high accuracy comparable to full precision models.
  • Abstractive Summarization: A state-of-the-art ROUGE score is achieved on the SAMSum dataset, underscoring the effectiveness of the approach for generating summaries.
  • Instruction Following: Noteworthy performance is demonstrated on the BIG-Bench Hard benchmark, further indicating the robustness of ModuLoRA in diverse applications.

Practical Implications

The integration of ModuLoRA into the LLMTools library provides researchers and practitioners with a user-friendly toolset facilitating the quantization, fine-tuning, and deployment of LLMs on consumer hardware. This practical implementation is instrumental in advancing open-source model development and fostering scientific progress in constrained environments.
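
To illustrate the overall workflow, the snippet below wires the sketch components from the earlier code blocks into an existing model by swapping its linear layers. This is not the LLMTools API (the released library exposes its own interface); it is only a hedged sketch of the quantize-then-attach-adapters pattern.

```python
# Illustrative wiring only (not the LLMTools API): swap each nn.Linear for the
# ModuLoRAStyleLinear sketch above so that only the LoRA parameters are trainable.
import torch

def quantize_and_add_lora(module: torch.nn.Module, quantizer, rank: int = 8):
    for name, child in list(module.named_children()):
        if isinstance(child, torch.nn.Linear):
            packed = quantizer.quantize(child.weight.data)   # bias handling omitted
            module.add_module(name, ModuLoRAStyleLinear(
                child.in_features, child.out_features, packed, quantizer, rank=rank))
        else:
            quantize_and_add_lora(child, quantizer, rank)
    return module

# Tiny stand-in model; a real run would traverse a causal LM's attention/MLP layers.
model = quantize_and_add_lora(
    torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(),
                        torch.nn.Linear(512, 512)),
    NaiveUniformQuantizer(bits=3))
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-4)
```

Since the paper releases a series of pre-quantized low-precision models alongside the library, the quantization step is typically performed once, offline, with only the adapter training happening on the consumer GPU.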

Theoretical Implications and Future Directions

This research contributes theoretically by challenging the perceived limitations of low-precision quantization in maintaining model robustness and performance. By demonstrating that competitive accuracy can be achieved with substantially reduced precision, the paper encourages further exploration into optimizing model architectures that use even fewer resources.

Future research could focus on extending ModuLoRA to larger-scale models, such as those containing trillions of parameters, exploring more advanced quantization schemes, and addressing the latency overhead introduced by materializing weights during mixed-precision computation.

In summary, the paper delineates a significant advancement in the field of efficient LLM fine-tuning, offering practical solutions for resource-constrained environments while maintaining high performance across critical NLP tasks.
