ModuLoRA: Finetuning 2-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers (2309.16119v2)

Published 28 Sep 2023 in cs.LG and cs.AI

Abstract: We propose a memory-efficient finetuning algorithm for LLMs that supports finetuning LLMs with 65B parameters in 2/3/4-bit precision on as little as one 24GB GPU. Our method, modular low-rank adaptation (ModuLoRA), integrates any user-specified weight quantizer with finetuning via low-rank adapters (LoRAs). Our approach relies on a simple quantization-agnostic backward pass that adaptively materializes low-precision LLM weights from a custom black-box quantization module. This approach enables finetuning 2-bit and 3-bit LLMs for the first time -- leveraging state-of-the-art 2-bit QuIP# quantization and 3-bit OPTQ quantization -- outperforming finetuning that relies on less sophisticated 4-bit and 8-bit methods. In our experiments, ModuLoRA attains competitive performance on text classification, natural language inference, and instruction following tasks using significantly less memory than existing approaches, and we also surpass the state-of-the-art ROUGE score on a popular summarization task. We release ModuLoRA together with a series of low-precision models as part of LLMTools, a user-friendly library for quantizing, running, and finetuning LLMs on consumer GPUs.

References (43)
  1. Input-tuning: Adapting unfamiliar inputs to frozen pretrained models. arXiv preprint arXiv:2203.03131, 2022.
  2. Language models are few-shot learners. In Conference on Neural Information Processing Systems, 2020.
  3. Sahil Chaudhary. Code alpaca: An instruction-following llama model for code generation. https://github.com/sahil280114/codealpaca, 2023.
  4. Quip: 2-bit quantization of large language models with guarantees, 2023.
  5. Instructeval: Towards holistic evaluation of instruction-tuned large language models. arXiv preprint arXiv:2306.04757, 2023.
  6. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  7. Llm.int8(): 8-bit matrix multiplication for transformers at scale. In Conference on Neural Information Processing Systems, 2022.
  8. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.
  9. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  10. Hawq: Hessian aware quantization of neural networks with mixed-precision. In International Conference on Computer Vision, 2019.
  11. Hawq-v2: Hessian aware trace-weighted quantization of neural networks. In Conference on Neural Information Processing Systems, 2020.
  12. Optq: Accurate quantization for generative pre-trained transformers. In International Conference on Learning Representations, 2023.
  13. SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization, pp.  70–79, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-5409. URL https://aclanthology.org/D19-5409.
  14. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pp.  2790–2799. PMLR, 2019.
  15. Lora: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022.
  16. Accurate post training quantization with small calibration sets. In International Conference on Machine Learning. PMLR, 2021.
  17. Compacter: Efficient low-rank hypercomplex adapter layers. Advances in Neural Information Processing Systems, 34:1022–1035, 2021.
  18. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  3045–3059, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.243. URL https://aclanthology.org/2021.emnlp-main.243.
  19. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.  4582–4597, Online, August 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.353. URL https://aclanthology.org/2021.acl-long.353.
  20. Brecq: Pushing the limit of post-training quantization by block reconstruction. In International Conference on Learning Representations, 2021.
  21. Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning. Advances in Neural Information Processing Systems, 35:1950–1965, 2022a.
  22. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023a.
  23. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp.  61–68, 2022b.
  24. Gpt understands, too. AI Open, 2023b.
  25. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft, 2022.
  26. Up or down? adaptive rounding for post-training quantization. In International Conference on Machine Learning, pp.  7197–7206. PMLR, 2020.
  27. Lut-gemm: Quantized matrix multiplication based on luts for efficient inference in large-scale generative language models. arXiv preprint arXiv:2206.09557, 2023.
  28. Learning how to ask: Querying lms with mixtures of soft prompts. arXiv preprint arXiv:2104.06599, 2021.
  29. Exploring the limits of transfer learning with a unified text-to-text transformer, 2020.
  30. icarl: Incremental classifier and representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp.  2001–2010, 2017.
  31. Bloom: A 176b-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2023.
  32. Training neural networks with fixed sparse masks. Advances in Neural Information Processing Systems, 34:24193–24205, 2021.
  33. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
  34. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  35. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  36. Quip#: Quip with lattice codebooks. https://cornell-relaxml.github.io/quip-sharp/, 2023.
  37. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp.  1112–1122. Association for Computational Linguistics, 2018. URL http://aclweb.org/anthology/N18-1101.
  38. Smoothquant: Accurate and efficient post-training quantization for large language models. arXiv preprint arXiv:2211.10438, 2023.
  39. Hawq-v3: Dyadic neural network quantization. In International Conference on Machine Learning. PMLR, 2021.
  40. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. In Conference on Neural Information Processing Systems, 2022.
  41. Rptq: Reorder-based post-training quantization for large language models. arXiv preprint arXiv:2304.01089, 2023.
  42. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. arXiv preprint arXiv:2106.10199, 2021.
  43. Opt: Open pre-trained transformer language models, 2022.
Authors (5)
  1. Junjie Yin (17 papers)
  2. Jiahao Dong (11 papers)
  3. Yingheng Wang (16 papers)
  4. Christopher De Sa (77 papers)
  5. Volodymyr Kuleshov (45 papers)
Citations (4)

Summary

A Formal Overview of ModuLoRA: Fine-tuning 2-Bit LLMs with Modular Quantizers

The paper "ModuLoRA: Finetuning 2-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers" presents an advanced algorithm for fine-tuning LLMs using consumer-grade hardware. The proposed approach, ModuLoRA, introduces a methodology that enables the fine-tuning of LLMs with 65 billion parameters at reduced precision (2/3/4-bits). This is achieved with minimal computational resources, exemplified by the use of a single 24GB GPU.

Methodological Innovation

ModuLoRA builds on the Low-Rank Adaptation (LoRA) technique by integrating it with modular weight quantizers. The key innovation lies in decoupling the quantization method from the adaptation process: ModuLoRA treats the quantizer as a black box, allowing any user-specified quantizer to be plugged in. This provides significant flexibility and extends LoRA fine-tuning to lower precision formats without compromising computational stability or efficiency.
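
To make the decoupling concrete, the sketch below (in PyTorch) shows the kind of minimal black-box quantizer interface that this modular design relies on. The class and method names are illustrative assumptions rather than the paper's actual API, and the naive round-to-nearest quantizer merely stands in for sophisticated quantizers such as OPTQ or QuIP#.

```python
# Minimal sketch of a modular, black-box quantizer interface (illustrative
# assumptions, not the paper's API). The toy quantizer stands in for OPTQ/QuIP#.
from abc import ABC, abstractmethod
import torch

class BlackBoxQuantizer(ABC):
    """All a ModuLoRA-style finetuner needs from a quantizer is quantize/dequantize."""

    @abstractmethod
    def quantize(self, weight: torch.Tensor):
        """Return an opaque low-precision representation of `weight`."""

    @abstractmethod
    def dequantize(self, packed) -> torch.Tensor:
        """Materialize a floating-point view of the packed weights."""

class NaiveUniformQuantizer(BlackBoxQuantizer):
    """Toy b-bit round-to-nearest quantizer, used only to make the sketch runnable."""

    def __init__(self, bits: int = 3):
        self.qmin, self.qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

    def quantize(self, weight: torch.Tensor):
        scale = weight.abs().max() / self.qmax        # one scale per matrix, for brevity
        codes = torch.clamp(torch.round(weight / scale), self.qmin, self.qmax)
        return codes.to(torch.int8), scale

    def dequantize(self, packed) -> torch.Tensor:
        codes, scale = packed
        return codes.float() * scale
```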

Technical Framework

  • Quantization-Agnostic Backward Pass: The method employs a quantization-agnostic backward pass that adaptively materializes low-precision weights from a black-box quantization module, which lets ModuLoRA plug in advanced algorithms such as QuIP# for 2-bit precision and OPTQ for 3-bit precision (a minimal sketch follows this list).
  • Efficiency on Consumer Hardware: Demonstrating the capability to finetune a 65B model on a single 24GB GPU, the paper highlights the potential for broader accessibility of LLM fine-tuning. This is crucial for democratizing model improvements and deploying solutions in environments where high-end hardware is unavailable.
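
The quantization-agnostic backward pass can be sketched as a custom autograd function that re-materializes the dequantized weight in both the forward and backward passes, so a full-precision copy is never kept resident; only the low-rank adapters receive gradients. This continues the illustrative quantizer interface above and is an assumption-laden sketch, not the paper's implementation.

```python
# Sketch of a quantization-agnostic forward/backward pass (builds on the
# BlackBoxQuantizer sketch above; illustrative, not the paper's implementation).
import torch

class QuantizedMatmul(torch.autograd.Function):
    """Computes y = x @ W^T while W is stored only in packed low-precision form."""

    @staticmethod
    def forward(ctx, x, packed_weight, quantizer):
        w = quantizer.dequantize(packed_weight)            # temporary fp copy
        ctx.packed_weight, ctx.quantizer = packed_weight, quantizer
        return x @ w.t()                                   # fp copy freed after forward

    @staticmethod
    def backward(ctx, grad_out):
        w = ctx.quantizer.dequantize(ctx.packed_weight)    # re-materialize on demand
        return grad_out @ w, None, None                    # gradient w.r.t. x only; W stays frozen

class ModuLoRAStyleLinear(torch.nn.Module):
    """Frozen quantized base weight plus trainable low-rank adapters A and B."""

    def __init__(self, in_features, out_features, packed_weight, quantizer,
                 rank: int = 8, alpha: int = 16):
        super().__init__()
        self.packed_weight, self.quantizer = packed_weight, quantizer
        self.lora_A = torch.nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = torch.nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        base = QuantizedMatmul.apply(x, self.packed_weight, self.quantizer)
        lora = (x @ self.lora_A.t()) @ self.lora_B.t()
        return base + self.scaling * lora
```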

Experimental Insights

ModuLoRA shows promising results across several benchmark tasks, exhibiting effective performance in:

  • Text Classification: Models fine-tuned with ModuLoRA achieve competitive accuracy, closely aligning with higher precision baselines while using significantly less memory.
  • Natural Language Inference: The approach surpasses existing 4-bit and 8-bit methods, maintaining high accuracy comparable to full precision models.
  • Abstractive Summarization: A state-of-the-art ROUGE score is achieved on the SAMSum dataset, underscoring the effectiveness of the approach for generating summaries.
  • Instruction Following: Noteworthy performance is demonstrated on the BIG-Bench Hard benchmark, further indicating the robustness of ModuLoRA in diverse applications.

Practical Implications

The integration of ModuLoRA into the LLMTools library provides researchers and practitioners with a user-friendly toolset facilitating the quantization, fine-tuning, and deployment of LLMs on consumer hardware. This practical implementation is instrumental in advancing open-source model development and fostering scientific progress in constrained environments.
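
To illustrate the overall workflow, the snippet below wires the sketch components from the earlier code blocks into an existing model by swapping its linear layers. This is not the LLMTools API (the released library exposes its own interface); it is only a hedged sketch of the quantize-then-attach-adapters pattern.

```python
# Illustrative wiring only (not the LLMTools API): swap each nn.Linear for the
# ModuLoRAStyleLinear sketch above so that only the LoRA parameters are trainable.
import torch

def quantize_and_add_lora(module: torch.nn.Module, quantizer, rank: int = 8):
    for name, child in list(module.named_children()):
        if isinstance(child, torch.nn.Linear):
            packed = quantizer.quantize(child.weight.data)   # bias handling omitted
            module.add_module(name, ModuLoRAStyleLinear(
                child.in_features, child.out_features, packed, quantizer, rank=rank))
        else:
            quantize_and_add_lora(child, quantizer, rank)
    return module

# Tiny stand-in model; a real run would traverse a causal LM's attention/MLP layers.
model = quantize_and_add_lora(
    torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.ReLU(),
                        torch.nn.Linear(512, 512)),
    NaiveUniformQuantizer(bits=3))
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=2e-4)
```

Since the paper releases a series of pre-quantized low-precision models alongside the library, the quantization step is typically performed once, offline, with only the adapter training happening on the consumer GPU.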

Theoretical Implications and Future Directions

This research contributes theoretically by challenging the perceived limitations of low-precision quantization in maintaining model robustness and performance. By demonstrating that competitive accuracy can be achieved with substantially reduced precision, the paper encourages further exploration into optimizing model architectures that use even fewer resources.

Future research could focus on extending ModuLoRA to larger-scale models, such as those containing trillions of parameters, exploring more advanced quantization schemes, and addressing the latency overhead introduced by materializing weights during mixed-precision computation.

In summary, the paper delineates a significant advancement in the field of efficient LLM fine-tuning, offering practical solutions for resource-constrained environments while maintaining high performance across critical NLP tasks.
