
LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models (2310.08659v4)

Published 12 Oct 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Quantization is an indispensable technique for serving LLMs and has recently found its way into LoRA fine-tuning. In this work we focus on the scenario where quantization and LoRA fine-tuning are applied together on a pre-trained model. In such cases it is common to observe a consistent gap in the performance on downstream tasks between full fine-tuning and quantization plus LoRA fine-tuning approach. In response, we propose LoftQ (LoRA-Fine-Tuning-aware Quantization), a novel quantization framework that simultaneously quantizes an LLM and finds a proper low-rank initialization for LoRA fine-tuning. Such an initialization alleviates the discrepancy between the quantized and full-precision model and significantly improves generalization in downstream tasks. We evaluate our method on natural language understanding, question answering, summarization, and natural language generation tasks. Experiments show that our method is highly effective and outperforms existing quantization methods, especially in the challenging 2-bit and 2/4-bit mixed precision regimes. The code is available on https://github.com/yxli2123/LoftQ.

Introduction to LoftQ

Quantization is a vital step in deploying LLMs, compressing them so they fit resource-constrained environments while preserving as much performance as possible. This paper addresses the challenge of combining quantization with Low-Rank Adaptation (LoRA) fine-tuning and introduces LoftQ, a quantization framework designed specifically for models that will subsequently be fine-tuned with LoRA.

Problem with Current Quantization Practices

Quantization significantly reduces the size of LLMs by converting high-precision weights into compact low-bit formats. When paired with LoRA fine-tuning, however, the conventional pipeline simply quantizes the pre-trained weights and attaches adapters whose initial contribution is zero, ignoring the error that quantization introduces. Fine-tuning therefore starts from a point that already deviates from the original model, which leads to performance gaps on downstream tasks; a sketch of this setup follows below. Previous methods such as QLoRA struggle particularly under stringent conditions like the 2-bit regime, where model performance declines noticeably.
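For concreteness, here is a minimal sketch of that standard pipeline. It is not the authors' code: the names (`uniform_quantize`, `W`, `A`, `B`) are illustrative, a toy simulated uniform quantizer stands in for QLoRA's NF4/NF2 data types, and a random matrix stands in for a real pre-trained weight.

```python
import torch

def uniform_quantize(w: torch.Tensor, bits: int = 2) -> torch.Tensor:
    # Toy simulated symmetric uniform quantizer (a stand-in for QLoRA's NF4/NF2 formats):
    # snap each weight to one of 2**bits evenly spaced levels, then map back to floats.
    qmax = 2 ** (bits - 1) - 1              # 1 for 2-bit, 7 for 4-bit
    scale = w.abs().max() / qmax
    return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

# QLoRA-style setup: the frozen backbone is the quantized weight Q, and the
# LoRA adapters contribute nothing at step 0 (A small random, B zero).
torch.manual_seed(0)
d, rank = 512, 16
W = torch.randn(d, d)                       # toy stand-in for a pre-trained weight matrix
Q = uniform_quantize(W, bits=2)             # frozen 2-bit backbone
A = torch.randn(d, rank) * 0.01             # LoRA factor A (small random init)
B = torch.zeros(rank, d)                    # LoRA factor B (zero init), so A @ B == 0

# Fine-tuning therefore starts from Q rather than W; this discrepancy is never corrected.
print(f"||W - (Q + A @ B)||_F = {torch.linalg.norm(W - (Q + A @ B)).item():.3f}")
```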

Introducing LoftQ: A New Approach

LoftQ tackles these low-precision challenges by integrating low-rank approximation into the quantization process itself: it jointly optimizes the quantized weights and the low-rank adapter initialization so that, together, they approximate the original pre-trained weights. This narrows the gap between the quantized starting point and the full-precision model and yields better generalization in downstream applications.
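Concretely, this joint refinement can be viewed as alternating minimization of the Frobenius error between W and Q + AB: quantize the part of W not explained by the current low-rank factors, then take a rank-r SVD of what the new quantized weight misses. The sketch below illustrates that idea under the same toy assumptions as the previous snippet (it reuses the illustrative `uniform_quantize` and `W` defined there); the released implementation at https://github.com/yxli2123/LoftQ applies this per weight matrix with the paper's actual low-bit quantizers.

```python
def loftq_init(W: torch.Tensor, rank: int = 16, bits: int = 2, steps: int = 5):
    """Alternating minimization of ||W - Q - A @ B||_F (a sketch of LoftQ's core idea).

    Returns a quantized backbone Q plus low-rank factors (A, B) to use as the
    LoRA initialization, replacing the usual zero-contribution adapters.
    """
    A = torch.zeros(W.shape[0], rank)
    B = torch.zeros(rank, W.shape[1])
    for _ in range(steps):
        # 1) Quantize the part of W not yet captured by the low-rank factors.
        Q = uniform_quantize(W - A @ B, bits=bits)
        # 2) The best rank-r approximation of the leftover residual (via SVD)
        #    becomes the new adapter initialization.
        U, S, Vh = torch.linalg.svd(W - Q, full_matrices=False)
        A = U[:, :rank] * S[:rank].sqrt()
        B = S[:rank].sqrt().unsqueeze(1) * Vh[:rank, :]
    return Q, A, B

Q, A, B = loftq_init(W, rank=16, bits=2, steps=5)
print(f"LoftQ init: ||W - (Q + A @ B)||_F = {torch.linalg.norm(W - (Q + A @ B)).item():.3f}")
```

Compared with the QLoRA-style setup above, the resulting Q + A @ B starts much closer to W, which is exactly the discrepancy LoftQ is designed to remove before fine-tuning begins.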

Empirical Validation and Results

To substantiate the efficacy of LoftQ, the researchers conducted extensive experiments across a diverse range of language tasks, including natural language understanding, question answering, summarization, and natural language generation. LoftQ consistently surpassed existing quantized fine-tuning methods, with the largest gains in the 2-bit and mixed 2/4-bit precision regimes, demonstrating its robustness for task-specific adaptation at the low-bit end of the spectrum. The results are particularly promising in that the quantized models sometimes matched or even exceeded full-precision baselines.

In benchmarks on models such as DeBERTaV3-base, BART-large, and the LLAMA-2 series, LoftQ converged to acceptable performance levels in low-bit settings where its counterpart, QLoRA, could not.

Conclusion and Implications

LoftQ offers a compelling solution to a difficult problem: quantizing LLMs for resource-constrained deployment without notable losses in performance. Its ability to support fine-tuning in low-bit regimes without severe degradation sets a new standard for LLM quantization frameworks. As demand for deploying LLMs across diverse computational environments continues to rise, LoftQ could play a crucial role in democratizing access to advanced LLMs.

Authors (7)
  1. Yixiao Li
  2. Yifan Yu
  3. Chen Liang
  4. Pengcheng He
  5. Nikos Karampatziakis
  6. Weizhu Chen
  7. Tuo Zhao