
QLoRA: Efficient Finetuning of Quantized LLMs (2305.14314v1)

Published 23 May 2023 in cs.LG

Abstract: We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained LLM into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights (b) double quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage memory spikes. We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.

Citations (1,825)

Summary

  • The paper introduces QLoRA, a method that finetunes LLMs using 4-bit quantization and LoRA adapters to reduce GPU memory requirements.
  • It employs innovations like NormalFloat 4-bit, double quantization, and paged optimizers to manage memory spikes and maintain training speed.
  • The approach achieves comparable results to 16-bit finetuning on benchmarks, making large-scale model tuning accessible to resource-limited researchers.


Introduction

The paper "QLoRA: Efficient Finetuning of Quantized LLMs" presents QLoRA, a finetuning technique for LLMs that significantly reduces the memory footprint of training. Unlike traditional 16-bit finetuning, which requires extensive GPU memory, QLoRA can finetune a 65 billion parameter model on a single 48GB GPU. It keeps the pretrained LLM frozen in 4-bit quantized form and backpropagates gradients through it into Low-Rank Adapters (LoRA).

Methodology

QLoRA introduces several innovations to optimize the finetuning process:

  1. 4-bit NormalFloat (NF4): A new data type optimized for normally distributed weights, facilitating efficient quantization without performance loss.
  2. Double Quantization: This technique involves quantizing the quantization constants themselves, leading to a considerable reduction in memory usage.
  3. Paged Optimizers: Designed to handle memory spikes by using paged memory, thus preventing out-of-memory errors during gradient checkpointing.
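The idea behind the NF4 data type (item 1 above) can be sketched in plain Python: build a 16-level codebook from equal-probability-mass quantiles of a standard normal, then quantize each weight block against it after absmax scaling. Note this is a simplified illustration; the paper's exact NF4 construction is asymmetric so that zero is represented exactly, and the `tail` offset below is an illustrative choice, not the paper's value.

```python
from statistics import NormalDist

def nf4_levels(k: int = 4, tail: float = 0.02) -> list[float]:
    # NF4-style codebook sketch: equal-probability-mass quantiles of
    # N(0, 1), rescaled into [-1, 1]. The real NF4 codebook additionally
    # guarantees an exact zero level; this simplified version does not.
    nd = NormalDist()
    n = 2 ** k  # 16 levels for 4 bits
    probs = [tail + (1 - 2 * tail) * i / (n - 1) for i in range(n)]
    quantiles = [nd.inv_cdf(p) for p in probs]
    m = max(abs(q) for q in quantiles)
    return [q / m for q in quantiles]

def quantize_block(weights: list[float], levels: list[float]):
    # Absmax-scale the block into [-1, 1], then snap each value to the
    # nearest codebook level. Returns (4-bit indices, scale constant).
    scale = max(abs(w) for w in weights) or 1.0
    idx = [min(range(len(levels)), key=lambda j: abs(w / scale - levels[j]))
           for w in weights]
    return idx, scale

def dequantize_block(idx: list[int], scale: float, levels: list[float]):
    # Reconstruct approximate weights from indices and the scale constant.
    return [levels[j] * scale for j in idx]
```

Because the levels carry equal probability mass under a normal distribution, they spend more resolution near zero, where normally distributed weights concentrate.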

These innovations collectively allow QLoRA to perform finetuning with performance matching that of full 16-bit finetuned models.

Figure 1: Different finetuning methods and their memory requirements. QLoRA improves over LoRA by quantizing the transformer model to 4-bit precision and using paged optimizers to handle memory spikes.
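The memory saved by double quantization (item 2 above) can be estimated with back-of-the-envelope arithmetic, using the block sizes reported in the paper: one quantization constant per 64 weights, with the constants themselves re-quantized to 8 bits in groups of 256.

```python
def quant_overhead_bits(const_bits: float, block_size: int) -> float:
    # Bits of quantization-constant overhead per model parameter.
    return const_bits / block_size

# Single quantization: one fp32 absmax constant per 64-weight block.
single = quant_overhead_bits(32, 64)  # 0.5 bits per parameter

# Double quantization: constants stored in 8 bits, with one fp32 constant
# per group of 256 quantization constants (block sizes from the paper).
double = quant_overhead_bits(8, 64) + quant_overhead_bits(32, 64 * 256)

saved_bits = single - double                 # ~0.37 bits per parameter
saved_gb_65b = 65e9 * saved_bits / 8 / 1e9   # ~3 GB for a 65B model
```

This matches the paper's reported saving of roughly 0.37 bits per parameter, or about 3 GB for a 65B model.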

Implementation Details

QLoRA can be operationalized by integrating with existing neural network frameworks. Specific implementation steps include:

  • Quantization: Convert model parameters to the NF4 data type through quantile approximation and scaling to match normal distribution characteristics.
  • Double Quantization: Apply a secondary quantization to the scale factors from the first quantization to manage memory footprint.
  • LoRA Adapter Integration: Introduce LoRA with adequate rank for each layer to ensure the full expressive power of the model is retained while reducing the trainable parameter count.
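The adapter integration above can be sketched as a frozen linear layer plus a trainable low-rank update. This pure-Python sketch keeps everything in floats for clarity; in QLoRA the base weight would be stored in NF4 and dequantized on the fly, with the adapter matrices kept in 16-bit. The class and helper names are illustrative, not from any particular library.

```python
import random

def matmul(A, B):
    # Naive matrix multiply for small illustrative matrices.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

class LoRALinear:
    # Frozen base weight W (d_out x d_in) plus a trainable low-rank update
    # (alpha / r) * B @ A, with A: r x d_in and B: d_out x r.
    def __init__(self, W, r, alpha=16.0):
        d_out, d_in = len(W), len(W[0])
        self.W = W  # frozen: never receives gradient updates
        self.A = [[random.gauss(0.0, 0.01) for _ in range(d_in)]
                  for _ in range(r)]                 # small random init
        self.B = [[0.0] * r for _ in range(d_out)]   # zero init: update starts at 0
        self.scale = alpha / r
        # Trainable parameters: r * (d_in + d_out), versus d_in * d_out
        # for full finetuning of this layer.

    def forward(self, x):  # x: d_in x batch of column vectors
        base = matmul(self.W, x)
        delta = matmul(self.B, matmul(self.A, x))
        return [[b + self.scale * d for b, d in zip(brow, drow)]
                for brow, drow in zip(base, delta)]
```

Because B is zero-initialized, the adapted layer initially reproduces the frozen model exactly; for a hypothetical 4096x4096 projection with r = 16, the adapter trains about 131K parameters instead of 16.8M.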

This approach not only reduces the memory requirements but also maintains high training speeds, which is crucial for practical deployment.

Figure 2: LoRA r for LLaMA 7B models finetuned on Alpaca. Each dot represents a combination of hyperparameters; for each LoRA r we run 3 random seeds with each hyperparameter combination. The performance of specific LoRA r values appears to be independent of other hyperparameters.

Performance Analysis

QLoRA has been validated across more than 1,000 finetuned models along two main axes: instruction following and chatbot performance. Its best models reach 99.3% of ChatGPT's performance level on the Vicuna benchmark while consuming dramatically fewer resources, requiring only 24 hours of finetuning on a single GPU.

Instruction Tuning and Chatbot Performance

QLoRA's instruction-following capabilities are demonstrated through extensive experimentation across diverse datasets. Notably, a small, high-quality dataset can yield performance competitive with much larger ones. When used to train the Guanaco family of models, QLoRA outperformed previously released open models on various benchmarks.

Advantages and Implications

QLoRA's efficiency significantly democratizes the training of LLMs, enabling researchers with limited resources to train massive models locally. This opens new possibilities for applications that demand privacy and local execution, such as on-device personal assistants.

Additionally, the strong performance of QLoRA on instruction-following tasks may encourage broader adoption in applications requiring model customization through simple finetuning, further bridging the gap between generic and specialized LLMs.

Conclusion

QLoRA represents a pivotal advancement in the finetuning of LLMs, successfully reducing the necessary hardware resources without sacrificing performance. As its techniques become more integrated with educational tools, business applications, and personal assistants, QLoRA can spearhead accessible AI deployment across various sectors. Future work may explore further quantization levels and broader adapter types to enhance efficiency and application reach.
