QLoRA: Efficient Finetuning of Quantized LLMs
(2305.14314v1)
Published 23 May 2023 in cs.LG
Abstract: We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained LLM into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights (b) double quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage memory spikes. We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.
The paper presents NF4 quantization and double quantization techniques that significantly reduce memory usage during LLM finetuning.
It employs paged optimizers to manage CPU-GPU memory efficiently, preventing out-of-memory errors during gradient checkpointing.
The Guanaco model, fine-tuned with QLoRA, achieves performance near ChatGPT on the Vicuna benchmark, highlighting its practical benefits.
QLoRA presents an efficient fine-tuning approach for LLMs, significantly reducing memory usage while preserving performance. This is achieved through a combination of 4-bit NormalFloat (NF4) quantization, double quantization, and paged optimizers. The Guanaco model family, fine-tuned using QLoRA, attains performance levels close to ChatGPT on the Vicuna benchmark, requiring only 24 hours of fine-tuning on a single GPU.
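The sketch below shows how such a configuration might be assembled with the Hugging Face transformers and peft libraries. The model identifier, LoRA rank, and target modules are illustrative assumptions rather than the paper's exact recipe (the paper attaches adapters to all linear layers); treat this as a minimal setup sketch, not the authors' training script.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "huggyllama/llama-7b"  # placeholder checkpoint; Guanaco finetunes LLaMA models

# 4-bit NF4 quantization with double quantization, the two data-type innovations of QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # 4-bit NormalFloat data type
    bnb_4bit_use_double_quant=True,     # quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Keep the quantized base model frozen and train only the Low-Rank Adapter weights.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # illustrative subset of layers
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter parameters are trainable
```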
Key Innovations in QLoRA
QLoRA introduces three key innovations to enable efficient finetuning of LLMs: NF4 quantization, double quantization, and paged optimizers.
4-bit NormalFloat (NF4) Quantization
NF4 is a novel data type optimized for quantizing normally distributed weights, which is the typical distribution of pre-trained neural network weights. It builds on quantile quantization, which assigns an equal number of input values to each quantization bin. The NF4 levels are constructed by estimating quantiles of a theoretical N(0, 1) distribution and normalizing them into the [-1, 1] range; an input weight tensor is then quantized block by block, rescaling each block into that range by its absolute maximum and mapping each value to the nearest level. To guarantee an exact representation of zero while still using all 2^k values of a k-bit data type, the data type is made asymmetric: quantiles are estimated for 2^(k-1) values on the negative side and 2^(k-1)+1 on the positive side, and the duplicate zero is removed. Compared to FP4 and Int4, NF4 matches the distribution of pretrained weights more closely, reducing quantization error and better preserving the finetuned model's performance.
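A minimal sketch of this asymmetric construction, assuming the recipe described above; the helper names and the offset probability are illustrative choices (the offset here follows the paper's released reference code), not a definitive implementation.

```python
import numpy as np
from scipy.stats import norm

def nf4_levels(offset: float = 0.9677083) -> np.ndarray:
    """Build the 16 NF4 levels from N(0, 1) quantiles, normalized into [-1, 1]."""
    # After removing the duplicate zero, this leaves 9 non-negative levels (including an
    # exact 0 at p = 0.5) and 7 strictly negative levels, 16 in total.
    positive = norm.ppf(np.linspace(0.5, offset, 9))
    negative = -norm.ppf(np.linspace(0.5, offset, 8))[1:]
    levels = np.sort(np.concatenate([negative, positive]))
    return levels / np.abs(levels).max()   # normalize into [-1, 1]

def quantize_block(block: np.ndarray, levels: np.ndarray):
    """Absmax-rescale a 1-D block of weights and map each value to its nearest level."""
    absmax = np.abs(block).max()           # stored as the block's quantization constant
    codes = np.abs(block[:, None] / absmax - levels[None, :]).argmin(axis=1)
    return codes.astype(np.uint8), absmax  # dequantize as levels[codes] * absmax

print(nf4_levels())                        # 16 values in [-1, 1], including an exact 0
```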
Double Quantization
This technique quantizes the quantization constants (the per-block absmax scales) produced by the first 4-bit quantization step. Those first-level constants are treated as inputs to a second quantization, which yields quantized constants plus a much smaller set of second-level constants; the paper uses 8-bit floats with a block size of 256 for this second step. Double quantization reduces the memory overhead introduced by the quantization constants, further decreasing the memory footprint of the quantized model and allowing larger models to be finetuned within a given memory budget without significantly degrading performance. The paper estimates savings of approximately 0.37 bits per parameter.
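Using the block sizes reported in the paper (FP32 first-level constants over blocks of 64 weights; after double quantization, 8-bit float constants plus FP32 second-level constants over blocks of 256), the savings follow from simple arithmetic:

```python
# Overhead of the quantization constants, in bits per model parameter.
single = 32 / 64                      # 0.5 bits/parameter without double quantization
double = 8 / 64 + 32 / (64 * 256)     # ~0.127 bits/parameter with double quantization
print(f"saved: {single - double:.3f} bits/parameter")   # ~0.373
```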
Paged Optimizers
Paged optimizers leverage NVIDIA's unified memory feature, which automatically handles page-to-page transfers between CPU and GPU memory. Paged memory is allocated for the optimizer states, and when the GPU runs out of memory, these states are automatically moved to CPU RAM and then brought back to the GPU when needed for the optimizer update step. This prevents out-of-memory errors during gradient checkpointing, particularly when processing mini-batches with long sequence lengths. By paging optimizer states to CPU memory, QLoRA avoids memory spikes, allowing finetuning to proceed without crashing. While paging might introduce slowdowns, the paper reports that these were not measurable in the described experiments.
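A minimal sketch of enabling a paged optimizer, assuming a bitsandbytes release that ships the paged optimizer classes; the toy module, learning rate, and 32-bit optimizer state are illustrative choices rather than the paper's exact configuration.

```python
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(4096, 4096).cuda()   # stand-in for the LoRA-adapted model

# Optimizer state is allocated in CUDA unified (paged) memory, so the driver can evict
# it to CPU RAM during a memory spike and page it back in for the update step.
optimizer = bnb.optim.PagedAdamW32bit(model.parameters(), lr=2e-4)

loss = model(torch.randn(8, 4096, device="cuda")).pow(2).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

When finetuning through the Hugging Face Trainer, the same behavior can typically be requested by setting optim="paged_adamw_32bit" in TrainingArguments (an assumption about the current transformers API, not something specified in the paper).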
Related Work by Tim Dettmers
Tim Dettmers has also contributed to other significant works related to efficient training and compression of large models:
"LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" (Dettmers et al., 2022): Introduces a procedure for Int8 matrix multiplication for feed-forward and attention projection layers in transformers, which cuts the memory needed for inference by half while retaining full precision performance. It uses vector-wise quantization and a mixed-precision decomposition scheme to handle emergent outliers.
"The case for 4-bit precision: k-bit Inference Scaling Laws" (Dettmers et al., 2022): Explores the trade-off between bit-precision and model size in LLMs, finding that 4-bit precision is almost universally optimal for total model bits and zero-shot accuracy.
"SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression" (Dettmers et al., 2023): Introduces a new compressed format and quantization technique called Sparse-Quantized Representation (SpQR), which enables near-lossless compression of LLMs across model scales. It identifies and isolates outlier weights, storing them in higher precision while compressing all other weights to 3-4 bits.
"8-bit Optimizers via Block-wise Quantization" (Dettmers et al., 2021): Develops the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states, using block-wise dynamic quantization.
"SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient" (Ryabinin et al., 2023): Proposes SWARM parallelism, a model-parallel training algorithm designed for poorly connected, heterogeneous, and unreliable devices.
These works, including QLoRA, showcase a consistent focus on reducing the computational and memory demands of large models, making them more accessible and efficient to train and deploy.