
LoRA: Low-Rank Adaptation of Large Language Models

Published 17 Jun 2021 in cs.CL, cs.AI, and cs.LG | (2106.09685v2)

Abstract: An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example -- deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in LLM adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA.


Summary

  • The paper introduces a novel low-rank adaptation (LoRA) method to update large language models efficiently by injecting trainable low-rank matrices.
  • It significantly reduces the number of trainable parameters, making the process scalable and suitable for resource-constrained environments.
  • Empirical evaluations across models like GPT-3 and RoBERTa show that LoRA achieves competitive or superior performance compared to full fine-tuning.

LoRA: Low-Rank Adaptation of Large Language Models

The paper "LoRA: Low-Rank Adaptation of Large Language Models" introduces a novel approach to adapting large pre-trained language models for specific downstream tasks. The technique, called Low-Rank Adaptation (LoRA), offers an efficient and scalable way to specialize these models without retraining their full parameter sets.

Motivation and Concept

As pre-trained models grow in size, like the 175 billion parameters of GPT-3, traditional fine-tuning methods become resource-intensive and impractical. LoRA addresses this by introducing trainable rank decomposition matrices into each layer of the Transformer, allowing the model weights to remain frozen during adaptation. This reduces the number of trainable parameters significantly, facilitating efficient training and deployment.

Implementation Details

LoRA modifies dense layers within a neural network by constraining updates to learnable matrices $A$ and $B$, representing the weight change as a low-rank decomposition $\Delta W = BA$. Here, $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, where $r \ll \min(d, k)$, allowing for efficient adaptation:

def lora_forward(x, W_0, A, B, alpha=16):
    # A.shape[0] is the rank r; the update is scaled by alpha / r
    Delta_W = (B @ A) * (alpha / A.shape[0])
    # Frozen base path plus the trainable low-rank path
    return W_0 @ x + Delta_W @ x

Thus, LoRA minimally alters the architecture while avoiding additional deployment latency: after training, the product $BA$ can be merged into $W_0$, so inference uses a single weight matrix.
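A minimal NumPy sketch of this merge, using illustrative dimensions (not taken from the paper), shows that the merged and unmerged forward passes produce identical outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 8, 6, 2, 16

W_0 = rng.standard_normal((d, k))   # frozen pre-trained weight
B = np.zeros((d, r))                # B starts at zero, as in the paper
A = rng.standard_normal((r, k))
x = rng.standard_normal(k)

# Unmerged forward: base path plus scaled low-rank path
h_unmerged = W_0 @ x + (alpha / r) * (B @ A) @ x

# Merged forward: fold the update into a single weight matrix
W_merged = W_0 + (alpha / r) * (B @ A)
h_merged = W_merged @ x

assert np.allclose(h_unmerged, h_merged)  # no extra inference latency
```

Because the merged model has exactly the same shape as the original, serving infrastructure needs no changes.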

Practical Benefits and Trade-offs

The primary advantage of LoRA lies in its resource efficiency: reducing the GPU memory required for training and the storage required per model makes it suitable for deployment contexts with limited infrastructure. This is particularly advantageous in environments that require frequent switching between multiple task-specific models. However, once the low-rank matrices are merged into the base weights, inputs from different tasks cannot be batched in a single forward pass unless additional mechanisms, such as per-input selection of LoRA modules, are integrated.

Figure 1: GPT-3 175B validation accuracy vs. number of trainable parameters of several adaptation methods on WikiSQL and MNLI-matched. LoRA exhibits better scalability and task performance.

Performance Evaluation

Extensive empirical evaluation across diverse tasks and models, including RoBERTa, DeBERTa, GPT-2, and GPT-3, demonstrates LoRA's capability to achieve competitive or superior performance compared to full fine-tuning. Analysis of LoRA's rank sufficiency reveals that low-rank updates capture essential task-specific information, supported by subspace similarity measures showing that a very small rank suffices for effective adaptation.

Theoretical Insights and Limitations

LoRA posits that the adaptive updates in large language models exhibit low intrinsic dimensionality. This hypothesis is supported through subspace analysis, where the directions captured by the low-rank matrices effectively encompass the critical aspects necessary for task-specific adaptation. Nevertheless, the technique relies on heuristics for choosing which weight matrices to adapt and what rank to use, which presents an opportunity for further research.

Figure 2: Left and Middle: Normalized subspace similarity between the column vectors of $A_{r=64}$ from multiple random seeds, confirming consistent low-rank capture across variations.

Conclusion and Future Directions

LoRA stands as a robust alternative to conventional fine-tuning approaches, allowing for significant parameter efficiency gains and reduced resource demands. Its promising results invite further investigations into combining it with other adaptation techniques, optimizing rank selections, and expanding its application beyond the existing scope to potentially redefine parameter-efficient model adaptation in NLP.

The development of LoRA notably contributes to easing the deployment of effective NLP systems in resource-constrained environments, aligning with contemporary challenges in scalability and sustainability in machine learning.


Explain it Like I'm 14

Overview

This paper introduces a simple, smart way to customize very large language models (like GPT-3) for new tasks without retraining the whole model. The method is called LoRA, which stands for Low-Rank Adaptation. The big idea: instead of changing all the model's billions of settings ("parameters"), you keep the original model frozen and only train a few small add-on pieces that gently nudge it toward the new task.

Key Objectives

The paper set out to answer three plain questions:

  • How can we adapt huge LLMs to new tasks without needing massive computer power and storage?
  • Can a small number of changes be enough to make a big model work well on lots of different tasks?
  • Will this approach be fast at runtime and still match or beat the usual full fine-tuning in quality?

How LoRA Works (Simple Explanation)

Think of a giant, high-end camera that’s already trained to take great photos in many conditions. Full fine-tuning is like opening the camera and reconfiguring every internal part for each new type of shot—slow, risky, and expensive. LoRA is like clipping on a tiny lens filter to adjust the picture for your new scene. You don’t change the camera; you just add a lightweight attachment.

Here’s the everyday version of the technical idea:

  • A Transformer (the type of model used in GPT) has many “weight matrices”—big grids of numbers that transform inputs into outputs.
  • Full fine-tuning updates all those big grids, which is costly for huge models.
  • LoRA freezes the original big grids and adds two small helper matrices, called A and B, per chosen layer. When you multiply B and A, you get a tiny “change matrix” that gently adjusts the original weights.
  • This “low-rank” change means the adjustment is limited to a few “directions,” like moving a little bit north-east instead of exploring every possible direction. “Rank” is just the number of those directions; a small rank (like 1–4) can still do a lot.
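To make "a few directions" concrete, here is a small arithmetic sketch (the dimensions are illustrative, not taken from the paper) comparing the parameters in a full weight update to a rank-r LoRA update:

```python
# Full fine-tuning updates every entry of a d x k weight matrix;
# LoRA trains only B (d x r) and A (r x k) instead.
d, k = 4096, 4096      # illustrative Transformer projection size
r = 4                  # LoRA rank

full_params = d * k            # 16,777,216 trainable numbers
lora_params = d * r + r * k    # 32,768 trainable numbers

print(full_params // lora_params)  # -> 512x fewer per adapted matrix
```

The savings grow with the matrix size: the full update scales as d*k while LoRA scales as r*(d+k), which is why the reduction on a model as large as GPT-3 can reach factors in the thousands.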

Practical details:

  • The paper mostly applies LoRA to the attention parts of Transformers (often the query and value projection matrices, named Wq and Wv), because they matter a lot for how the model focuses on different words.
  • The original weights stay frozen; only the little A and B matrices are trained.
  • At deployment time, you can combine the small changes with the original weights so the model runs as fast as usual—no extra delay.

Main Findings and Why They Matter

Here are the most important takeaways put simply:

  • Far fewer trainable parameters: LoRA can reduce the number of trainable parameters by up to 10,000 times compared to full fine-tuning on GPT-3. That’s like swapping a full toolbox for a couple of tiny tools that still get the job done.
  • Lower memory and faster training: Because you’re only training the small A and B pieces, training uses around 3× less GPU memory and often runs faster. This lowers costs and makes customization more accessible.
  • No extra runtime delay: Unlike some other methods (like “adapters”) that add extra steps when the model runs, LoRA adds no extra latency. You can merge the changes so the model behaves like a fully fine-tuned one.
  • Performance matches or beats full fine-tuning: Across many models and tasks—RoBERTa, DeBERTa, GPT-2, and even huge GPT-3—LoRA often matches or outperforms full fine-tuning and other parameter-efficient techniques (like adapters or prompt/prefix tuning).
  • Very small ranks work: Surprisingly, even tiny ranks (like r=1 or r=2) were enough for good results in several tasks. This suggests that the specific changes needed to adapt an LLM are often simple and lie in just a few “directions.”
  • Best layers to adapt: Adapting the attention projections for queries and values (Wq and Wv) together tended to give strong results for the same parameter budget.

The paper backs these claims with experiments on:

  • Understanding tasks (GLUE benchmark) using RoBERTa and DeBERTa.
  • Generation tasks using GPT-2 (like the E2E NLG Challenge).
  • Big tasks using GPT-3 (WikiSQL for SQL generation, MNLI for natural language inference, SAMSum for conversation summarization).

Implications and Impact

In plain terms, LoRA makes it much easier to personalize giant LLMs for many different uses:

  • Companies can keep one big base model and swap tiny LoRA “modules” for different tasks, saving storage and making updates faster.
  • Developers with fewer resources can still adapt top models, opening the door to more innovation and fair access.
  • Because LoRA doesn’t slow down the model when it runs, it’s suitable for real-world apps that care about quick responses.
  • LoRA can be combined with other techniques, giving even more flexibility.
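The "swap tiny LoRA modules" idea above can be sketched as follows. This is a toy illustration with made-up task names and NumPy arrays, not the paper's released package: one frozen base weight is shared, and each task contributes only its own small (B, A) pair.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 8, 6, 2, 16
W_0 = rng.standard_normal((d, k))   # one shared, frozen base weight

# One small (B, A) pair per task; each is tiny compared to W_0.
# B starts at zero (as in the paper), so an untrained module
# leaves the base model's behavior unchanged.
modules = {
    "summarization": (np.zeros((d, r)), rng.standard_normal((r, k))),
    "sql_generation": (np.zeros((d, r)), rng.standard_normal((r, k))),
}

def forward(x, task):
    B, A = modules[task]            # swap in the task's adapter
    return W_0 @ x + (alpha / r) * (B @ A) @ x

x = rng.standard_normal(k)
h = forward(x, "summarization")     # base model + one tiny add-on
```

Switching tasks means loading a few thousand numbers instead of a full copy of the model, which is what makes storing and serving many task variants cheap.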

Overall, LoRA shows that you don’t need to overhaul a massive model to make it great at new tasks—smart, small adjustments can be enough. This could speed up progress in language AI while keeping costs and energy use in check.

