The paper introduces Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning technique for large language models (LLMs). LoRA addresses the challenge of adapting large, pre-trained models to downstream tasks without incurring the computational and storage costs of full fine-tuning. The authors posit that the change in weights during adaptation has a low "intrinsic rank" and propose approximating the update matrix with a low-rank decomposition.
LoRA freezes the pre-trained model parameters and introduces trainable low-rank matrices, $B$ and $A$, into each layer of the Transformer architecture. The forward pass is modified as follows:

$$h = W_0 x + \Delta W x = W_0 x + B A x$$

where:
- $h$ is the output vector
- $W_0 \in \mathbb{R}^{d \times k}$ is the original weight matrix
- $x$ is the input vector
- $B \in \mathbb{R}^{d \times r}$ is the first low-rank matrix
- $A \in \mathbb{R}^{r \times k}$ is the second low-rank matrix
- $r$ is the rank of the LoRA module, where $r \ll \min(d, k)$
- $d$ is the output dimension size of a Transformer layer
- $k$ is the input dimension size of a Transformer layer

$A$ is initialized with a random Gaussian distribution, while $B$ is initialized to zero, ensuring that the adaptation starts with no change to the original model ($\Delta W = BA = 0$ at the start of training). The output of the low-rank adaptation is scaled by $\frac{\alpha}{r}$, where $\alpha$ is a constant, to reduce the need to retune hyperparameters when $r$ is varied.
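To make these mechanics concrete, below is a minimal PyTorch sketch of a LoRA-adapted linear layer. The class and argument names (`LoRALinear`, `d_in`, `d_out`) are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal sketch of a LoRA-adapted linear layer (illustrative)."""

    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        # Frozen pre-trained weight W0; here a fresh nn.Linear stands in for
        # a weight loaded from a checkpoint.
        self.W0 = nn.Linear(d_in, d_out, bias=False)
        self.W0.weight.requires_grad_(False)
        # Trainable low-rank factors: A ~ Gaussian, B = 0, so B @ A = 0 and
        # training starts exactly at the pre-trained behavior.
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r  # the alpha / r scaling from the paper

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W0 x + (alpha / r) * B A x
        return self.W0(x) + (x @ self.A.T @ self.B.T) * self.scaling
```

With $d = k = 1024$ and $r = 8$, the trainable factors hold $2 \times 8 \times 1024 \approx 16{,}000$ parameters, versus roughly one million for the full weight matrix.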
The authors present several advantages of LoRA:
- Reduced storage requirements: Only the low-rank matrices $B$ and $A$ need to be stored for each task, significantly reducing the storage footprint compared to storing a full fine-tuned model.
- Increased training efficiency: By freezing the pre-trained weights, LoRA reduces the number of trainable parameters, leading to faster training times and reduced memory usage.
- No additional inference latency: The low-rank matrices can be merged with the original weights for deployment, eliminating any additional inference latency (see the merge sketch after this list).
- Modularity: LoRA modules can be easily swapped to switch between tasks, enabling efficient task-switching in production environments.
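The latency and modularity advantages follow directly from the algebra: $BA$ can be folded into the frozen weight for deployment and subtracted back out to swap tasks. A minimal sketch, reusing the hypothetical `LoRALinear` above:

```python
@torch.no_grad()
def merge_lora(layer: LoRALinear) -> None:
    # Fold the update into the frozen weight: W <- W0 + (alpha / r) * B A.
    # Inference then uses a single matmul, so there is no added latency.
    layer.W0.weight += layer.scaling * (layer.B @ layer.A)

@torch.no_grad()
def unmerge_lora(layer: LoRALinear) -> None:
    # Subtract the same update to recover W0, e.g. before loading another
    # task's (A, B) pair for fast task switching.
    layer.W0.weight -= layer.scaling * (layer.B @ layer.A)
```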
The paper evaluates LoRA on a range of tasks, including natural language understanding (NLU) and natural language generation (NLG), using models such as RoBERTa, DeBERTa, GPT-2, and GPT-3.
On the GLUE benchmark, LoRA achieves performance comparable to full fine-tuning on RoBERTa base/large and DeBERTa XXL, while significantly reducing the number of trainable parameters. For example, LoRA with RoBERTa-large attains an average score of 89.0, on par with the fine-tuning baseline of 88.9.
On the E2E NLG Challenge, LoRA outperforms several baselines with comparable or fewer trainable parameters on GPT-2 medium/large. Specifically, LoRA achieves a BLEU score of 70.4 with both GPT-2 Medium and GPT-2 Large, surpassing other parameter-efficient methods.
Scaling up to GPT-3 175B, LoRA matches or exceeds the performance of full fine-tuning on the WikiSQL, MNLI, and SAMSum datasets. For instance, LoRA attains an accuracy of 73.4 on WikiSQL and 91.7 on MNLI, exceeding the fine-tuning baseline on MNLI while remaining competitive on WikiSQL.
The authors investigate the properties of the low-rank adaptation learned from downstream tasks. They find that adapting both the query ($W_q$) and value ($W_v$) projection matrices in the self-attention module yields the best performance, given a limited parameter budget. They also find that a very low rank (e.g., $r = 1$ or $2$) suffices for adapting $W_q$ and $W_v$, suggesting that the update matrix has a small "intrinsic rank."
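As an illustration of how such a finding could be applied, the hypothetical helper below wraps only the query and value projections with the `LoRALinear` sketch from earlier; the attribute names `q_proj` and `v_proj` are assumptions, since naming varies across architectures.

```python
@torch.no_grad()
def add_lora_to_attention(model: nn.Module, r: int = 2, alpha: float = 4.0) -> None:
    # Wrap only W_q and W_v, mirroring the finding that adapting these two
    # projections with a small r makes the best use of a parameter budget.
    for module in list(model.modules()):  # snapshot: we mutate the tree below
        for attr in ("q_proj", "v_proj"):  # assumed names; varies by model
            child = getattr(module, attr, None)
            # Guard on bias is None because the sketch's LoRALinear is bias-free.
            if isinstance(child, nn.Linear) and child.bias is None:
                lora = LoRALinear(child.in_features, child.out_features, r, alpha)
                lora.W0.weight.copy_(child.weight)  # keep the pre-trained W0
                setattr(module, attr, lora)
```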
Further analysis reveals that the top singular vector directions of the learned adaptation matrices are consistent across different random seeds, indicating that the adaptation matrix indeed has a very low rank. The adaptation matrix amplifies features that are already present in the pre-trained weight matrix $W_0$ but are not emphasized. The authors observe a high amplification factor, suggesting that the low-rank adaptation matrix amplifies important features for specific downstream tasks that were learned but not emphasized during pre-training.
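In outline, this amplification measurement takes the SVD of the learned update $\Delta W = BA$, projects the frozen weight onto its top singular directions, and compares Frobenius norms. A minimal sketch of that computation, under the shapes defined earlier:

```python
import torch

def amplification_factor(W: torch.Tensor, B: torch.Tensor, A: torch.Tensor) -> float:
    # Learned update delta_W = B A and its top-r singular directions.
    delta_W = B @ A
    r = B.shape[1]
    U, S, Vh = torch.linalg.svd(delta_W)
    U_r, Vh_r = U[:, :r], Vh[:r, :]
    # How much of the frozen weight W lies in the update's top-r subspace.
    W_proj = U_r.T @ W @ Vh_r.T
    # A ratio well above 1 means the update amplifies directions that are
    # present but weakly emphasized in the pre-trained weights.
    return (delta_W.norm() / W_proj.norm()).item()
```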