LoRA: Low-Rank Adaptation of Large Language Models (2106.09685v2)

Published 17 Jun 2021 in cs.CL, cs.AI, and cs.LG

Abstract: An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example -- deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in LLM adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA.

The paper introduces Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning technique for large language models (LLMs). LoRA addresses the challenge of adapting large, pre-trained models to downstream tasks without incurring the computational and storage costs of full fine-tuning. The authors posit that the weight updates made during adaptation have a low intrinsic rank and therefore approximate the update matrix with a low-rank decomposition.

LoRA freezes the pre-trained model parameters and introduces trainable low-rank matrices, $A \in \mathbb{R}^{r \times k}$ and $B \in \mathbb{R}^{d \times r}$, into each layer of the Transformer architecture. The forward pass is modified as follows:

$$h = W_0 x + B A x$$

where:

  • $h$ is the output vector
  • $W_0$ is the original (frozen) weight matrix
  • $x$ is the input vector
  • $A$ is the first low-rank matrix, of size $r \times k$
  • $B$ is the second low-rank matrix, of size $d \times r$
  • $r$ is the rank of the LoRA module, with $r \ll \min(d, k)$
  • $d$ is the output dimension of the Transformer layer
  • $k$ is the input dimension of the Transformer layer

$A$ is initialized with a random Gaussian distribution, while $B$ is initialized to zero, so $BA = 0$ and the adaptation starts with no change to the original model. The output of the low-rank update is scaled by $\alpha/r$, where $\alpha$ is a constant, which reduces the need to re-tune hyperparameters when the rank $r$ is varied.
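A minimal PyTorch sketch of such a layer is shown below; the class name, shapes, and initialization scale are illustrative assumptions, not the API of the released `loralib` package.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear map W_0 plus a trainable low-rank update (alpha/r) * B A."""

    def __init__(self, d_in: int, d_out: int, r: int = 4, alpha: float = 8.0):
        super().__init__()
        # W_0: pre-trained weight, frozen during adaptation (randomly filled here for the sketch).
        self.weight = nn.Parameter(torch.empty(d_out, d_in), requires_grad=False)
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        # A ~ Gaussian, B = 0, so B A = 0 and training starts exactly at the pre-trained model.
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # A: r x k
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))        # B: d x r
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = W_0 x + (alpha/r) * B A x
        return x @ self.weight.T + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```

In practice only `lora_A` and `lora_B` are handed to the optimizer, which is what shrinks the trainable-parameter count and the associated optimizer state.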

The authors present several advantages of LoRA:

  • Reduced storage requirements: Only the low-rank matrices $A$ and $B$ need to be stored for each task, significantly reducing the storage footprint compared to storing a full fine-tuned model.
  • Increased training efficiency: By freezing the pre-trained weights, LoRA reduces the number of trainable parameters, leading to faster training times and reduced memory usage.
  • No additional inference latency: The low-rank matrices can be merged with the original weights during deployment, eliminating any additional inference latency; a sketch of this merge follows the list.
  • Modularity: LoRA modules can be easily swapped to switch between tasks, enabling efficient task-switching in production environments.
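The merging step behind the latency and task-switching points can be sketched as follows; `export_merged` is a hypothetical helper operating on the `LoRALinear` class above, not part of the released package.

```python
@torch.no_grad()
def export_merged(layer: LoRALinear) -> nn.Linear:
    """Return a plain nn.Linear whose weight is W = W_0 + (alpha/r) * B A.

    Inference then uses a single dense matmul, so the adapted model runs as fast
    as the original; switching tasks amounts to redoing this export with a
    different pair (A, B) while the frozen W_0 is shared across all tasks."""
    d_out, d_in = layer.weight.shape
    merged = nn.Linear(d_in, d_out, bias=False)
    merged.weight.copy_(layer.weight + layer.scaling * (layer.lora_B @ layer.lora_A))
    return merged
```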

The paper evaluates LoRA on a range of tasks, including natural language understanding (NLU) and natural language generation (NLG), using models such as RoBERTa, DeBERTa, GPT-2, and GPT-3.

On the GLUE benchmark, LoRA achieves performance comparable to full fine-tuning on RoBERTa base/large and DeBERTa XXL, while significantly reducing the number of trainable parameters. For example, LoRA with RoBERTa-large attains an average score of 89.0, on par with the fine-tuning baseline of 88.9.

On the E2E NLG Challenge, LoRA outperforms several baselines with comparable or fewer trainable parameters on GPT-2 medium/large. Specifically, LoRA achieves a BLEU score of 70.4 on both GPT-2 Medium and GPT-2 Large, surpassing other parameter-efficient methods.

Scaling up to GPT-3 175B, LoRA matches or exceeds the performance of full fine-tuning on WikiSQL, MNLI, and SAMSum datasets. For instance, LoRA attains an accuracy of 73.4 on WikiSQL and 91.7 on MNLI, outperforming the fine-tuning baseline.

The authors investigate the properties of the low-rank adaptation learned from downstream tasks. They find that adapting both the query ($W_q$) and value ($W_v$) projection matrices in the self-attention module yields the best performance, given a limited parameter budget. They also find that a very low rank (e.g., $r = 1$ or $2$) suffices for adapting $W_q$ and $W_v$, suggesting that the update matrix $\Delta W$ has a small "intrinsic rank."

Further analysis reveals that the top singular vector directions of the learned adaptation matrices are consistent across different random seeds, indicating that the adaptation matrix indeed has a very low rank. The adaptation matrix $\Delta W$ amplifies features that are already present in the pre-trained weight matrix $W$ but are not emphasized. The authors observe a high amplification factor, suggesting that the low-rank adaptation matrix amplifies important features for specific downstream tasks that were learned but not emphasized during pre-training.
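A rough way to reproduce the seed-consistency check: take two adaptation matrices trained with different seeds, compute their top singular directions, and measure how much the spanned subspaces overlap. The measure below (a Frobenius-norm overlap normalized to lie in [0, 1]) is written in the spirit of the paper's analysis and may differ from the exact quantity the authors report.

```python
import torch

def subspace_similarity(A1: torch.Tensor, A2: torch.Tensor, i: int, j: int) -> float:
    """Overlap between the top-i and top-j left-singular subspaces of two matrices.

    Returns a value in [0, 1]: 1 means one subspace contains the other,
    0 means they are orthogonal."""
    U1, _, _ = torch.linalg.svd(A1, full_matrices=False)
    U2, _, _ = torch.linalg.svd(A2, full_matrices=False)
    overlap = U1[:, :i].T @ U2[:, :j]   # i x j matrix of pairwise inner products
    return (overlap.norm() ** 2 / min(i, j)).item()
```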

Authors (8)
  1. Edward J. Hu
  2. Yelong Shen
  3. Phillip Wallis
  4. Zeyuan Allen-Zhu
  5. Yuanzhi Li
  6. Shean Wang
  7. Lu Wang
  8. Weizhu Chen
Citations (7,090)