
LoRA: Low-Rank Adaptation of Large Language Models

Published 17 Jun 2021 in cs.CL, cs.AI, and cs.LG | (2106.09685v2)

Abstract: An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example -- deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in LLM adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA.


Summary

  • The paper introduces a novel low-rank adaptation (LoRA) method to update large language models efficiently by injecting trainable low-rank matrices.
  • It significantly reduces the number of trainable parameters, making the process scalable and suitable for resource-constrained environments.
  • Empirical evaluations across models like GPT-3 and RoBERTa show that LoRA achieves competitive or superior performance compared to full fine-tuning.

LoRA: Low-Rank Adaptation of Large Language Models

The paper "LoRA: Low-Rank Adaptation of Large Language Models" introduces a novel approach to adapting large pre-trained language models for specific downstream tasks. The technique, called Low-Rank Adaptation (LoRA), offers an efficient and scalable way to specialize these models without retraining their full parameter sets.

Motivation and Concept

As pre-trained models grow in size, like the 175 billion parameters of GPT-3, traditional fine-tuning methods become resource-intensive and impractical. LoRA addresses this by introducing trainable rank decomposition matrices into each layer of the Transformer, allowing the model weights to remain frozen during adaptation. This reduces the number of trainable parameters significantly, facilitating efficient training and deployment.

Implementation Details

LoRA modifies dense layers within a neural network by constraining updates to learnable matrices $A$ and $B$, representing the weight change as a low-rank decomposition $\Delta W = BA$. Here, $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, where $r \ll \min(d, k)$, allowing for efficient adaptation:

def lora_forward(x, W_0, A, B, alpha=16):
    # A.shape[0] is the rank r; the update is scaled by alpha / r
    Delta_W = (B @ A) * (alpha / A.shape[0])
    # Frozen base path plus the trainable low-rank path
    return W_0 @ x + Delta_W @ x

Thus, LoRA minimally alters the architecture while avoiding additional deployment latency: after training, the product $BA$ can be merged into $W_0$, so inference uses a single weight matrix.
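A minimal NumPy sketch of this merge, using illustrative dimensions (not taken from the paper), shows that the merged and unmerged forward passes produce identical outputs:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 8, 6, 2, 16

W_0 = rng.standard_normal((d, k))   # frozen pre-trained weight
B = np.zeros((d, r))                # B starts at zero, as in the paper
A = rng.standard_normal((r, k))
x = rng.standard_normal(k)

# Unmerged forward: base path plus scaled low-rank path
h_unmerged = W_0 @ x + (alpha / r) * (B @ A) @ x

# Merged forward: fold the update into a single weight matrix
W_merged = W_0 + (alpha / r) * (B @ A)
h_merged = W_merged @ x

assert np.allclose(h_unmerged, h_merged)  # no extra inference latency
```

Because the merged model has exactly the same shape as the original, serving infrastructure needs no changes.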

Practical Benefits and Trade-offs

The primary advantage of LoRA lies in its resource efficiency: reducing the GPU memory required for training and the storage required per model makes it suitable for deployment contexts with limited infrastructure. This is particularly advantageous in environments that require frequent switching between multiple task-specific models. However, once the low-rank matrices are merged into the base weights, inputs from different tasks cannot be batched in a single forward pass unless additional mechanisms, such as per-input selection of LoRA modules, are integrated.

Figure 1: GPT-3 175B validation accuracy vs. number of trainable parameters of several adaptation methods on WikiSQL and MNLI-matched. LoRA exhibits better scalability and task performance.

Performance Evaluation

Extensive empirical evaluation across diverse tasks and models, including RoBERTa, DeBERTa, GPT-2, and GPT-3, demonstrates LoRA's capability to achieve competitive or superior performance compared to full fine-tuning. Analysis of LoRA's rank sufficiency reveals that low-rank updates capture essential task-specific information, supported by subspace similarity measures showing that a very small rank suffices for effective adaptation.

Theoretical Insights and Limitations

LoRA posits that the adaptive updates in large language models exhibit low intrinsic dimensionality. This hypothesis is supported through subspace analysis, where the directions captured by the low-rank matrices effectively encompass the critical aspects necessary for task-specific adaptation. Nevertheless, the technique relies on heuristics for choosing which weight matrices to adapt and what rank to use, which presents an opportunity for further research.

Figure 2: Left and Middle: Normalized subspace similarity between the column vectors of $A_{r=64}$ from multiple random seeds, confirming consistent low-rank capture across variations.

Conclusion and Future Directions

LoRA stands as a robust alternative to conventional fine-tuning approaches, allowing for significant parameter efficiency gains and reduced resource demands. Its promising results invite further investigations into combining it with other adaptation techniques, optimizing rank selections, and expanding its application beyond the existing scope to potentially redefine parameter-efficient model adaptation in NLP.

The development of LoRA notably contributes to easing the deployment of effective NLP systems in resource-constrained environments, aligning with contemporary challenges in scalability and sustainability in machine learning.


Explain it Like I'm 14

Overview

This paper introduces a simple, smart way to customize very large language models (like GPT-3) for new tasks without retraining the whole model. The method is called LoRA, which stands for Low-Rank Adaptation. The big idea: instead of changing all the model's billions of settings ("parameters"), you keep the original model frozen and only train a few small add-on pieces that gently nudge it toward the new task.

Key Objectives

The paper set out to answer three plain questions:

  • How can we adapt huge LLMs to new tasks without needing massive computer power and storage?
  • Can a small number of changes be enough to make a big model work well on lots of different tasks?
  • Will this approach be fast at runtime and still match or beat the usual full fine-tuning in quality?

How LoRA Works (Simple Explanation)

Think of a giant, high-end camera that’s already trained to take great photos in many conditions. Full fine-tuning is like opening the camera and reconfiguring every internal part for each new type of shot—slow, risky, and expensive. LoRA is like clipping on a tiny lens filter to adjust the picture for your new scene. You don’t change the camera; you just add a lightweight attachment.

Here’s the everyday version of the technical idea:

  • A Transformer (the type of model used in GPT) has many “weight matrices”—big grids of numbers that transform inputs into outputs.
  • Full fine-tuning updates all those big grids, which is costly for huge models.
  • LoRA freezes the original big grids and adds two small helper matrices, called A and B, per chosen layer. When you multiply B and A, you get a tiny “change matrix” that gently adjusts the original weights.
  • This “low-rank” change means the adjustment is limited to a few “directions,” like moving a little bit north-east instead of exploring every possible direction. “Rank” is just the number of those directions; a small rank (like 1–4) can still do a lot.
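To make "a few directions" concrete, here is a small arithmetic sketch (the dimensions are illustrative, not taken from the paper) comparing the parameters in a full weight update to a rank-r LoRA update:

```python
# Full fine-tuning updates every entry of a d x k weight matrix;
# LoRA trains only B (d x r) and A (r x k) instead.
d, k = 4096, 4096      # illustrative Transformer projection size
r = 4                  # LoRA rank

full_params = d * k            # 16,777,216 trainable numbers
lora_params = d * r + r * k    # 32,768 trainable numbers

print(full_params // lora_params)  # -> 512x fewer per adapted matrix
```

The savings grow with the matrix size: the full update scales as d*k while LoRA scales as r*(d+k), which is why the reduction on a model as large as GPT-3 can reach factors in the thousands.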

Practical details:

  • The paper mostly applies LoRA to the attention parts of Transformers (often the query and value projection matrices, named Wq and Wv), because they matter a lot for how the model focuses on different words.
  • The original weights stay frozen; only the little A and B matrices are trained.
  • At deployment time, you can combine the small changes with the original weights so the model runs as fast as usual—no extra delay.

Main Findings and Why They Matter

Here are the most important takeaways put simply:

  • Far fewer trainable parameters: LoRA can reduce the number of trainable parameters by up to 10,000 times compared to full fine-tuning on GPT-3. That’s like swapping a full toolbox for a couple of tiny tools that still get the job done.
  • Lower memory and faster training: Because you’re only training the small A and B pieces, training uses around 3× less GPU memory and often runs faster. This lowers costs and makes customization more accessible.
  • No extra runtime delay: Unlike some other methods (like “adapters”) that add extra steps when the model runs, LoRA adds no extra latency. You can merge the changes so the model behaves like a fully fine-tuned one.
  • Performance matches or beats full fine-tuning: Across many models and tasks—RoBERTa, DeBERTa, GPT-2, and even huge GPT-3—LoRA often matches or outperforms full fine-tuning and other parameter-efficient techniques (like adapters or prompt/prefix tuning).
  • Very small ranks work: Surprisingly, even tiny ranks (like r=1 or r=2) were enough for good results in several tasks. This suggests that the specific changes needed to adapt an LLM are often simple and lie in just a few “directions.”
  • Best layers to adapt: Adapting the attention projections for queries and values (Wq and Wv) together tended to give strong results for the same parameter budget.

The paper backs these claims with experiments on:

  • Understanding tasks (GLUE benchmark) using RoBERTa and DeBERTa.
  • Generation tasks using GPT-2 (like the E2E NLG Challenge).
  • Big tasks using GPT-3 (WikiSQL for SQL generation, MNLI for natural language inference, SAMSum for conversation summarization).

Implications and Impact

In plain terms, LoRA makes it much easier to personalize giant LLMs for many different uses:

  • Companies can keep one big base model and swap tiny LoRA “modules” for different tasks, saving storage and making updates faster.
  • Developers with fewer resources can still adapt top models, opening the door to more innovation and fair access.
  • Because LoRA doesn’t slow down the model when it runs, it’s suitable for real-world apps that care about quick responses.
  • LoRA can be combined with other techniques, giving even more flexibility.
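The "swap tiny LoRA modules" idea above can be sketched as follows. This is a toy illustration with made-up task names and NumPy arrays, not the paper's released package: one frozen base weight is shared, and each task contributes only its own small (B, A) pair.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 8, 6, 2, 16
W_0 = rng.standard_normal((d, k))   # one shared, frozen base weight

# One small (B, A) pair per task; each is tiny compared to W_0.
# B starts at zero (as in the paper), so an untrained module
# leaves the base model's behavior unchanged.
modules = {
    "summarization": (np.zeros((d, r)), rng.standard_normal((r, k))),
    "sql_generation": (np.zeros((d, r)), rng.standard_normal((r, k))),
}

def forward(x, task):
    B, A = modules[task]            # swap in the task's adapter
    return W_0 @ x + (alpha / r) * (B @ A) @ x

x = rng.standard_normal(k)
h = forward(x, "summarization")     # base model + one tiny add-on
```

Switching tasks means loading a few thousand numbers instead of a full copy of the model, which is what makes storing and serving many task variants cheap.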

Overall, LoRA shows that you don’t need to overhaul a massive model to make it great at new tasks—smart, small adjustments can be enough. This could speed up progress in language AI while keeping costs and energy use in check.

