LoRA Learns Less and Forgets Less (2405.09673v2)

Published 15 May 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Low-Rank Adaptation (LoRA) is a widely-used parameter-efficient finetuning method for LLMs. LoRA saves memory by training only low rank perturbations to selected weight matrices. In this work, we compare the performance of LoRA and full finetuning on two target domains, programming and mathematics. We consider both the instruction finetuning (approximately 100K prompt-response pairs) and continued pretraining (20B unstructured tokens) data regimes. Our results show that, in the standard low-rank settings, LoRA substantially underperforms full finetuning. Nevertheless, LoRA better maintains the base model's performance on tasks outside the target domain. We show that LoRA mitigates forgetting more than common regularization techniques such as weight decay and dropout; it also helps maintain more diverse generations. Finally, we show that full finetuning learns perturbations with a rank that is 10-100X greater than typical LoRA configurations, possibly explaining some of the reported gaps. We conclude by proposing best practices for finetuning with LoRA.

Comparing Low-Rank Adaptation (LoRA) and Full Finetuning for LLMs on Programming and Mathematics

Introduction

Finetuning LLMs with billions of parameters can be resource-intensive. Low-Rank Adaptation (LoRA) eases this burden by training only low-rank perturbations to selected weight matrices while keeping the pretrained weights frozen. This paper assesses how LoRA stacks up against full finetuning in two target domains: programming and mathematics.
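
As a rough sketch of the mechanism (an illustration, not the authors' code): a LoRA layer keeps the pretrained weight W frozen and learns only a low-rank update BA, scaled by alpha/r, so far fewer parameters receive gradients.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: output = base(x) + (alpha/r) * x @ A^T @ B^T, base weights frozen."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                  # pretrained weights stay fixed
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))     # zero init: update starts at zero
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```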

Key Findings

LoRA's Performance vs Full Finetuning

The paper compared LoRA and full finetuning across two key training regimes: Instruction Finetuning (IFT) and Continued Pretraining (CPT). The results show that full finetuning almost always outperforms LoRA, particularly in code-related tasks. Here's a snapshot of what they found:

  • Programming:
    • In IFT, the best-performing LoRA setup achieved a maximum HumanEval score of 0.407, falling short of full finetuning's peak score of 0.497.
    • In CPT, LoRA peaked at a HumanEval score of 0.175, while full finetuning achieved 0.263.
  • Mathematics:
    • For Math IFT, LoRA closed more of the gap than it did for code, achieving a GSM8K score of 0.622 versus full finetuning's 0.642.
    • In Math CPT, LoRA reached a GSM8K score of 0.187 at 8.6B tokens, whereas full finetuning hit 0.230.

Learning and Forgetting

One of LoRA's touted benefits is its ability to act as a form of regularizer, maintaining the base model's performance on non-target tasks. The findings show:

  • LoRA degrades the base model's performance less on domains unrelated to the target task. For instance, in code IFT, even as full finetuning pushed HumanEval scores up to 0.464 and 0.497, it also caused noticeable degradation on a composite forgetting metric (the average of HellaSwag, ARC-Challenge, and WinoGrande accuracy; a toy sketch of this average follows the list).
  • In contrast, LoRA's forgetting score stayed comparatively stable, suggesting it better preserves the model's broad capabilities.
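
For concreteness, the composite forgetting metric is simply the mean accuracy over those three benchmarks; the toy helper below (with made-up numbers, not the paper's results) shows the arithmetic.

```python
def forgetting_score(acc: dict[str, float]) -> float:
    """Mean accuracy over the three held-out benchmarks used as the forgetting metric."""
    benchmarks = ("hellaswag", "arc_challenge", "winogrande")
    return sum(acc[b] for b in benchmarks) / len(benchmarks)

# Hypothetical accuracies, purely for illustration:
base_model = forgetting_score({"hellaswag": 0.78, "arc_challenge": 0.53, "winogrande": 0.74})
finetuned  = forgetting_score({"hellaswag": 0.70, "arc_challenge": 0.48, "winogrande": 0.69})
print(f"drop relative to base model: {base_model - finetuned:.3f}")
```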

Regularization Properties

Though LoRA underperforms in raw accuracy, it offers a few perks:

  • Stronger Regularization than Common Techniques: LoRA emerged as a stronger regularizer compared to weight decay and dropout.
  • Maintaining Diversity in Generations: In code tasks, LoRA maintained more diverse token generations than full finetuning, avoiding collapse to a narrow set of solutions (one crude proxy for this is sketched after this list).
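
The paper inspects the generated solutions themselves; as one crude, assumed proxy (not necessarily the paper's exact measure), you could count how many distinct completions a model produces per problem when sampling repeatedly:

```python
from collections import defaultdict

def unique_solution_fraction(samples: list[tuple[str, str]]) -> float:
    """samples: (problem_id, generated_solution) pairs from repeated sampling.
    Returns the mean fraction of distinct solutions per problem; lower values
    indicate the model has collapsed onto a narrow set of outputs."""
    by_problem = defaultdict(list)
    for problem_id, solution in samples:
        by_problem[problem_id].append(solution.strip())
    fractions = [len(set(sols)) / len(sols) for sols in by_problem.values()]
    return sum(fractions) / len(fractions)
```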

Spectral Analysis

One intriguing question the paper explores is whether the weight perturbations introduced by full finetuning are themselves low-rank, which would justify LoRA's core design assumption. They found that:

  • Full finetuning results in high-rank perturbations, even early in training, across nearly all model layers.
  • The rank of these perturbations increased as training progressed, which could explain why LoRA's low-rank constraints lead to performance gaps.
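
A minimal sketch of the kind of analysis this implies (an assumed helper, not the authors' code): diff a finetuned weight matrix against its base counterpart and count how many singular values are needed to capture most of the perturbation's spectral energy.

```python
import torch

def effective_rank(w_finetuned: torch.Tensor, w_base: torch.Tensor,
                   energy: float = 0.90) -> int:
    """Smallest number of singular values of (W_finetuned - W_base) whose
    squared magnitudes account for `energy` of the total squared spectrum."""
    delta = (w_finetuned - w_base).float()
    s = torch.linalg.svdvals(delta)                          # singular values, descending
    cumulative = torch.cumsum(s**2, dim=0) / torch.sum(s**2)
    return int(torch.searchsorted(cumulative, torch.tensor(energy)).item()) + 1
```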

Practical Recommendations

To make LoRA as effective as possible, the paper offers some best practices:

  1. Identify the optimal learning rate: For LoRA, optimal learning rates were found to be substantially higher than those for full finetuning.
  2. Target all relevant modules: Instead of limiting LoRA to a subset of modules (e.g., attention only), targeting all applicable modules improved its performance significantly.
  3. Choose rank based on constraints: Although higher ranks yield better performance, even lower ranks, such as 16, offer a good balance of performance and memory efficiency (see the configuration sketch after this list).
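
Translated into a concrete setup, these recommendations might look like the sketch below using the Hugging Face peft library. The module names assume a Llama-style architecture, and the hyperparameters are illustrative defaults, not the paper's exact values.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative hyperparameters; sweep them for your own setup.
lora_config = LoraConfig(
    r=16,                      # rank 16: a reasonable performance/memory trade-off
    lora_alpha=32,
    lora_dropout=0.05,
    # Target all linear modules (attention *and* MLP), per the paper's advice,
    # rather than only the attention projections. Names assume a Llama-style model.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# LoRA typically wants a substantially higher learning rate than full finetuning;
# tune it separately rather than reusing the full-finetuning value.
```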

Future Implications

The paper gives us a clearer picture of LoRA's strengths and limitations:

  • Domain-Specific Regularization: LoRA's ability to regularize while finetuning is beneficial for tasks requiring broader LLM capabilities to be retained.
  • Scalability Considerations: Although this paper focused on models up to 13B parameters, further studies could explore if these gaps close with even larger models.

Overall, LoRA offers a more memory-efficient but generally less performant alternative to full finetuning, and it is especially valuable when retaining the base model's performance on broader tasks matters. While it may not be the top choice for absolute target-domain accuracy, it remains an important tool for efficiently adapting large-scale models.

Authors (12)
  1. Dan Biderman
  2. Jacob Portes
  3. Mansheej Paul
  4. Philip Greengard
  5. Connor Jennings
  6. Daniel King
  7. Sam Havens
  8. Vitaliy Chiley
  9. Jonathan Frankle
  10. Cody Blakeney
  11. John P. Cunningham
  12. Jose Javier Gonzalez Ortiz