Comparing Low-Rank Adaptation (LoRA) and Full Finetuning for LLMs on Programming and Mathematics
Introduction
Finetuning LLMs with billions of parameters is resource-intensive. Low-Rank Adaptation (LoRA) eases this burden by training only low-rank perturbations to selected weight matrices while keeping the pretrained weights frozen. This paper assesses how LoRA compares to full finetuning in two target domains: programming and mathematics.
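To make the idea concrete, here is a minimal sketch of a LoRA-style linear layer in PyTorch. It is illustrative only, not the paper's training code; the rank r=16 and scaling alpha=32 are example values, and real implementations (e.g., the peft library) handle many more details.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update:
    h = W x + (alpha / r) * B A x, with A of shape (r, d_in) and B of shape (d_out, r)."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pretrained weights
            p.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

# Only lora_A and lora_B receive gradients; the large base weight stays frozen.
layer = LoRALinear(nn.Linear(4096, 4096), r=16)
```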
Key Findings
LoRA's Performance vs. Full Finetuning
The paper compared LoRA and full finetuning across two key training regimes: Instruction Finetuning (IFT) and Continued Pretraining (CPT). The results show that full finetuning almost always outperforms LoRA, particularly in code-related tasks. Here's a snapshot of what they found:
- Programming:
- In IFT, the best-performing LoRA setup achieved a maximum HumanEval score of 0.407, falling short of full finetuning's peak score of 0.497.
- In CPT, LoRA peaked at a HumanEval score of 0.175, while full finetuning achieved 0.263.
- Mathematics:
- For math IFT, LoRA closed much of the gap, reaching a GSM8K score of 0.622 against full finetuning's 0.642.
- In Math CPT, LoRA reached GSM8K=0.187 at 8.6B tokens, whereas full finetuning hit 0.230.
Learning and Forgetting
One of LoRA's touted benefits is that it acts as a form of regularization, maintaining the base model's performance on non-target tasks. The findings show:
- LoRA forgets less of the base model's capabilities on domains unrelated to the target task. For instance, in code IFT, even as full finetuning pushed HumanEval scores to 0.464 and 0.497, it also caused noticeable degradation on a composite forgetting metric, the average of HellaSwag, ARC-Challenge, and WinoGrande accuracies (a small sketch of this metric follows this list).
- In contrast, LoRA maintained a relatively stable forgetting score, suggesting it helps the model retain its broad capabilities better.
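For concreteness, the composite forgetting metric is simply the average accuracy across the three held-out benchmarks. The sketch below uses made-up numbers to show how a drop in that average signals forgetting; the values are illustrative, not results from the paper.

```python
def forgetting_score(accuracies: dict[str, float]) -> float:
    """Average accuracy over the three benchmarks used as the forgetting metric."""
    benchmarks = ["hellaswag", "arc_challenge", "winogrande"]
    return sum(accuracies[b] for b in benchmarks) / len(benchmarks)

# Illustrative (made-up) numbers: a drop in this average indicates forgetting.
base      = {"hellaswag": 0.78, "arc_challenge": 0.53, "winogrande": 0.74}
finetuned = {"hellaswag": 0.72, "arc_challenge": 0.48, "winogrande": 0.70}
print(f"base: {forgetting_score(base):.3f}  finetuned: {forgetting_score(finetuned):.3f}")
```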
Regularization Properties
Though LoRA underperforms in raw accuracy, it offers a few perks:
- Stronger Regularization than Common Techniques: LoRA emerged as a stronger regularizer than weight decay and dropout.
- Maintaining Diversity in Generations: On code tasks, LoRA maintained more diverse generations than full finetuning, avoiding collapse to a narrow set of solutions (one simple way to quantify this is sketched below).
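One simple proxy for this kind of diversity, not necessarily the exact measure used in the paper, is the fraction of unique completions among the k solutions sampled for each problem:

```python
def unique_solution_fraction(samples: dict[str, list[str]]) -> float:
    """Mean fraction of distinct completions among the k samples drawn per problem."""
    per_problem = [len(set(gens)) / len(gens) for gens in samples.values()]
    return sum(per_problem) / len(per_problem)

# Toy example: a "collapsed" model repeats the same solution for every sample.
diverse   = {"problem_1": ["return x + 1", "return 1 + x", "return x + 1"]}
collapsed = {"problem_1": ["return x + 1", "return x + 1", "return x + 1"]}
print(unique_solution_fraction(diverse), unique_solution_fraction(collapsed))  # ~0.67 vs ~0.33
```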
Spectral Analysis
One intriguing question the paper explores is whether the weight perturbations introduced by full finetuning are themselves low-rank, a key assumption behind LoRA's design. They found that:
- Full finetuning results in high-rank perturbations, even early in training, across nearly all model layers.
- The rank of these perturbations increases as training progresses, which could explain why LoRA's low-rank constraint leads to performance gaps (a sketch of this kind of spectral analysis follows).
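A rough sketch of this kind of analysis, assuming you have both the base and finetuned copies of a weight matrix: take the SVD of the perturbation and count how many singular values are needed to capture most of its spectral mass. The 0.9 threshold and matrix sizes below are illustrative, and the paper's exact spectral measure may differ.

```python
import torch

def rank_to_explain(w_base: torch.Tensor, w_finetuned: torch.Tensor, threshold: float = 0.9) -> int:
    """Number of singular values of delta_W = W_finetuned - W_base needed to
    capture `threshold` of its squared spectral mass (i.e., of ||delta_W||_F^2)."""
    delta = (w_finetuned - w_base).float()
    s = torch.linalg.svdvals(delta)                    # singular values, descending
    energy = torch.cumsum(s ** 2, dim=0) / torch.sum(s ** 2)
    return int((energy < threshold).sum().item()) + 1

# Illustrative comparison: a genuinely low-rank update vs. a full-rank one.
w0 = torch.randn(512, 512)
low_rank_update  = 0.01 * torch.randn(512, 8) @ torch.randn(8, 512)
full_rank_update = 0.01 * torch.randn(512, 512)
print(rank_to_explain(w0, w0 + low_rank_update))   # small (<= 8)
print(rank_to_explain(w0, w0 + full_rank_update))  # large (hundreds)
```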
Practical Recommendations
To make LoRA as effective as possible, the paper offers some best practices:
- Identify the Optimal Learning Rate: For LoRA, optimal learning rates were found to be substantially higher than those for full finetuning.
- Target All Relevant Modules: Instead of limiting LoRA to certain layers, targeting all applicable modules improved its performance significantly.
- Choose Rank Based on Constraints: Although higher ranks yield better performance, lower ranks such as 16 still offer a good balance of performance and memory efficiency (a configuration sketch combining these recommendations follows).
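Putting these recommendations together, here is what a LoRA setup might look like with the Hugging Face peft library. The model name, rank, dropout, and module list are illustrative choices, not the paper's exact configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=16,                       # higher ranks help, but 16 balances quality and memory
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[            # target all attention and MLP projections, not just a subset
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, config)
model.print_trainable_parameters()
# Remember to sweep the learning rate: LoRA's optimum tends to sit well above full finetuning's.
```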
Future Implications
The paper gives us a clearer picture of LoRA's strengths and limitations:
- Domain-Specific Regularization: LoRA's ability to regularize while finetuning is beneficial for tasks requiring broader LLM capabilities to be retained.
- Scalability Considerations: Although this paper focused on models up to 13B parameters, further studies could explore whether these gaps close with even larger models.
Overall, LoRA presents a more memory-efficient but somewhat less effective alternative to full finetuning, especially valuable when retaining the base model's performance on broader tasks matters. While it may not be the top choice when absolute target-domain performance is the goal, it remains a critical tool in the toolbox for efficiently adapting large-scale models.