LoRA Learns Less and Forgets Less (2405.09673v2)

Published 15 May 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Low-Rank Adaptation (LoRA) is a widely-used parameter-efficient finetuning method for LLMs. LoRA saves memory by training only low rank perturbations to selected weight matrices. In this work, we compare the performance of LoRA and full finetuning on two target domains, programming and mathematics. We consider both the instruction finetuning (approximately 100K prompt-response pairs) and continued pretraining (20B unstructured tokens) data regimes. Our results show that, in the standard low-rank settings, LoRA substantially underperforms full finetuning. Nevertheless, LoRA better maintains the base model's performance on tasks outside the target domain. We show that LoRA mitigates forgetting more than common regularization techniques such as weight decay and dropout; it also helps maintain more diverse generations. Finally, we show that full finetuning learns perturbations with a rank that is 10-100X greater than typical LoRA configurations, possibly explaining some of the reported gaps. We conclude by proposing best practices for finetuning with LoRA.

Comparing Low-Rank Adaptation (LoRA) and Full Finetuning for LLMs on Programming and Mathematics

Introduction

Finetuning LLMs with billions of parameters can be resource-intensive. Low-Rank Adaptation (LoRA) eases this burden by training only low-rank perturbations to selected weight matrices while keeping the pretrained weights frozen. This paper assesses how LoRA stacks up against full finetuning in two target domains: programming and mathematics.
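
As a rough sketch of the mechanism (an illustration, not the authors' code): a LoRA layer keeps the pretrained weight W frozen and learns only a low-rank update BA, scaled by alpha/r, so far fewer parameters receive gradients.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA sketch: output = base(x) + (alpha/r) * x @ A^T @ B^T, base weights frozen."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                  # pretrained weights stay fixed
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))     # zero init: update starts at zero
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```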

Key Findings

LoRA's Performance vs Full Finetuning

The paper compared LoRA and full finetuning across two key training regimes: Instruction Finetuning (IFT) and Continued Pretraining (CPT). The results show that full finetuning almost always outperforms LoRA, particularly in code-related tasks. Here's a snapshot of what they found:

  • Programming:
    • In IFT, the best-performing LoRA setup achieved a maximum HumanEval score of 0.407, falling short of full finetuning's peak score of 0.497.
    • In CPT, LoRA peaked at a HumanEval score of 0.175, while full finetuning achieved 0.263.
  • Mathematics:
    • For Math IFT, LoRA closed more of the gap than it did for code, achieving a GSM8K score of 0.622 versus full finetuning's 0.642.
    • In Math CPT, LoRA reached a GSM8K score of 0.187 at 8.6B tokens, whereas full finetuning hit 0.230.

Learning and Forgetting

One of LoRA's touted benefits is its ability to act as a form of regularizer, maintaining the base model's performance on non-target tasks. The findings show:

  • LoRA degrades the base model's performance less on domains unrelated to the target task. For instance, in code IFT, even as full finetuning pushed HumanEval scores up to 0.464 and 0.497, it also caused noticeable degradation on a composite forgetting metric (the average of HellaSwag, ARC-Challenge, and WinoGrande accuracy; a toy sketch of this average follows the list).
  • In contrast, LoRA's forgetting score stayed comparatively stable, suggesting it better preserves the model's broad capabilities.
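
For concreteness, the composite forgetting metric is simply the mean accuracy over those three benchmarks; the toy helper below (with made-up numbers, not the paper's results) shows the arithmetic.

```python
def forgetting_score(acc: dict[str, float]) -> float:
    """Mean accuracy over the three held-out benchmarks used as the forgetting metric."""
    benchmarks = ("hellaswag", "arc_challenge", "winogrande")
    return sum(acc[b] for b in benchmarks) / len(benchmarks)

# Hypothetical accuracies, purely for illustration:
base_model = forgetting_score({"hellaswag": 0.78, "arc_challenge": 0.53, "winogrande": 0.74})
finetuned  = forgetting_score({"hellaswag": 0.70, "arc_challenge": 0.48, "winogrande": 0.69})
print(f"drop relative to base model: {base_model - finetuned:.3f}")
```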

Regularization Properties

Though LoRA underperforms in raw accuracy, it offers a few perks:

  • Stronger Regularization than Common Techniques: LoRA emerged as a stronger regularizer compared to weight decay and dropout.
  • Maintaining Diversity in Generations: In code tasks, LoRA maintained more diverse token generations than full finetuning, avoiding collapse to a narrow set of solutions (one crude proxy for this is sketched after this list).
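
The paper inspects the generated solutions themselves; as one crude, assumed proxy (not necessarily the paper's exact measure), you could count how many distinct completions a model produces per problem when sampling repeatedly:

```python
from collections import defaultdict

def unique_solution_fraction(samples: list[tuple[str, str]]) -> float:
    """samples: (problem_id, generated_solution) pairs from repeated sampling.
    Returns the mean fraction of distinct solutions per problem; lower values
    indicate the model has collapsed onto a narrow set of outputs."""
    by_problem = defaultdict(list)
    for problem_id, solution in samples:
        by_problem[problem_id].append(solution.strip())
    fractions = [len(set(sols)) / len(sols) for sols in by_problem.values()]
    return sum(fractions) / len(fractions)
```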

Spectral Analysis

One intriguing question the paper explores is whether the weight perturbations introduced by full finetuning are themselves low-rank, which would justify LoRA's core design assumption. They found that:

  • Full finetuning results in high-rank perturbations, even early in training, across nearly all model layers.
  • The rank of these perturbations increased as training progressed, which could explain why LoRA's low-rank constraints lead to performance gaps.
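
A minimal sketch of the kind of analysis this implies (an assumed helper, not the authors' code): diff a finetuned weight matrix against its base counterpart and count how many singular values are needed to capture most of the perturbation's spectral energy.

```python
import torch

def effective_rank(w_finetuned: torch.Tensor, w_base: torch.Tensor,
                   energy: float = 0.90) -> int:
    """Smallest number of singular values of (W_finetuned - W_base) whose
    squared magnitudes account for `energy` of the total squared spectrum."""
    delta = (w_finetuned - w_base).float()
    s = torch.linalg.svdvals(delta)                          # singular values, descending
    cumulative = torch.cumsum(s**2, dim=0) / torch.sum(s**2)
    return int(torch.searchsorted(cumulative, torch.tensor(energy)).item()) + 1
```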

Practical Recommendations

To make LoRA as effective as possible, the paper offers some best practices:

  1. Identify the optimal learning rate: For LoRA, optimal learning rates were found to be substantially higher than those for full finetuning.
  2. Target all relevant modules: Instead of limiting LoRA to a subset of modules (e.g., attention only), targeting all applicable modules improved its performance significantly.
  3. Choose rank based on constraints: Although higher ranks yield better performance, even lower ranks, such as 16, offer a good balance of performance and memory efficiency (see the configuration sketch after this list).
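
Translated into a concrete setup, these recommendations might look like the sketch below using the Hugging Face peft library. The module names assume a Llama-style architecture, and the hyperparameters are illustrative defaults, not the paper's exact values.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative hyperparameters; sweep them for your own setup.
lora_config = LoraConfig(
    r=16,                      # rank 16: a reasonable performance/memory trade-off
    lora_alpha=32,
    lora_dropout=0.05,
    # Target all linear modules (attention *and* MLP), per the paper's advice,
    # rather than only the attention projections. Names assume a Llama-style model.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# LoRA typically wants a substantially higher learning rate than full finetuning;
# tune it separately rather than reusing the full-finetuning value.
```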

Future Implications

The paper gives us a clearer picture of LoRA's strengths and limitations:

  • Domain-Specific Regularization: LoRA's ability to regularize while finetuning is beneficial for tasks requiring broader LLM capabilities to be retained.
  • Scalability Considerations: Although this paper focused on models up to 13B parameters, further studies could explore if these gaps close with even larger models.

Overall, LoRA offers a more memory-efficient but generally less performant alternative to full finetuning, and it is especially valuable when retaining the base model's performance on broader tasks matters. While it may not be the top choice for absolute target-domain accuracy, it remains an important tool for efficiently adapting large-scale models.

Authors (12)
  1. Dan Biderman
  2. Jacob Portes
  3. Mansheej Paul
  4. Philip Greengard
  5. Connor Jennings
  6. Daniel King
  7. Sam Havens
  8. Vitaliy Chiley
  9. Jonathan Frankle
  10. Cody Blakeney
  11. John P. Cunningham
  12. Jose Javier Gonzalez Ortiz