LoRA+: Enhancing Performance and Efficiency in Fine-Tuning LLMs
Introduction to Low Rank Adaptation
The quest for advanced performance in natural language processing has led to the development of increasingly large language models (LLMs), encompassing tens or even hundreds of billions of parameters. While these state-of-the-art (SOTA) models demonstrate remarkable capabilities across a variety of tasks, their vast size presents significant challenges for fine-tuning, especially for users without access to substantial computational resources. Recognizing the need for more efficient fine-tuning methods that do not compromise on performance, this paper introduces an improved strategy for Low Rank Adaptation (LoRA) fine-tuning, termed LoRA+. The essence of LoRA, a method widely employed in industry for fine-tuning large models, is to adapt a model to new tasks by training only a small set of adapter parameters while keeping the original pre-trained weights frozen. This approach drastically cuts the computational cost of fine-tuning. However, our paper identifies a critical inefficiency in how the standard LoRA approach sets its learning rates, one that limits the method's effectiveness.
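For readers unfamiliar with the mechanics, the following minimal sketch illustrates the core idea in PyTorch: a frozen pre-trained linear layer augmented with a trainable low-rank update B·A. The class name, rank, scaling, and initialization are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pre-trained weights
            p.requires_grad = False
        d_out, d_in = base.weight.shape
        # A projects the input down to rank r; B projects back up.
        # B starts at zero so the adapted model initially matches the pre-trained one.
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T) @ self.lora_B.T
```

Only lora_A and lora_B receive gradients, which is why the memory and compute overhead of fine-tuning stays small relative to updating the full weight matrix.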
Identifying LoRA's Limitations
Our investigation begins with the observation that in typical LoRA implementations, the adapter matrices A and B are updated with the same learning rate. This common practice, we argue, leads to suboptimal feature learning and, consequently, poorer fine-tuning performance, especially as the model's width (embedding dimension) grows. Through an analytical study supported by a toy model, we show that this inefficiency stems from the equal-learning-rate assumption, which fails to account for the distinct roles and shapes of the matrices A and B during adaptation.
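In code, the conventional setup looks like the sketch below: a single optimizer learning rate applies to every adapter parameter, so A and B are updated at the same rate regardless of their shapes. The variable names and learning rate value are illustrative, and `model` is assumed to be built from LoRA-augmented layers such as the one sketched earlier.

```python
# Conventional LoRA fine-tuning: one learning rate shared by lora_A and lora_B.
adapter_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(adapter_params, lr=2e-4)
```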
Introducing LoRA+
To address this limitation, we propose a simple yet effective modification to the LoRA method, which we designate LoRA+. The cornerstone of our approach is the differential treatment of the learning rates for the adapter matrices: specifically, we set the learning rate of matrix B substantially higher than that of matrix A. Our theoretical analysis, rooted in the scaling limits of wide neural networks, explains why such a differentiated choice of learning rates yields more efficient feature learning during adaptation.
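In practice this amounts to placing the A and B parameters in separate optimizer parameter groups, as in the sketch below. The parameter names follow the earlier sketch, and the base rate and the ratio of 16 are illustrative assumptions rather than prescriptions; the appropriate ratio can depend on the model and task.

```python
lr_A = 2e-4
lr_ratio = 16  # illustrative choice of eta_B / eta_A

params_A = [p for n, p in model.named_parameters() if p.requires_grad and "lora_A" in n]
params_B = [p for n, p in model.named_parameters() if p.requires_grad and "lora_B" in n]

# LoRA+-style setup: the B matrices get a larger learning rate than the A matrices.
optimizer = torch.optim.AdamW([
    {"params": params_A, "lr": lr_A},
    {"params": params_B, "lr": lr_A * lr_ratio},
])
```

Because this only changes how existing parameters are grouped in the optimizer, it adds no parameters and no extra compute per step.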
Empirical Validation and Implications
The theoretical insights guiding the development of LoRA+ are substantiated empirically through extensive experimentation across multiple LLMs and fine-tuning tasks. Our results show that LoRA+, with its differential learning rate strategy, not only improves performance by 1%-2% but also speeds up fine-tuning by up to roughly 2x, at no additional computational cost compared to the standard LoRA method. These findings hold across a variety of settings, including tasks of varying complexity and models of different sizes.
Future Outlook
While LoRA+ significantly improves upon the efficiency and performance of standard LoRA, determining the optimal ratio of learning rates for matrices A and B remains an open question. This ratio can be model- and task-dependent, suggesting that a tailored choice may be necessary to maximize gains in specific contexts. Future research in this area may provide more nuanced guidelines for practitioners, enabling even greater efficiency and effectiveness in fine-tuning LLMs.
Conclusion
In summary, LoRA+, with its novel differential learning rate strategy for adapter matrices, presents a powerful adjustment to the established LoRA method for fine-tuning LLMs. By addressing a previously unrecognized inefficiency in the standard approach, LoRA+ opens the door to more effective and efficient utilization of large pre-trained models across a wider range of tasks and computational settings.