Introducing ReLoRA
Research in AI shows a clear trend toward training ever-larger networks, a costly endeavor that demands vast computational resources. This paper presents an alternative approach to training these overparameterized models efficiently: ReLoRA, a method that trains large, high-rank neural networks through a sequence of low-rank updates that are periodically merged into the main weights.
The Mechanics of ReLoRA
ReLoRA is grounded in the fact that the rank of the sum of two matrices is at most the sum of their individual ranks: rank(A + B) ≤ rank(A) + rank(B). The flip side of this bound is that a sum of several rank-r updates can have rank well above r, and this is exactly what ReLoRA exploits. The method starts from LoRA, a low-rank parameterization technique, and builds on it by applying a sequence of low-rank updates to the network parameters. Repeatedly merging each update into the main weights and reinitializing the trainable low-rank factors incrementally raises the effective rank of the overall update.
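To make the mechanics concrete, the sketch below shows a LoRA-style linear layer with a merge-and-reinit step: a frozen base weight absorbs each low-rank update before the factors are restarted, so successive updates can push the cumulative change beyond the rank of any single one. The class name, initialization, and scaling constants are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn as nn

class ReLoRALinear(nn.Module):
    """Minimal sketch of a LoRA-style linear layer with a merge-and-reinit step.
    Names and constants are illustrative, not the paper's reference code."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(in_features, out_features, bias=False)
        self.base.weight.requires_grad_(False)  # frozen full-rank base weight
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = x W^T + scaling * x (B A)^T, where B A is the rank-r update
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

    @torch.no_grad()
    def merge_and_reinit(self) -> None:
        # Fold the current low-rank update into the base weight, then restart
        # the factors so the next interval can explore new update directions.
        self.base.weight += self.scaling * (self.lora_b @ self.lora_a)
        self.lora_a.normal_(std=0.01)
        self.lora_b.zero_()
```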
Unlike a conventional training run with stochastic gradient methods, ReLoRA modifies the optimization procedure to accommodate its repeated restarts. At regular intervals it resets both the trainable low-rank parameters (merging them into the main weights and reinitializing them) and most of the associated optimizer state, and it follows a jagged learning rate schedule that briefly drops and re-warms the learning rate after each reset. Together, these adjustments let the sequence of low-rank updates behave like a single coherent training run.
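The fragment below sketches those two ingredients: a jagged cosine learning rate that re-warms after every reset, and a partial reset of Adam's moment estimates for the low-rank parameters. The function names, warmup length, and the 1% keep fraction are simplifying assumptions made here for illustration, not the paper's exact recipe.

```python
import math

def jagged_cosine_lr(step: int, total_steps: int, base_lr: float,
                     reset_every: int, restart_warmup: int = 50) -> float:
    # "Jagged" cosine schedule: standard cosine decay, pulled down to zero and
    # quickly re-warmed after every ReLoRA reset. Warmup length is an assumption.
    cosine = 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))
    steps_since_reset = step % reset_every
    if steps_since_reset < restart_warmup:
        return cosine * steps_since_reset / restart_warmup
    return cosine

def reset_optimizer_state(optimizer, lora_params, keep_fraction: float = 0.01) -> None:
    # Prune the Adam moments of the low-rank parameters at a reset, keeping only
    # a small fraction of the largest-magnitude entries. The keep fraction and
    # the exact pruning rule are simplifying assumptions.
    for p in lora_params:
        state = optimizer.state.get(p, {})
        for key in ("exp_avg", "exp_avg_sq"):
            if key in state:
                buf = state[key]
                k = max(1, int(keep_fraction * buf.numel()))
                threshold = buf.abs().flatten().kthvalue(buf.numel() - k + 1).values
                buf.mul_((buf.abs() >= threshold).to(buf.dtype))
```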
Experimentation and Findings
The efficiency of ReLoRA was tested on transformer language models with up to 1.3 billion parameters. Despite training far fewer parameters for most of the run, ReLoRA achieved performance comparable to full network training. It also saved substantial GPU memory per device and shortened training time, with speedups that vary with model size and hardware configuration.
Sustainable and Scalable AI Training
This method offers an economically viable route to training large neural networks. By combining a short full-rank warm-start phase with subsequent low-rank updates, ReLoRA delivers significant memory savings and faster training. The benefits are even more pronounced on less advanced hardware, widening the method's reach to a broader spectrum of AI research groups.
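Putting the pieces together, a simplified end-to-end loop might look like the sketch below, which assumes the ReLoRALinear layer and helper functions from the earlier snippets and that the optimizer covers all model parameters. The warm-start length, reset interval, and learning rate are placeholder values rather than the paper's settings.

```python
def train_with_relora(model, optimizer, data_loader, loss_fn,
                      warm_start_steps=5_000, reset_every=5_000,
                      total_steps=20_000, base_lr=1e-3):
    # Assumes the ReLoRALinear layer, jagged_cosine_lr, and reset_optimizer_state
    # sketched above; all hyperparameters here are placeholders.
    relora_layers = [m for m in model.modules() if isinstance(m, ReLoRALinear)]
    lora_params = [p for m in relora_layers for p in (m.lora_a, m.lora_b)]

    # Full-rank warm start: base weights are trainable for the first interval.
    for m in relora_layers:
        m.base.weight.requires_grad_(True)

    for step, (inputs, targets) in zip(range(total_steps), data_loader):
        if step == warm_start_steps:
            # Switch to low-rank training: freeze the base weights.
            for m in relora_layers:
                m.base.weight.requires_grad_(False)

        for group in optimizer.param_groups:
            group["lr"] = jagged_cosine_lr(step, total_steps, base_lr, reset_every)

        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()

        # Periodic ReLoRA reset: fold low-rank updates into the base weights and
        # prune most of the optimizer state before the next interval.
        if step > warm_start_steps and step % reset_every == 0:
            for m in relora_layers:
                m.merge_and_reinit()
            reset_optimizer_state(optimizer, lora_params)
```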
In conclusion, ReLoRA extends the ideas behind parameter-efficient fine-tuning methods from adaptation to pretraining itself. As the research community continues to scale AI models, ReLoRA offers a promising pathway to more accessible and sustainable training, potentially reshaping how we approach the development of large neural networks.