ReLoRA: High-Rank Training Through Low-Rank Updates (2307.05695v4)

Published 11 Jul 2023 in cs.CL and cs.LG

Abstract: Despite the dominance and effectiveness of scaling, resulting in large networks with hundreds of billions of parameters, the necessity to train overparameterized models remains poorly understood, while training costs grow exponentially. In this paper, we explore parameter-efficient training techniques as an approach to training large neural networks. We introduce a novel method called ReLoRA, which utilizes low-rank updates to train high-rank networks. We apply ReLoRA to training transformer LLMs with up to 1.3B parameters and demonstrate comparable performance to regular neural network training. ReLoRA saves up to 5.5Gb of RAM per GPU and improves training speed by 9-40% depending on the model size and hardware setup. Our findings show the potential of parameter-efficient techniques for large-scale pre-training.

Introducing ReLoRA

Research in AI shows a clear trend toward training ever larger networks, a costly endeavor that demands vast computational resources. This paper presents an alternative approach to training these overparameterized models more efficiently: ReLoRA. ReLoRA trains large, high-rank neural networks by applying, and periodically merging, a sequence of low-rank updates to the network's weights.

The Mechanics of ReLoRA

ReLoRA is grounded in the fact that the rank of the sum of two matrices is at most the sum of their individual ranks. The method starts from the low-rank parameterization of LoRA and builds on it by applying a sequence of low-rank updates to the network parameters. By iteratively merging these updates into the main weights and reinitializing the trainable low-rank factors, ReLoRA incrementally raises the effective rank of the cumulative update beyond what any single low-rank update could achieve.
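
Written out, this is a minimal sketch using the standard LoRA factorization of each update as a product of two thin matrices (the LoRA scaling factor is omitted here for brevity):

```latex
% Rank of a sum is at most the sum of the ranks:
%   rank(A + B) <= rank(A) + rank(B).
% With LoRA-style updates \Delta W_i = B_i A_i, each of rank at most r,
% merging N of them into the base weights W_0 gives
W_N = W_0 + \sum_{i=1}^{N} B_i A_i ,
\qquad
\operatorname{rank}\!\Big( \sum_{i=1}^{N} B_i A_i \Big) \le N r ,
% so the cumulative update can reach a rank well above the rank r of any
% single low-rank step, even though each step is cheap to train.
```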

Naively restarting LoRA after each merge interacts poorly with standard optimizers, so ReLoRA adapts the optimization procedure to its update process: at each merge it reinitializes the low-rank parameters, partially resets the optimizer states, and uses a jagged learning-rate schedule that briefly re-warms the learning rate after every reset. These modifications let the method keep making high-rank progress across restarts.
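
A rough PyTorch-style sketch of a single restart is shown below. It is an illustration of the description above, not the authors' implementation: the attribute names `lora_A`/`lora_B`, the helper name `relora_restart`, and the 0.99 reset fraction for the Adam moments are assumptions.

```python
import torch

def relora_restart(model, optimizer, reset_fraction=0.99):
    """One ReLoRA-style restart (illustrative sketch, not the authors' code):
    merge the low-rank factors into the base weights, reinitialize them,
    and zero out most of the Adam moments by magnitude."""
    for module in model.modules():
        # assumed attribute names for a LoRA-wrapped linear layer
        if hasattr(module, "lora_A") and hasattr(module, "lora_B"):
            with torch.no_grad():
                # merge: W <- W + B @ A   (LoRA scaling omitted for brevity)
                module.weight += module.lora_B @ module.lora_A
                # reinitialize the low-rank factors for the next cycle
                torch.nn.init.kaiming_uniform_(module.lora_A, a=5 ** 0.5)
                module.lora_B.zero_()

    # partial optimizer reset: zero (approximately) the smallest-magnitude
    # entries of the first and second Adam moments
    for state in optimizer.state.values():
        for key in ("exp_avg", "exp_avg_sq"):
            if key in state:
                buf = state[key]
                k = int(buf.numel() * reset_fraction)
                if k > 0:
                    cutoff = buf.abs().flatten().kthvalue(k).values
                    buf[buf.abs() <= cutoff] = 0.0
```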

Experimentation and Findings

ReLoRA was evaluated on transformer language models with up to 1.3 billion parameters. Despite training far fewer parameters for most of the run, it achieved performance comparable to full-rank training. It also saved up to 5.5 GB of GPU memory per device and improved training speed by 9-40%, depending on model size and hardware configuration.

Sustainable and Scalable AI Training

This method offers an economically attractive way to train large neural networks. By combining a full-rank warm-start phase with subsequent low-rank updates, ReLoRA delivers substantial memory savings and training speed-ups. The benefits are even more pronounced on less advanced hardware, which broadens the method's reach to a wider range of AI research groups.
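
As a rough end-to-end illustration of that recipe, the toy loop below trains a single layer at full rank for a warm-start phase and then switches to low-rank-only training with periodic merge-and-reinit restarts. The layer, objective, and step counts are all assumptions made for the sake of a runnable example, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Toy LoRA-wrapped linear layer (illustrative, not the authors' code)."""
    def __init__(self, dim, rank=8):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(dim, dim) / dim ** 0.5)
        self.lora_A = nn.Parameter(torch.empty(rank, dim))
        self.lora_B = nn.Parameter(torch.zeros(dim, rank))
        nn.init.kaiming_uniform_(self.lora_A, a=5 ** 0.5)

    def forward(self, x):
        # effective weight is the base matrix plus the low-rank update
        return x @ (self.weight + self.lora_B @ self.lora_A).T

layer = LoRALinear(dim=64)
opt = torch.optim.Adam(layer.parameters(), lr=1e-3)

WARM_START, RESTART_EVERY, TOTAL = 200, 200, 1000  # assumed step counts
for step in range(TOTAL):
    x = torch.randn(32, 64)
    loss = (layer(x) - 2 * x).pow(2).mean()        # toy regression objective
    opt.zero_grad(); loss.backward(); opt.step()

    if step == WARM_START:
        layer.weight.requires_grad_(False)          # switch to low-rank-only training
    if step > WARM_START and step % RESTART_EVERY == 0:
        with torch.no_grad():                       # merge and reinitialize (cf. restart sketch above)
            layer.weight += layer.lora_B @ layer.lora_A
            nn.init.kaiming_uniform_(layer.lora_A, a=5 ** 0.5)
            layer.lora_B.zero_()
```

In a real setup the same pattern would be applied to every linear layer of the transformer, with distributed training, optimizer-state resets, and the jagged learning-rate schedule layered on top.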

In conclusion, ReLoRA extends parameter-efficient techniques from fine-tuning to pre-training. As the research community continues to scale AI models, ReLoRA offers a promising pathway to more accessible and sustainable training, potentially changing how large neural networks are developed.

Authors (4)
  1. Vladislav Lialin
  2. Namrata Shivagunde
  3. Sherin Muckatira
  4. Anna Rumshisky
Citations (67)