
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection (2403.03507v2)

Published 6 Mar 2024 in cs.LG

Abstract: Training LLMs presents significant memory challenges, predominantly due to the growing size of weights and optimizer states. Common memory-reduction approaches, such as low-rank adaptation (LoRA), add a trainable low-rank matrix to the frozen pre-trained weight in each layer, reducing trainable parameters and optimizer states. However, such approaches typically underperform training with full-rank weights in both the pre-training and fine-tuning stages, since they limit the parameter search to a low-rank subspace and alter the training dynamics, and, further, may require a full-rank warm start. In this work, we propose Gradient Low-Rank Projection (GaLore), a training strategy that allows full-parameter learning but is more memory-efficient than common low-rank adaptation methods such as LoRA. Our approach reduces memory usage by up to 65.5% in optimizer states while maintaining both efficiency and performance for pre-training on LLaMA 1B and 7B architectures on the C4 dataset with up to 19.7B tokens, and for fine-tuning RoBERTa on GLUE tasks. Our 8-bit GaLore further reduces optimizer memory by up to 82.5% and total training memory by 63.3%, compared to a BF16 baseline. Notably, we demonstrate, for the first time, the feasibility of pre-training a 7B model on consumer GPUs with 24GB memory (e.g., NVIDIA RTX 4090) without model parallelism, checkpointing, or offloading strategies.

Memory-Efficient LLM Training with Gradient Low-Rank Projection

Introduction to Gradient Low-Rank Projection (GaLore)

Training LLMs poses significant memory challenges due to the large size of the weights and optimizer states involved. Existing memory-reduction techniques often rely on low-rank adaptation methods such as LoRA, which reparameterizes each layer's weight matrix as the sum of its frozen original weight and a trainable low-rank matrix. Despite their efficacy in reducing the number of trainable parameters and associated optimizer states, these methods often underperform full-rank training in both the pre-training and fine-tuning stages. This limitation stems mainly from the restrictive nature of the low-rank parameterization and the way it alters training dynamics.
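For concreteness, LoRA's reparameterization can be written as $W = W_0 + BA$, where $W_0 \in \mathbb{R}^{m \times n}$ is the frozen pre-trained weight, $B \in \mathbb{R}^{m \times r}$ and $A \in \mathbb{R}^{r \times n}$ are trainable, and $r \ll \min(m, n)$, so only the low-rank factors and their optimizer states consume trainable-parameter memory.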

To address these challenges, we introduce Gradient Low-Rank Projection (GaLore), a training strategy designed for both pre-training and fine-tuning LLMs that is more memory-efficient than traditional low-rank methods. Unlike LoRA, which directly imposes a low-rank structure on the model weights, GaLore capitalizes on the inherently low-rank structure of gradient updates during training. This strategy enables full-parameter learning while significantly reducing memory consumption.

Theoretical Insights and Methodology

Our work starts by demonstrating theoretically that the backpropagated gradient matrix becomes increasingly low-rank as training progresses. This insight leads to the core idea of GaLore: projecting gradients into a low-rank subspace before applying optimizer updates. Specifically, for a weight matrix $W \in \mathbb{R}^{m \times n}$ at time step $t$, GaLore projects the gradient $G_t \in \mathbb{R}^{m \times n}$ with projection matrices $P_t \in \mathbb{R}^{m \times r}$ and $Q_t \in \mathbb{R}^{n \times r}$, yielding the low-rank form $R_t = P_t^\top G_t Q_t$. Consequently, only the low-rank projections of the gradients need to be stored in the optimizer states, resulting in substantial memory savings.
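To make the update concrete, here is a minimal single-matrix sketch of a GaLore-style Adam step in PyTorch. It uses a one-sided projection (only $P_t$, for $m \le n$), as the paper does in practice, and re-estimates the projection from an SVD of the current gradient every `update_proj_gap` steps. All names and defaults here are illustrative, not the authors' implementation.

```python
import torch

def galore_adam_step(W, G, state, rank=128, update_proj_gap=200,
                     lr=1e-3, scale=0.25, beta1=0.9, beta2=0.999, eps=1e-8):
    """One GaLore-style Adam update for a weight matrix W (m x n, m <= n)."""
    t = state["step"] = state.get("step", 0) + 1

    # Re-estimate the projection from the current gradient every
    # `update_proj_gap` steps (the paper's periodic subspace switching).
    if "P" not in state or (t - 1) % update_proj_gap == 0:
        U, _, _ = torch.linalg.svd(G.float(), full_matrices=False)
        state["P"] = U[:, :rank].to(G.dtype)           # m x r

    P = state["P"]
    R = P.t() @ G                                      # projected gradient, r x n

    # Adam moments live in the low-rank space: O(r*n) memory instead of O(m*n).
    M = state.setdefault("M", torch.zeros_like(R))
    V = state.setdefault("V", torch.zeros_like(R))
    M.mul_(beta1).add_(R, alpha=1 - beta1)
    V.mul_(beta2).addcmul_(R, R, value=1 - beta2)
    N = (M / (1 - beta1 ** t)) / ((V / (1 - beta2 ** t)).sqrt() + eps)

    W.add_(P @ N, alpha=-lr * scale)                   # project back and apply
```

In practice a step like this would run under `torch.no_grad()` once per 2D weight matrix, with `G = weight.grad`; the key point is that the Adam moments `M` and `V` are $r \times n$ rather than $m \times n$.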

Moreover, we provide a convergence analysis of GaLore under certain forms of gradient update rules, grounding its effectiveness in theory as well as in practice. Importantly, GaLore periodically re-estimates the projection matrices during training, so the low-rank subspace itself changes over time; this is what supports full-parameter learning without increasing the memory load.
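For reference, the authors' released `galore-torch` package wraps this logic in drop-in optimizers. The sketch below follows its README; the exact parameter-group keys (`rank`, `update_proj_gap`, `scale`, `proj_type`) should be checked against the installed version.

```python
import torch
from galore_torch import GaLoreAdamW  # pip install galore-torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 2048), torch.nn.GELU(), torch.nn.Linear(2048, 512)
)

# GaLore applies to 2D weight matrices; biases and norms stay in a regular group.
galore_params = [p for p in model.parameters() if p.dim() == 2]
regular_params = [p for p in model.parameters() if p.dim() != 2]

optimizer = GaLoreAdamW(
    [
        {"params": regular_params},
        {"params": galore_params,
         "rank": 128,             # projection rank r
         "update_proj_gap": 200,  # steps between projection re-estimates
         "scale": 0.25,           # scaling factor for the projected update
         "proj_type": "std"},
    ],
    lr=1e-2,
)
```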

Experimental Results

We thoroughly evaluate GaLore on LLaMA-based models ranging from 60M to 7B parameters, using the C4 dataset for pre-training. Our findings indicate that GaLore closely matches the performance of full-rank training while significantly reducing memory usage, and that it compares favorably with low-rank adaptation methods such as LoRA and ReLoRA. In particular, for a 7B-parameter model, GaLore combined with 8-bit optimizer states and layer-wise weight updates substantially reduces training memory relative to full-rank training without sacrificing training effectiveness.

Notably, the memory savings enabled by 8-bit GaLore make it feasible to pre-train a 7B-parameter model on a consumer GPU with 24 GB of memory, such as an NVIDIA RTX 4090, without model parallelism, checkpointing, or offloading, demonstrating its practical utility for large-scale LLM training in memory-constrained environments.
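A back-of-envelope calculation shows why this matters. The numbers below are illustrative assumptions (BF16 weights at 2 bytes each, Adam with two moment tensors) and ignore activations and gradients.

```python
# Rough memory math for a 7B-parameter model (illustrative, ignores activations).
params = 7e9
bf16_bytes = 2

weights_gb = params * bf16_bytes / 1e9          # ~14 GB of BF16 weights
adam_states_gb = 2 * params * bf16_bytes / 1e9  # ~28 GB of full-rank Adam moments
print(weights_gb, adam_states_gb)               # 14.0 28.0 -> ~42 GB before
                                                # gradients, far above 24 GB

# GaLore keeps moments only for the rank-r projected gradient (r x n per
# m x n matrix), and the 8-bit variant quantizes them further; the paper
# reports up to 82.5% optimizer-state savings, which is what brings a 7B
# model within a 24 GB GPU budget.
```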

Concluding Thoughts and Future Directions

GaLore exemplifies a novel approach to memory-efficient LLM training that exploits the low-rank structure of gradient updates. Its effectiveness in both pre-training and fine-tuning contexts marks a notable step towards reducing the computational and environmental costs of LLM training. Looking forward, promising directions include more memory-efficient projection matrices and extending GaLore to other model architectures and optimization strategies.

References (34)
  1. Memory Efficient Adaptive Optimization.
  2. Continual Learning in Low-rank Orthogonal Subspaces. In Advances in Neural Information Processing Systems, volume 33, pp.  9900–9911. Curran Associates, Inc., 2020.
  3. Non-Convex Projected Gradient Descent for Generalized Low-Rank Tensor Regression. Journal of Machine Learning Research, 20(5):1–37, 2019. ISSN 1533-7928.
  4. Training Deep Nets with Sublinear Memory Cost, April 2016.
  5. Fast low-rank estimation by projected gradient descent: General statistical and algorithmic guarantees, September 2015.
  6. PaLM: Scaling Language Modeling with Pathways, October 2022.
  7. 8-bit Optimizers via Block-wise Quantization. arXiv:2110.02861 [cs], October 2021.
  8. 8-bit Optimizers via Block-wise Quantization, June 2022.
  9. QLoRA: Efficient Finetuning of Quantized LLMs, May 2023.
  10. Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models, March 2022.
  11. Gradient Descent Happens in a Tiny Subspace, December 2018.
  12. LoRA: Low-Rank Adaptation of Large Language Models, October 2021.
  13. Exploring Low Rank Training of Deep Neural Networks, September 2022.
  14. Adam: A Method for Stochastic Optimization. arXiv:1412.6980 [cs], December 2014.
  15. How many degrees of freedom do we need to train deep networks: A loss landscape perspective, February 2022.
  16. Gradient-Based Meta-Learning with Learned Layerwise Metric and Subspace, June 2018.
  17. Memory Efficient Optimizers with 4-bit States. https://arxiv.org/abs/2309.01507v3, September 2023.
  18. ReLoRA: High-Rank Training Through Low-Rank Updates, December 2023.
  19. Dynamic Mini-batch SGD for Elastic Distributed Training: Learning in the Limbo of Resources. URL http://arxiv.org/abs/1904.12043.
  20. Decoupled Weight Decay Regularization, January 2019.
  21. Full Parameter Fine-tuning for Large Language Models with Limited Resources, June 2023.
  22. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, September 2023.
  23. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, May 2020.
  24. Tied-LoRA: Enhancing parameter efficiency of LoRA with weight tying, November 2023.
  25. Shazeer, N. GLU Variants Improve Transformer, February 2020.
  26. Adafactor: Adaptive Learning Rates with Sublinear Memory Cost.
  27. S-LoRA: Serving Thousands of Concurrent LoRA Adapters, November 2023.
  28. Understanding self-supervised learning with dual deep networks. arXiv preprint arXiv:2010.00578, 2020.
  29. JoMA: Demystifying Multilayer Transformers via Joint Dynamics of MLP and Attention. ICLR, 2024.
  30. Llama 2: Open Foundation and Fine-Tuned Chat Models, July 2023.
  31. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding, February 2019.
  32. MultiLoRA: Democratizing LoRA for Better Multi-Task Learning, November 2023.
  33. Chain of LoRA: Efficient Fine-tuning of Language Models via Residual Learning, January 2024.
  34. Root Mean Square Layer Normalization, October 2019.
Authors (6)
  1. Jiawei Zhao (30 papers)
  2. Zhenyu Zhang (249 papers)
  3. Beidi Chen (61 papers)
  4. Zhangyang Wang (374 papers)
  5. Anima Anandkumar (236 papers)
  6. Yuandong Tian (128 papers)
Citations (118)