GaLore 2: Large-Scale LLM Pre-Training by Gradient Low-Rank Projection (2504.20437v1)

Published 29 Apr 2025 in cs.LG and cs.AI

Abstract: LLMs have revolutionized natural language understanding and generation but face significant memory bottlenecks during training. GaLore, Gradient Low-Rank Projection, addresses this issue by leveraging the inherent low-rank structure of weight gradients, enabling substantial memory savings without sacrificing performance. Recent works further extend GaLore from various aspects, including low-bit quantization and higher-order tensor structures. However, there are several remaining challenges for GaLore, such as the computational overhead of SVD for subspace updates and the integration with state-of-the-art training parallelization strategies (e.g., FSDP). In this paper, we present GaLore 2, an efficient and scalable GaLore framework that addresses these challenges and incorporates recent advancements. In addition, we demonstrate the scalability of GaLore 2 by pre-training Llama 7B from scratch using up to 500 billion training tokens, highlighting its potential impact on real LLM pre-training scenarios.

Summary

  • The paper introduces gradient low-rank projection to significantly cut optimizer and gradient memory, enabling training of Llama 7B on 500B tokens with modest hardware.
  • The paper details algorithmic enhancements such as randomized SVD for fast subspace updates and seamless FSDP integration to maintain model quality in distributed environments.
  • The paper demonstrates empirical parity with full-rank optimizers, with improved semantic similarity in certain tasks and robust convergence in large-scale pre-training.

GaLore 2: Large-Scale LLM Pre-Training with Gradient Low-Rank Projection

GaLore 2 systematically addresses the core challenge of reducing memory bottlenecks in LLM pre-training by leveraging and extending the principle of gradient low-rank projection. The main contributions are (1) algorithmic improvements for subspace updates and SVD computation, (2) seamless integration with modern distributed training strategies, and (3) empirical validation at considerable scale (training Llama 7B on 500B tokens). The results demonstrate that large-scale gradient low-rank projection can reduce optimizer state and gradient memory by a substantial margin, enabling practical training on modestly sized hardware without noticeable loss in model quality—even for the most demanding downstream tasks.

Gradient Low-Rank Projection: Mechanism and Memory Analysis

The central technique in both the original and enhanced GaLore is the projection of dense layer gradients $G_t$ onto a low-rank subspace defined by a projection matrix $P_t$, typically derived via SVD. In the context of matrix-valued model parameters $W_t \in \mathbb{R}^{m \times n}$, the gradient projection step and update are as follows (for $m \leq n$):

import numpy as np

# One GaLore-style Adam step for a matrix parameter W of shape (m, n), with m <= n.
G_t = compute_gradient(W_t_1)                     # full-rank gradient of the loss w.r.t. W

U, S, Vt = np.linalg.svd(G_t, full_matrices=False)
P_t = U[:, :r]                                    # retain the r leading left singular vectors

R_t = P_t.T @ G_t                                 # project the gradient into the r x n subspace

M_t = beta1 * M_t_1 + (1 - beta1) * R_t           # first Adam moment, kept at rank r
V_t = beta2 * V_t_1 + (1 - beta2) * (R_t ** 2)    # second Adam moment, kept at rank r
N_t = M_t / (np.sqrt(V_t) + epsilon)              # Adam-normalized low-rank update

G_update = alpha * P_t @ N_t                      # lift the update back to the full m x n space
W_t = W_t_1 - eta * G_update                      # descent step on the parameters

This procedure replaces the $O(mn)$ memory required to store the first- and second-order Adam moments with $O(nr)$, plus an additional $O(mr)$ for $P_t$. For large $m, n$ and moderate $r$, this trade-off is highly favorable, allowing, e.g., a 7B-parameter model to be trained on a 24GB consumer GPU.
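
For a rough sense of the scale of this saving, the following back-of-the-envelope sketch compares per-layer fp32 optimizer-state memory for an illustrative layer shape and rank (the numbers are assumptions for illustration, not configurations reported in the paper):

# Illustrative per-layer comparison of fp32 optimizer-state memory (4 bytes per element).
m, n, r = 4096, 11008, 1024                               # hypothetical layer shape and GaLore rank
bytes_per_elem = 4

adam_states = 2 * m * n * bytes_per_elem                  # full-rank first and second moments
galore_states = (2 * n * r + m * r) * bytes_per_elem      # rank-r moments plus the projector P

print(f"Adam states:   {adam_states / 2**20:.1f} MiB")    # ~344 MiB
print(f"GaLore states: {galore_states / 2**20:.1f} MiB")  # ~102 MiB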

Fast Randomized SVD for Subspace Updates

The original GaLore's SVD-based subspace computation presents a significant bottleneck for wide matrices, especially at scale. GaLore 2 adopts randomized SVD algorithms (cf. Halko et al., 2011) to approximate the truncated SVD efficiently, yielding empirical speedups of up to 15x relative to standard SVD without measurable degradation in downstream model quality. Projectors computed in this fashion maintain adequate spectral fidelity, with ablation results showing clear degradation only for very coarse (random or extremely quantized) approximations.
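
As a rough sketch of the idea, here is a minimal randomized range finder in the spirit of Halko et al. (2011); the oversampling and power-iteration defaults are illustrative, not the paper's exact implementation:

import numpy as np

def randomized_projector(G, r, oversample=8, n_iter=2, seed=0):
    # Approximate the top-r left singular vectors of G (shape m x n, m <= n) with a
    # randomized range finder in the spirit of Halko et al. (2011). The oversampling
    # and power-iteration counts are illustrative defaults, not the paper's settings.
    rng = np.random.default_rng(seed)
    m, n = G.shape
    k = min(r + oversample, m)

    Omega = rng.standard_normal((n, k))          # random test matrix
    Y = G @ Omega                                # sample the range of G
    for _ in range(n_iter):                      # power iterations sharpen the leading spectrum
        Y = G @ (G.T @ Y)
    Q, _ = np.linalg.qr(Y)                       # orthonormal basis for the sampled range

    U_small, _, _ = np.linalg.svd(Q.T @ G, full_matrices=False)
    return (Q @ U_small)[:, :r]                  # approximate projector P_t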

GaLore 2 empirically verifies that randomized SVD and moderate-frequency subspace updates avoid the instability caused by sign indeterminacy and the stochasticity of randomized SVD output: when the subspace update interval $T$ is around 200–500 steps, sign flips are washed out by gradient drift.
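
Schematically, the refresh schedule might look like the loop below; this is a minimal sketch that reuses the randomized_projector sketch above, while compute_gradient and adam_update_low_rank are illustrative placeholder helpers and the hyperparameters are not the paper's settings:

T = 250                                          # subspace update interval, in the 200-500 step range
P = None

for step in range(num_steps):
    G = compute_gradient(W)                      # full-rank gradient (assumed helper)
    if step % T == 0:
        P = randomized_projector(G, r)           # refresh the low-rank subspace every T steps
    R = P.T @ G                                  # project into the current subspace
    N = adam_update_low_rank(R)                  # Adam moments maintained at rank r (assumed helper)
    W = W - eta * alpha * (P @ N)                # lift back and take a descent step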

Extensions: Quantization and Tensor Projections

Incorporation of the Q-GaLore and Tensor-GaLore variants enables further reductions in memory and compute by supporting low-bit quantized projectors and gradient representations, and by generalizing the projection to higher-order parameter tensors, including convolutional and attention layers. GaLore 2 is also compatible with the bitsandbytes 8-bit Adam optimizer, which allows all gradient statistics to be maintained in a memory-efficient low-precision format.
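
As a rough, generic illustration of the kind of low-bit projector storage these variants rely on, the sketch below uses a plain symmetric int8 scheme; it is not the exact quantizer used by Q-GaLore or bitsandbytes:

import numpy as np

def quantize_int8(P):
    # Symmetric per-tensor int8 quantization of a projector: a generic sketch,
    # not the quantization scheme used by Q-GaLore or bitsandbytes.
    scale = max(np.abs(P).max() / 127.0, 1e-12)   # guard against an all-zero tensor
    q = np.clip(np.round(P / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale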

Distributed Training: FSDP Integration

GaLore 2 supports PyTorch FSDP (Fully Sharded Data Parallel), addressing a key limitation of the original GaLore in distributed environments. The integration applies per-layer low-rank updates after reduce-scatter and replicates SVD results efficiently across devices, maximizing sharding efficiency. A head-to-head comparison with FSDP+AdamW on Llama3 8B shows GaLore+FSDP reducing per-GPU memory from 77.64GB to 72.84GB at a sequence length of 2048. At a sequence length of 4096, baseline AdamW+FSDP exceeds device memory, while GaLore+FSDP remains at 77.45GB, indicating practical viability for extreme-scale training on standard clusters.
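
For orientation only, here is a heavily hedged skeleton of where the projection conceptually sits in an FSDP training step; apply_galore_low_rank_step is a hypothetical placeholder standing in for GaLore 2's actual post-reduce-scatter hook, and build_model and dataloader are assumed helpers whose forward pass returns the loss:

import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
model = FSDP(build_model().cuda())                # build_model() is an assumed helper

for batch in dataloader:                          # dataloader is assumed to exist
    loss = model(**batch)                         # forward pass assumed to return the loss
    loss.backward()                               # FSDP reduce-scatters gradients into per-rank shards
    # Hypothetical placeholder: GaLore 2 projects each layer's gradient shard into its
    # low-rank subspace after reduce-scatter and applies the Adam update there, reusing a
    # projector that is kept consistent across ranks.
    apply_galore_low_rank_step(model)
    model.zero_grad(set_to_none=True)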

Large-Scale Pretraining Results

Pretraining experiments on Llama 7B with 500B tokens on 256×H100 GPUs demonstrate robust convergence for GaLore, closely matching the validation loss curve of the 8-bit Adam baseline. Notably, while GaLore initially exhibits slightly slower convergence, it surpasses the baseline during the mid-stage of training, then closely tracks the baseline for the remainder. By the end of training, both methods reach equivalent validation perplexity.

Numerically, GaLore achieves functional parity with full-rank baselines across all downstream tasks, with some notable differences:

  • Language Understanding and Reasoning: Average 0.37 for both models.
  • Commonsense & Contextual Reasoning: Baseline 0.41 vs GaLore 0.40 (marginal deficit).
  • Paraphrase & Semantic Similarity: GaLore 0.67, baseline 0.64 (stronger performance for GaLore).
  • Truthfulness & Factual Accuracy: 0.30 both.
  • Academic & Professional Exams: 0.24 both.

These evaluations confirm that low-rank gradient projection does not compromise model performance, and, in some semantic similarity tasks, may even offer gains.

Implementation and Deployment Considerations

Hyperparameters & Robustness: GaLore is robust with respect to the learning rate ($\alpha = 0.125$ for rank 1024) and does not require model-size-dependent tuning, unlike many optimizer alternatives.

Compute Overhead: The dominant cost is subspace SVD; replacing full SVD with randomized SVD is essential for scaling to high parameter counts and frequent subspace updates.

Deployment: GaLore reduces the hardware requirement for LLM training and enables experimentation with larger batch sizes, sequence lengths, or architectures (e.g., increasing parameter count, windowed attention) within a fixed memory budget. Integration with FSDP and bitsandbytes ensures bottlenecks in both single-node and multi-node deployments are addressed.

Limitations: Performance degradation is observed with excessively aggressive subspace approximation (random or highly quantized projectors), imposing a lower bound on projector fidelity. The technique is tied to the algebraic structure of the gradient and may require adaptation for non-matrix-valued layers.

Implications and Future Directions

GaLore 2 empirically validates the practical benefits of gradient low-rank projection for LLM training, demonstrating that substantial reduction in optimizer and gradient memory is possible without negative impact on model quality. Its scalable implementation with randomized SVD and integration into FSDP closes the gap between memory-frugal algorithms and scalable, distributed pre-training.

This line of research opens possibilities for further advances:

  • Extending low-rank projection techniques to additional optimizer-state compression (beyond gradients and moments)
  • Co-designing model architectures and training pipelines around low-rank update assumptions
  • Integrating more aggressive quantization in projectors and moment storage for even broader hardware support
  • Exploring low-rank projection for continual and online learning scenarios, where memory is severely constrained
  • Adapting projection techniques for parameter-efficient fine-tuning methods (beyond LoRA/ReLoRA)

GaLore 2 provides a concrete foundation for democratizing large-scale model training by changing the fundamental trade-off between parameter count, sequence length, and available memory. Its demonstrated parity with established optimizers—at a fraction of the memory cost—will likely drive adoption and future research into principled, structure-aware memory optimization for large neural architectures.
