Memory-Efficient LLM Training with Online Subspace Descent (2408.12857v1)

Published 23 Aug 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Recently, a wide range of memory-efficient LLM training algorithms have gained substantial popularity. These methods leverage the low-rank structure of gradients to project optimizer states into a subspace using a projection matrix found by singular value decomposition (SVD). However, convergence of these algorithms is highly dependent on the update rules of their projection matrix. In this work, we provide the first convergence guarantee for arbitrary update rules of the projection matrix. This guarantee is generally applicable to optimizers that can be analyzed with Hamiltonian Descent, including most common ones such as LION and Adam. Inspired by our theoretical understanding, we propose Online Subspace Descent, a new family of subspace descent optimizers without SVD. Instead of updating the projection matrix with eigenvectors, Online Subspace Descent updates the projection matrix with online PCA. Online Subspace Descent is flexible and introduces only minimal overhead to training. We show that for the task of pretraining LLaMA models ranging from 60M to 7B parameters on the C4 dataset, Online Subspace Descent achieves lower perplexity and better downstream task performance than state-of-the-art low-rank training methods across different settings and narrows the gap with full-rank baselines.

Citations (2)

Summary

  • The paper introduces Online Subspace Descent, offering the first convergence guarantee for arbitrary projection matrix updates within a Hamiltonian Descent framework.
  • It replaces expensive SVD with an online PCA approach, enabling efficient and adaptive projection updates in LLM training.
  • Empirical results on LLaMA models demonstrate significant perplexity reduction and faster execution compared to traditional low-rank methods.

Memory-Efficient LLM Training with Online Subspace Descent

Introduction

The recent proliferation of LLMs has prompted intensified research into optimizing memory efficiency during training. This paper introduces an innovative approach toward this goal, termed Online Subspace Descent. Traditional methods like Stochastic Subspace Descent and LoRA have leveraged the low-rank structure of gradients, projecting optimizer states into a subspace via singular value decomposition (SVD). However, a gap remains in guaranteeing convergence for arbitrary update rules of the projection matrix, especially for non-convex function optimization. This paper bridges that gap while proposing an efficient alternative to SVD.
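
For context, here is a minimal sketch of the SVD-based projection step used by GaLore-style methods. It is illustrative only; the tensor shapes and the cadence at which the projection is refreshed are assumptions, not the paper's exact recipe.

    import torch

    def svd_projection(grad: torch.Tensor, rank: int) -> torch.Tensor:
        # grad is the m x n gradient of one weight matrix; return an m x r
        # matrix whose columns are the top-r left singular vectors of grad.
        U, _, _ = torch.linalg.svd(grad, full_matrices=False)
        return U[:, :rank]

    # Optimizer states are then kept in the r-dimensional subspace: the
    # projected gradient P.T @ grad is r x n rather than m x n, which is
    # where the optimizer-state memory saving comes from.

The cost that Online Subspace Descent targets is precisely this periodic full SVD.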

Methodology

Convergence Guarantee in Hamiltonian Descent Framework

The authors first establish a theoretical foundation by providing the first convergence guarantee for arbitrary update rules of the projection matrix. This convergence guarantee is analyzed within the Hamiltonian Descent framework and applies to widely-used optimizers such as LION and Adam. This theoretical underpinning is significant because it extends beyond specific or narrowly defined update rules, demonstrating robustness across a variety of optimizers.
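
To make the framework concrete, the following is a minimal, illustrative instance of a Hamiltonian Descent argument for plain momentum descent in continuous time; it is not the paper's general theorem, which handles arbitrary projection updates and optimizers such as Adam and LION. One exhibits a Hamiltonian that the dynamics can only decrease:

    H(x, m) = f(x) + \tfrac{1}{2}\|m\|^2,
    \qquad \dot{x} = m, \qquad \dot{m} = -\nabla f(x) - \gamma m,

    \frac{d}{dt} H(x, m) = \nabla f(x)^\top \dot{x} + m^\top \dot{m} = -\gamma \|m\|^2 \le 0.

Roughly speaking, the paper's analysis constructs a Lyapunov function of this kind for subspace-descent dynamics whose decrease does not depend on how the projection matrix is updated, which is what makes the guarantee hold for arbitrary update rules.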

Online Subspace Descent without SVD

Drawing from this theoretical understanding, the paper proposes Online Subspace Descent—an optimizer family that eschews the expensive SVD computation in favor of online principal component analysis (PCA). This dynamic, low-overhead method allows the projection matrix to adapt fluidly throughout the training process in response to the changing gradient landscape. Such adaptation is more aligned with the progressive nature of deep learning, where model requirements evolve over time.
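
As an illustration of the idea (a sketch under our own assumptions, not the authors' code), one simple online-PCA-style rule takes a single gradient step on the reconstruction loss ||G_t - P P^T G_t||_F^2 with the current stochastic gradient G_t, optionally re-orthonormalizing P:

    import torch

    def online_pca_step(P: torch.Tensor, grad: torch.Tensor, lr: float = 1e-3) -> torch.Tensor:
        # One cheap update of the m x r projection matrix P: a single gradient
        # step on the reconstruction loss ||G - P P^T G||_F^2 for the current
        # gradient G, instead of recomputing a full SVD.
        P = P.detach().clone().requires_grad_(True)
        G = grad.detach()
        loss = torch.norm(G - P @ (P.T @ G)) ** 2
        loss.backward()
        with torch.no_grad():
            P_new = P - lr * P.grad
        # Optional: keep P well-conditioned by re-orthonormalizing its columns.
        P_new, _ = torch.linalg.qr(P_new)
        return P_new.detach()

Because each step reuses the gradient that training already computes, the projection can track the shifting dominant subspace at a small, constant per-step cost.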

Algorithmic Details

An algorithm listing (Algorithm 1) lays out the Online Subspace Descent mechanism. Its notable feature is a dual-optimizer approach: a principal optimizer (e.g., Adam) updates the model weights, while a secondary optimizer updates the projection matrix. This division lets the two updates proceed with minimal interference, enhancing computational parallelism and efficiency.
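
A schematic of that loop is sketched below. This is our own simplified version, reusing the online_pca_step helper from the previous sketch; the Adam bookkeeping and learning rates are illustrative assumptions rather than the paper's exact Algorithm 1. The main optimizer works on the projected gradient and maps its update back through P, while the secondary optimizer refreshes P.

    import torch

    def online_subspace_descent_step(W, state, P, grad,
                                     lr_w=1e-3, lr_p=1e-3,
                                     betas=(0.9, 0.999), eps=1e-8):
        # One schematic step for a single m x n weight matrix W with an m x r
        # projection P; the Adam moments live in the r x n subspace.
        g_low = P.T @ grad                                   # project the gradient
        t = state["t"] + 1
        m = betas[0] * state["m"] + (1 - betas[0]) * g_low
        v = betas[1] * state["v"] + (1 - betas[1]) * g_low ** 2
        m_hat, v_hat = m / (1 - betas[0] ** t), v / (1 - betas[1] ** t)
        W = W - lr_w * (P @ (m_hat / (v_hat.sqrt() + eps)))  # map update back to full space
        P = online_pca_step(P, grad, lr=lr_p)                # secondary optimizer refreshes P
        state.update(m=m, v=v, t=t)
        return W, state, P

Note that only the r x n moments and the m x r projection are stored per weight matrix, which is the source of the memory savings relative to full-rank Adam.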

Experimental Results

Performance on LLaMA Models

Empirical results underscore the efficacy of Online Subspace Descent in closing the performance gap between state-of-the-art low-rank methods and full-rank baselines. The paper reports substantial reductions in perplexity for LLaMA models of various sizes. For instance, the perplexity for a 1B parameter LLaMA model (sequence length of 256) improves significantly compared to prior methods like GaLore.

Moreover, the efficiency of Online Subspace Descent is evident in practical execution times. As highlighted in Figure 2, a typical PyTorch SVD call is significantly slower than a single online PCA step. As a result, the projection-matrix updates in Online Subspace Descent can be performed alongside the weight updates, keeping training seamless without substantial overhead.
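
As a rough way to reproduce that qualitative comparison, one could run a micro-benchmark like the one below. The shapes are assumed for illustration, and absolute timings will vary widely with hardware and backend.

    import time
    import torch

    m, n, r = 4096, 4096, 128
    G = torch.randn(m, n)
    P = torch.randn(m, r, requires_grad=True)

    t0 = time.perf_counter()
    torch.linalg.svd(G, full_matrices=False)          # GaLore-style full SVD
    t_svd = time.perf_counter() - t0

    t0 = time.perf_counter()
    loss = torch.norm(G - P @ (P.T @ G)) ** 2         # one online-PCA step
    loss.backward()
    with torch.no_grad():
        P -= 1e-3 * P.grad
    t_pca = time.perf_counter() - t0

    print(f"full SVD: {t_svd:.3f}s   one online-PCA step: {t_pca:.3f}s")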

Implications and Future Directions

The practical implications of Online Subspace Descent are multi-faceted:

  1. Scalability: Suitable for scaling to larger model architectures without incurring prohibitive computational costs.
  2. Resource Efficiency: Reduces optimizer-state memory, making it feasible to train larger and more complex models on existing hardware.

The work opens several avenues for future research. Primarily, there is an opportunity to explore alternative projection-matrix update rules that could further accelerate convergence. Incorporating weight decay into the convergence analysis is another intriguing direction. Beyond language modeling, the principles of Online Subspace Descent could potentially be adapted to other domains, broadening the applicability of the approach.

Conclusion

The paper "Memory-Efficient LLM Training with Online Subspace Descent" contributes a theoretically sound and practically efficient method for memory-optimized training of LLMs. By leveraging the Hamiltonian Descent framework, the authors provide robust convergence guarantees while innovatively employing online PCA to circumvent the computational overhead associated with SVD. This novel approach not only improves training efficiency and performance but also sets the stage for further research into memory-efficient deep learning methodologies.