- The paper introduces GaLore+, a novel method for significantly boosting the speed and efficiency of low-rank adaptation for large language models.
- GaLore+ uses cross-head projection and fast randomized SVD to reduce the computational cost associated with estimating projection matrices during training.
- Experiments show GaLore+ fine-tunes LLaMA2-7B up to four times faster than GaLore while achieving better accuracy than LoRA, DoRA, and the original GaLore.
GaLore+: Advancing Low-Rank Adaptation in LLMs with Cross-Head Projection
The paper "GaLore+: Boosting Low-Rank Adaptation for LLMs with Cross-Head Projection" addresses an imperative challenge within the domain of LLM fine-tuning, particularly for models exceeding billions of parameters, such as LLaMA2-7B and LLaMA2-13B. The focus is on enhancing the efficiency of low-rank adaptation methods, specifically by tackling the high computational costs associated with singular value decomposition (SVD) in existing frameworks like GaLore.
Key Innovations
- Cross-Head Low-Rank Projection: The paper introduces cross-head low-rank projection to cut the cost of estimating projection matrices in multi-head attention layers. By sharing the low-rank projection matrices across the query (or key) projections of multiple heads, the estimation cost drops from O(h³) to O(h), where h is the number of attention heads. The approach exploits the similarity between the query and key transformations of different heads, so a single SVD can serve all of them (see the first sketch after this list).
- Randomized Subspace Iteration for Fast SVD: GaLore+ further speeds up the SVD step with randomized subspace iteration, which lowers the cost of computing a rank-r approximation from O(mn·min(m, n)) to O(mn·log r). Besides accelerating computation, the method also reduces memory usage, making it suitable for fine-tuning large models on resource-constrained hardware (second sketch below).
- Sparsely Coded Residuals: To counteract the error introduced by the low-rank approximation, the authors maintain sparsely coded residuals that refine the estimates of the first- and second-order optimizer moments. Keeping only a sparse set of residual entries improves the fidelity of the weight updates without sacrificing memory efficiency (third sketch below).
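To make the cross-head idea concrete, here is a minimal sketch, not the paper's exact algorithm: it assumes per-head query (or key) gradient matrices of shape (d_head, d_model), stacks them, and estimates a single projection that every head reuses. The function name shared_head_projection and the concatenation-based aggregation are illustrative assumptions.

```python
import numpy as np

def shared_head_projection(head_grads, rank):
    """Estimate one low-rank projection shared by all attention heads.

    head_grads: list of h gradient matrices, one per query (or key) head,
                each of shape (d_head, d_model). Stacking them and taking a
                single SVD replaces the h per-head decompositions; the exact
                aggregation used in the paper may differ.
    """
    stacked = np.concatenate(head_grads, axis=0)        # (h * d_head, d_model)
    # One decomposition for all heads (exact SVD here for clarity;
    # GaLore+ would pair this with the randomized solver sketched next).
    _, _, Vt = np.linalg.svd(stacked, full_matrices=False)
    P = Vt[:rank].T                                     # shared projection, (d_model, rank)
    # Every head compresses its gradient with the same shared projection.
    return [g @ P for g in head_grads], P
```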
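Randomized subspace iteration itself is a standard algorithm (Halko, Martinsson & Tropp, 2011). The sketch below shows the generic form that GaLore+'s fast SVD builds on; the oversampling and iteration counts are illustrative defaults, not the paper's settings.

```python
import numpy as np

def randomized_svd(A, rank, n_oversample=10, n_iter=2, seed=0):
    """Rank-r SVD approximation via randomized subspace iteration."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    k = min(rank + n_oversample, min(m, n))
    # Sample the range of A with a Gaussian test matrix.
    omega = rng.standard_normal((n, k))
    Y = A @ omega
    # Subspace (power) iterations sharpen the spectrum; QR keeps them stable.
    for _ in range(n_iter):
        Y, _ = np.linalg.qr(Y)
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)               # orthonormal basis for the range of A
    B = Q.T @ A                          # small (k x n) projected matrix
    U_small, S, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :rank], S[:rank], Vt[:rank, :]
```

The expensive full decomposition is replaced by a QR factorization of a tall-skinny matrix and an SVD of a small k × n matrix, which is where the savings over a direct SVD come from.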
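One simple reading of the sparsely coded residuals is a top-k correction kept alongside the low-rank reconstruction. The sketch below reflects only that reading and assumes a right projection P of shape (d_model, rank); how the paper actually codes the residuals of the optimizer moments may differ.

```python
import numpy as np

def sparse_residual(G, P, k):
    """Keep the k largest-magnitude entries of the residual left over after
    projecting the gradient G onto the low-rank subspace spanned by P.

    The returned (indices, values) pair is an illustrative sparse code that
    could be used to correct the first- and second-order moment estimates.
    """
    low_rank = (G @ P) @ P.T                 # reconstruction from the rank-r subspace
    residual = G - low_rank
    flat = np.abs(residual).ravel()
    idx = np.argpartition(flat, -k)[-k:]     # positions of the k largest residual entries
    values = residual.ravel()[idx]
    return idx, values
```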
Empirical Evaluation
GaLore+ was evaluated on arithmetic reasoning tasks (GSM8K and MAWPS) and natural language generation tasks such as E2E. In these experiments it consistently surpassed LoRA, DoRA, and the original GaLore in both accuracy and training time; for example, it fine-tuned LLaMA2-7B roughly four times faster than vanilla GaLore, highlighting its computational efficiency gains.
Implications and Future Work
The advancements in GaLore+ have substantial implications for adapting LLMs in memory- and compute-constrained environments. By reducing both the time and the memory footprint of training, the approach makes fine-tuning LLMs more accessible, allowing smaller organizations and individual researchers to adapt large models effectively.
For future work, a theoretical analysis of the approximation error introduced by cross-head low-rank projection would be valuable. In addition, dynamically adjusting the rank and the update schedule of the sparse residual matrices could give finer control over the adaptation process. Such enhancements would strengthen the theoretical grounding of GaLore+ and extend its applicability to a broader range of model architectures and applications.