Robust Adaptation (RoSA): Accurate Parameter-Efficient Fine-Tuning for LLMs
The paper "RoSA: Accurate Parameter-Efficient Fine-Tuning via Robust Adaptation" presents a method for parameter-efficient fine-tuning (PEFT) of LLMs. Its primary aim is to improve fine-tuning accuracy while keeping computational and memory costs low under constrained resources. The key contribution is Robust Adaptation (RoSA), a PEFT method that combines low-rank and sparse components to approach the accuracy of full fine-tuning (FFT).
Problem Context and Motivation
Training LLMs from scratch is highly resource-intensive, requiring vast computational and memory resources. Fine-tuning these models for specific tasks by adjusting only a small subset of parameters has therefore become popular due to its cost efficiency. However, current PEFT methods, such as Low-Rank Adaptation (LoRA), often fall short of FFT accuracy, particularly on more complex tasks. This raises the fundamental question: can a PEFT method combine the simplicity and practicality of LoRA-type methods with the high accuracy of FFT?
Core Contribution: RoSA
RoSA addresses the aforementioned shortcoming by leveraging the principles of robust principal component analysis (RPCA). Traditional LoRA methods assume a low "intrinsic rank" of parameter updates, which can be insufficient for complex tasks. Through a detailed investigation, the authors discover that a combination of low-rank and sparse matrices offers a significantly better approximation. RoSA's novelty lies in its dual approach, concurrently training low-rank and sparse components.
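As a minimal illustration of this low-rank-plus-sparse idea (a sketch of the RPCA intuition, not the paper's exact fitting procedure; the function name and parameters below are illustrative), one can approximate a dense update ΔW by a truncated SVD for the low-rank part and keep only the largest-magnitude residual entries as the sparse part:

```python
import numpy as np

def low_rank_plus_sparse(delta_w, rank, density):
    """Approximate delta_w as L + S: truncated SVD for the low-rank
    part L, then keep only the largest-magnitude residual entries
    as the sparse part S (illustrative sketch)."""
    u, s, vt = np.linalg.svd(delta_w, full_matrices=False)
    L = (u[:, :rank] * s[:rank]) @ vt[:rank]        # best rank-r approximation
    resid = delta_w - L
    k = int(density * delta_w.size)                 # sparse-budget size
    # zero all but the k largest-magnitude residual entries
    thresh = np.partition(np.abs(resid).ravel(), -k)[-k]
    S = np.where(np.abs(resid) >= thresh, resid, 0.0)
    return L, S
```

On updates that are mostly low-rank with a few large outlier entries, the combined L + S approximation recovers the outliers that a purely low-rank fit misses.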
Methodology
- Dual Adaptation Components: RoSA optimizes a parameter perturbation of the form ΔW = L + S, training both parts jointly:
- Low-Rank Component (L): a product of two narrow matrices, trained similarly to LoRA.
- Sparse Component (S): a highly sparse matrix that captures the outlier entries missed by the low-rank approximation.
- System Implementation: The authors designed sparse GPU kernels to support the training framework efficiently. These kernels handle the sparse matrix operations and allow the base weights to be kept in low precision, yielding substantial savings in both memory and compute.
- Quantization Compatibility: RoSA's formulation is extended to integrate with 4-bit quantized weights (QRoSA), further enhancing its memory efficiency without hampering accuracy.
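Conceptually, the adapted layer computes y = x(W + BA + S), with the base weights W frozen and only the low-rank factors and sparse values trainable. A dense NumPy sketch is below (names are illustrative; the paper's implementation uses custom sparse GPU kernels rather than densifying S as done here):

```python
import numpy as np

def rosa_forward(x, W, B, A, s_rows, s_cols, s_vals):
    """Compute y = x @ (W + B @ A + S).

    W is the frozen (possibly quantized) base weight; B @ A is the
    low-rank adapter and S the sparse adapter, stored as coordinates
    (s_rows, s_cols) plus trainable values s_vals."""
    y = x @ W                  # frozen base path
    y += (x @ B) @ A           # low-rank adapter path
    S = np.zeros_like(W)       # sketch only: real kernels keep S sparse
    S[s_rows, s_cols] = s_vals
    y += x @ S                 # sparse adapter path
    return y
```

Because the three paths are summed, the base weights never need to be materialized in full precision alongside a dense update, which is where the memory savings come from.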
Experimental Evaluation
The paper's experimental setup is extensive, targeting a variety of challenging generative tasks, including grade-school math questions (GSM8k), SQL query generation, and the ViGGO data-to-text generation benchmark. The evaluations span single-epoch and extended training scenarios, comparing RoSA with LoRA, sparse adaptation (SpA), and FFT.
Key Findings
- Accuracy Performance: RoSA consistently outperforms both LoRA and SpA under comparable parameter budgets. Notably, on several tasks RoSA matches or nearly matches FFT accuracy, showcasing its robustness and efficiency.
- Efficiency: The system support ensures that RoSA trains within roughly the same memory budget as LoRA, making it practical for real-world applications. The sparse component's support is chosen via a gradient-based mask selection mechanism, which further improves parameter efficiency.
- Quantized Weights: In QRoSA experiments, the method exhibits competitive, if not superior, performance compared to other adaptations, highlighting its suitability for memory-constrained environments.
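The gradient-based mask selection mentioned above can be sketched as follows (an illustrative reading, with hypothetical names: accumulate gradient magnitudes over a few warm-up steps, then fix the sparse support at the top-scoring coordinates):

```python
import numpy as np

def select_sparse_mask(grad_samples, density):
    """Choose the sparse adapter's support from gradient magnitudes
    accumulated over a few warm-up steps (illustrative sketch)."""
    score = sum(np.abs(g) for g in grad_samples)
    k = int(density * score.size)
    # threshold at the k-th largest accumulated magnitude
    thresh = np.partition(score.ravel(), -k)[-k]
    return score >= thresh    # boolean mask fixing S's support
```

Once the mask is fixed, only the values at the selected coordinates are trained, so the sparse adapter's parameter count stays at the chosen density throughout fine-tuning.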
Theoretical and Practical Implications
From a theoretical standpoint, RoSA bridges robust PCA principles and the practical requirements of LLM adaptation. Its hybrid approach scales with task complexity, suggesting that the budget split between low-rank and sparse capacity should itself adapt to the task.
Practically, RoSA's implementation on top of widely used libraries such as PyTorch makes efficient fine-tuning broadly accessible. This could pave the way for adapting capable LLMs on consumer-grade GPUs, broadening the applicability of sophisticated AI models.
Future Directions
The findings open several avenues for future research. Investigating the interplay of RoSA with different neural network architectures, extending it to other PEFT scenarios, and refining the system-level optimization can provide deeper insights. Furthermore, expanding RoSA's utility in few-shot and zero-shot learning contexts, where parameter efficiency is critical, would be a particularly interesting direction.
In conclusion, RoSA represents a significant step toward closing the accuracy gap between parameter-efficient methods and full fine-tuning of LLMs. Its blend of theoretical grounding and practical efficiency offers a robust framework for advancing the state of fine-tuning, with promising implications for both research and application.