
PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models (2404.02948v3)

Published 3 Apr 2024 in cs.LG and cs.AI

Abstract: To parameter-efficiently fine-tune (PEFT) LLMs, the low-rank adaptation (LoRA) method approximates the model changes $\Delta W \in \mathbb{R}^{m \times n}$ through the product of two matrices $A \in \mathbb{R}^{m \times r}$ and $B \in \mathbb{R}^{r \times n}$, where $r \ll \min(m, n)$, $A$ is initialized with Gaussian noise, and $B$ with zeros. LoRA freezes the original model $W$ and updates the "Noise & Zero" adapter, which may lead to slow convergence. To overcome this limitation, we introduce Principal Singular values and Singular vectors Adaptation (PiSSA). PiSSA shares the same architecture as LoRA, but initializes the adapter matrices $A$ and $B$ with the principal components of the original matrix $W$, and puts the remaining components into a residual matrix $W^{res} \in \mathbb{R}^{m \times n}$ which is frozen during fine-tuning. Compared to LoRA, PiSSA updates the principal components while freezing the "residual" parts, allowing faster convergence and enhanced performance. Comparative experiments of PiSSA and LoRA across 12 different models, ranging from 184M to 70B, encompassing 5 NLG and 8 NLU tasks, reveal that PiSSA consistently outperforms LoRA under identical experimental setups. On the GSM8K benchmark, Mistral-7B fine-tuned with PiSSA achieves an accuracy of 72.86%, surpassing LoRA's 67.7% by 5.16%. Due to the same architecture, PiSSA is also compatible with quantization to further reduce the memory requirement of fine-tuning. Compared to QLoRA, QPiSSA (PiSSA with 4-bit quantization) exhibits smaller quantization errors in the initial stages. Fine-tuning LLaMA-3-70B on GSM8K, QPiSSA attains an accuracy of 86.05%, exceeding the performance of QLoRA at 81.73%. Leveraging a fast SVD technique, PiSSA can be initialized in only a few seconds, presenting a negligible cost for transitioning from LoRA to PiSSA.

PiSSA: Enhancing LLMs via Principal Singular values and Singular vectors Adaptation

Introduction to PiSSA

Recent advances in LLMs, notably their efficacy across diverse tasks, have driven growing interest in fine-tuning methodologies. Given the prohibitive computational cost of full-parameter fine-tuning for LLMs, parameter-efficient fine-tuning (PEFT) methods have emerged. Among these, Principal Singular values and Singular vectors Adaptation (PiSSA) is introduced as a novel technique. PiSSA leverages the low intrinsic dimensionality of pretrained LLMs, optimizing a much smaller parameter space while achieving or even surpassing full-parameter fine-tuning performance at significantly lower computational overhead. This is achieved primarily by initializing two trainable matrices, $A$ and $B$, with the principal singular values and singular vectors of each weight matrix $W$, while the remaining components are kept in a frozen residual matrix.

Theoretical Foundations and Related Works

PiSSA is grounded in the hypothesis, shared with work on intrinsic dimensionality and Low-Rank Adaptation (LoRA), that changes in model parameters during fine-tuning exhibit low-rank structure. Whereas LoRA approximates the change in $W$ with randomly initialized adapters, PiSSA initializes its adapters from the principal components of $W$ obtained via singular value decomposition (SVD). Tuning the essential parts of $W$ while freezing the remaining, less informative components yields a quicker and closer approximation of full-parameter fine-tuning than conventional PEFT initialization.
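
Concretely, the two methods share the same adapter architecture and differ only in how the factors are initialized. Writing the rank-$r$ truncated SVD of a pretrained weight as $U_{[:, :r]}\, S_{[:r]}\, V_{[:, :r]}^{\top}$, the setup can be summarized as follows (a paraphrase of the paper's formulation, with the same shapes as in the abstract):

$$\text{LoRA:}\quad W + \Delta W = W + AB, \qquad A \sim \mathcal{N}(0, \sigma^2),\quad B = 0,\quad W \text{ frozen};$$

$$\text{PiSSA:}\quad W = W^{res} + AB, \qquad A = U_{[:, :r]}\, S_{[:r]}^{1/2},\quad B = S_{[:r]}^{1/2}\, V_{[:, :r]}^{\top},\quad W^{res} = W - AB \text{ frozen}.$$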

Methodology

PiSSA's methodological framework decomposes each pretrained weight matrix with SVD to extract its principal singular values and singular vectors. These initialize the trainable matrices $A$ and $B$, which, together with the frozen residual matrix $W^{res}$, reconstruct the original matrix $W$ at initialization while greatly reducing the number of trainable parameters (a minimal initialization sketch follows the list below).

  • The decomposition separates the essential components (captured by $A$ and $B$) from the residual ones (kept in $W^{res}$), focusing fine-tuning on the model's intrinsic, low-dimensional structure.
  • In practice, PiSSA converges faster and performs better than methods like LoRA because the trainable matrices already encapsulate the model's principal capabilities at initialization.
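
The initialization itself amounts to a single SVD per weight matrix. Below is a minimal sketch in PyTorch: `pissa_init` is a hypothetical helper (not the official implementation), and it uses an exact SVD for clarity, whereas the paper notes that a fast randomized SVD can reduce initialization to a few seconds.

```python
import torch

def pissa_init(W: torch.Tensor, r: int):
    """Hypothetical PiSSA-style initialization sketch (not the official code).

    W: pretrained weight matrix of shape (m, n).
    Returns trainable A (m, r), B (r, n) and a frozen residual W_res (m, n)
    such that W = W_res + A @ B holds exactly at initialization.
    """
    # Thin SVD of the pretrained weight: W = U @ diag(S) @ Vh
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)

    # Principal (top-r) singular values/vectors initialize the trainable adapter.
    sqrt_S = torch.sqrt(S[:r])
    A = U[:, :r] * sqrt_S              # shape (m, r)
    B = sqrt_S[:, None] * Vh[:r, :]    # shape (r, n)

    # The remaining components form the frozen residual matrix.
    W_res = W - A @ B
    return A, B, W_res

# Example: rank-16 decomposition of a toy 512 x 512 projection
W = torch.randn(512, 512)
A, B, W_res = pissa_init(W, r=16)
print(torch.allclose(W, W_res + A @ B, atol=1e-4))  # reconstruction check
```

In a training loop, $A$ and $B$ would be registered as trainable parameters while $W^{res}$ is stored as a frozen buffer, mirroring how LoRA adapters are trained.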

Experimental Validation

Extensive experiments, spanning models from 184M to 70B parameters and covering both NLG and NLU tasks, demonstrate that PiSSA not only converges faster than LoRA but also closely approximates full fine-tuning performance with considerably fewer trainable parameters.

  • PiSSA consistently outperforms LoRA across benchmarks and models under identical setups; for example, Mistral-7B fine-tuned with PiSSA reaches 72.86% accuracy on GSM8K, compared with 67.7% for LoRA.
  • The experiments indicate that PiSSA retains LoRA's advantages, including its architecture and parameter efficiency, while addressing its slow convergence by concentrating fine-tuning on the model's principal components.

Practical Implications and Future Outlook

The PiSSA methodology inherits the operational benefits of LoRA, including parameter efficiency and compatibility with model quantization, while changing only how the adapter is initialized. Because the adapter absorbs the principal components, QPiSSA can quantize the frozen residual with smaller initial error than quantizing the full weight matrix (see the sketch below), and the initialization strategy promises broad applicability when adapting LLMs to specific domains or requirements.
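
A rough numerical illustration of the QPiSSA intuition, under loose assumptions: `fake_quant` below is a toy per-row round-to-nearest quantizer (the paper relies on QLoRA-style NF4 quantization, not this scheme), applied to a random stand-in matrix rather than a real pretrained weight. It compares the error of quantizing the full matrix against quantizing only the residual that remains after removing the top singular components.

```python
import torch

def fake_quant(W: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Toy per-row symmetric round-to-nearest quantizer (illustration only;
    QPiSSA in the paper uses NF4/QLoRA-style quantization instead)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = W.abs().amax(dim=1, keepdim=True) / qmax
    return (W / scale).round().clamp(-qmax, qmax) * scale

# Random stand-in for a pretrained weight; a real checkpoint would be used in practice.
W = torch.randn(1024, 1024)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
r = 64
W_res = W - (U[:, :r] * S[:r]) @ Vh[:r, :]   # residual after removing the top-r components

# QPiSSA idea: only the frozen residual is stored in 4 bits; the adapter (A, B)
# stays in higher precision and receives all gradient updates.
err_full = (W - fake_quant(W)).norm().item()          # error if the full weight is quantized (as in QLoRA)
err_res = (W_res - fake_quant(W_res)).norm().item()   # error if only the residual is quantized (as in QPiSSA)
print(f"quantization error on W: {err_full:.2f}, on W_res: {err_res:.2f}")
```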

  • The compatibility of PiSSA with existing LLM architectures and its methodological benefits suggest a promising direction for future research in PEFT, including exploring the application of PiSSA across an even broader range of models and tasks.
  • Potential future developments might focus on the integration of PiSSA with advanced model compression techniques or exploring theoretical frameworks to further elucidate the mechanisms behind its efficiency and effectiveness.

In conclusion, PiSSA presents a significant advancement in the fine-tuning of LLMs, offering a practical, efficient, and effective method for leveraging the intrinsic structural properties of pretrained models to achieve superior performance across a range of tasks. Its methodological nuances and experimental successes highlight its potential as a cornerstone in the ongoing development of PEFT techniques for LLMs.

Authors: Fanxu Meng, Zhaohui Wang, Muhan Zhang