Sparse Matrix in Large Language Model Fine-tuning (2405.15525v2)

Published 24 May 2024 in cs.CL

Abstract: LoRA and its variants have become popular parameter-efficient fine-tuning (PEFT) methods due to their ability to avoid excessive computational costs. However, an accuracy gap often exists between PEFT methods and full fine-tuning (FT), and this gap has yet to be systematically studied. In this work, we introduce a method for selecting sparse sub-matrices that aims to minimize the performance gap between PEFT and FT while also reducing both fine-tuning computational cost and memory cost. Our Sparse Matrix Tuning (SMT) method begins by identifying the most significant sub-matrices in the gradient update, updating only these blocks during the fine-tuning process. In our experiments, we demonstrate that SMT consistently surpasses other PEFT baselines (e.g., LoRA and DoRA) in fine-tuning popular LLMs such as LLaMA across a broad spectrum of tasks, while reducing the GPU memory footprint by 67% compared to FT. We also examine how the performance of LoRA and DoRA tends to plateau and decline as the number of trainable parameters increases; in contrast, our SMT method does not suffer from this issue.

Sparse Matrix Tuning: An Advanced Parameter-Efficient Fine-Tuning Approach

LLMs have demonstrated impressive generalization capabilities. However, fine-tuning these models for specific downstream tasks remains challenging due to significant computational and memory requirements. Sparse Matrix Tuning (SMT), as introduced in this paper, offers a novel parameter-efficient fine-tuning (PEFT) method that addresses these challenges, outperforming established methods such as LoRA and DoRA.

Summary of Methodology

The core innovation of SMT lies in its strategic selection of sparse sub-matrices within the model's weight matrices for fine-tuning. SMT identifies the most significant sub-matrices based on gradient updates during an initial warm-up phase and focuses on these during the fine-tuning process. This selective approach results in both lower computational and memory costs, particularly during backward propagation and parameter updates.
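To make the selection step concrete, below is a minimal sketch (not the authors' released code) of how one might implement the warm-up phase in PyTorch: gradient magnitudes are accumulated for a few steps, each 2-D weight matrix is partitioned into fixed-size blocks, and the highest-scoring blocks are kept. The names `model`, `warmup_loader`, `loss_fn`, `BLOCK`, and `KEEP_RATIO` are illustrative assumptions, not values from the paper.

```python
import torch

BLOCK = 256          # hypothetical sub-matrix (block) edge length
KEEP_RATIO = 0.005   # illustrative fraction of blocks kept trainable

def score_blocks(model, warmup_loader, loss_fn, steps=100):
    """Accumulate absolute gradients per 2-D weight matrix over a short warm-up."""
    grad_acc = {n: torch.zeros_like(p) for n, p in model.named_parameters()
                if p.dim() == 2}
    for step, (inputs, targets) in enumerate(warmup_loader):
        if step >= steps:
            break
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()
        for n, p in model.named_parameters():
            if n in grad_acc and p.grad is not None:
                grad_acc[n] += p.grad.abs()
    return grad_acc

def select_submatrices(grad_acc, block=BLOCK, keep_ratio=KEEP_RATIO):
    """Rank fixed-size blocks by accumulated gradient mass and keep the top ones."""
    scores = []
    for name, g in grad_acc.items():
        rows, cols = g.shape[0] // block, g.shape[1] // block
        blocked = (g[:rows * block, :cols * block]
                   .reshape(rows, block, cols, block)
                   .sum(dim=(1, 3)))
        for i in range(rows):
            for j in range(cols):
                scores.append((blocked[i, j].item(), name, i, j))
    scores.sort(key=lambda s: s[0], reverse=True)
    selected = {}
    for _, name, i, j in scores[:int(len(scores) * keep_ratio)]:
        selected.setdefault(name, []).append((i, j))
    return selected
```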

Key Methodological Contributions:

  1. Sparse Sub-Matrix Selection: SMT uses a warm-up phase to identify the sub-matrices in the model weights that exhibit the largest gradient changes; only these sub-matrices are updated during fine-tuning.
  2. Efficient Backpropagation: By updating only the selected sub-matrices, SMT reduces the computational load of backward propagation to just 0.5% of that of full fine-tuning.
  3. Reduced Memory Footprint: SMT lowers optimizer and activation memory costs by restricting updates to small portions of the model, allowing fine-tuning to fit within the constraints of consumer-level GPUs (a simplified sketch of this selective-update idea follows this list).
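The sketch below is a simplified illustration, not the paper's implementation: it freezes all parameters except the selected weight matrices and registers a gradient hook that zeroes gradient entries outside the chosen blocks, so only those entries are ever updated. It reuses the `selected` dictionary and `block` size assumed in the earlier warm-up sketch.

```python
import torch

def apply_block_masks(model, selected, block=256):
    """Freeze everything except the chosen blocks of the selected weight matrices."""
    for name, param in model.named_parameters():
        if name not in selected:
            param.requires_grad_(False)      # entire tensor stays frozen
            continue
        mask = torch.zeros_like(param, dtype=torch.bool)
        for i, j in selected[name]:
            mask[i * block:(i + 1) * block, j * block:(j + 1) * block] = True
        # Zero gradient entries outside the selected blocks before the
        # optimizer step, so only those entries ever change.
        param.register_hook(lambda grad, m=mask: grad * m)
```

Note that the real memory savings described in the paper come from materializing only the selected blocks as trainable parameters, so optimizer state is allocated only for them; the gradient mask above merely reproduces the update pattern and is an easy way to prototype the idea.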

Experimental Results

The researchers conducted comprehensive experiments on various LLaMA models, including LLaMA-7B, LLaMA-13B, LLaMA2-7B, and LLaMA3-8B, across a spectrum of tasks such as commonsense reasoning and arithmetic reasoning.

Performance Highlights:

  • Gains over SOTA Methods: SMT consistently surpassed LoRA and DoRA across multiple datasets; for example, it showed an average accuracy improvement of 2-3% over these baselines on commonsense reasoning tasks.
  • Scalability: Unlike LoRA and DoRA, which exhibited performance plateaus and declines with increasing numbers of trainable parameters, SMT maintained and even improved performance, demonstrating its robustness and scalability.
  • Resource Efficiency: SMT achieved a 67% reduction in GPU memory footprint compared to full fine-tuning, allowing fine-tuning to fit on GPUs as modest as the NVIDIA RTX 4090.

Theoretical Insights

SMT's empirical findings challenge prevailing assumptions about the components of LLMs that are most critical for downstream task performance. Contrary to previous claims suggesting the dominance of MLP layers, SMT showed that attention mechanisms, particularly the value vectors (V), hold the majority of the influential information.
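One way a practitioner could probe this claim on their own model is to reuse the accumulated gradients from the warm-up sketch above and group their total magnitude by component. The sketch below does this by matching common LLaMA-style parameter-name patterns (q_proj, k_proj, v_proj, o_proj, mlp); these patterns are assumptions about the checkpoint layout, not a check prescribed by the paper.

```python
def gradient_mass_by_component(grad_acc):
    """Group accumulated gradient magnitude by rough component type.

    The name patterns below follow common LLaMA-style checkpoints and are
    assumptions, not taken from the paper.
    """
    groups = {"q_proj": 0.0, "k_proj": 0.0, "v_proj": 0.0,
              "o_proj": 0.0, "mlp": 0.0}
    for name, g in grad_acc.items():
        for key in groups:
            if key in name:
                groups[key] += g.sum().item()
                break
    return groups
```

Under the paper's findings, one would expect the v_proj group to carry a disproportionately large share of the total gradient mass.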

Practical Implications

The practical implications of SMT are significant, particularly in settings where computational resources are a constraint. By enabling efficient fine-tuning on consumer-level hardware, SMT opens up advanced LLM fine-tuning to a broader range of users and applications. Additionally, the substantial reductions in computational and memory costs can lead to faster iteration cycles and lower operational costs in deploying fine-tuned LLMs in production environments.

Future Directions

The findings from this paper pave the way for future research in several areas:

  • Refinement of Sparse Selection: Automating and possibly enhancing the initial warm-up phase to dynamically adjust the sparsity level could further optimize performance and resource usage.
  • Extension to Other Models: Applying SMT to a wider array of model architectures beyond the LLaMA series to validate its generalizability.
  • Hybrid Methods: Combining SMT with other PEFT approaches to leverage the strengths of multiple methods for even greater efficiency and performance.

In conclusion, SMT represents a significant advancement in the field of parameter-efficient fine-tuning of LLMs. By focusing on the most influential sub-matrices, SMT not only bridges the performance gap between PEFT methods and full fine-tuning but also introduces substantial computational and memory efficiencies. These qualities make SMT a valuable contribution to the ongoing development of more efficient and scalable AI technologies.

Authors
  1. Haoze He
  2. Juncheng Billy Li
  3. Xuan Jiang
  4. Heather Miller