LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning (2403.17919v3)

Published 26 Mar 2024 in cs.LG, cs.AI, cs.CL, and math.OC

Abstract: The machine learning community has witnessed impressive advancements since LLMs first appeared. Yet, their massive memory consumption has become a significant roadblock to large-scale training. For instance, a 7B model typically requires at least 60 GB of GPU memory with full parameter training, which presents challenges for researchers without access to high-resource environments. Parameter Efficient Fine-Tuning techniques such as Low-Rank Adaptation (LoRA) have been proposed to alleviate this problem. However, in most large-scale fine-tuning settings, their performance does not reach the level of full parameter training because they confine the parameter search to a low-rank subspace. Attempting to complement this deficiency, we investigate the layerwise properties of LoRA on fine-tuning tasks and observe an unexpected but consistent skewness of weight norms across different layers. Utilizing this key observation, a surprisingly simple training strategy is discovered, which outperforms both LoRA and full parameter training in a wide range of settings with memory costs as low as LoRA. We name it Layerwise Importance Sampled AdamW (LISA), a promising alternative for LoRA, which applies the idea of importance sampling to different layers in LLMs and randomly freezes most middle layers during optimization. Experimental results show that with similar or less GPU memory consumption, LISA surpasses LoRA or even full parameter tuning in downstream fine-tuning tasks, where LISA consistently outperforms LoRA by over 10%-35% in terms of MT-Bench score while achieving on-par or better performance in MMLU, AGIEval and WinoGrande. On large models, specifically LLaMA-2-70B, LISA surpasses LoRA on MT-Bench, GSM8K, and PubMedQA, demonstrating its effectiveness across different domains.

LISA: A Novel Approach for Efficient LLM Fine-Tuning

Introduction to LISA

The quest to make fine-tuning of LLMs more efficient has led to the development of Layerwise Importance Sampled AdamW (LISA). The approach targets a significant hurdle in the use of LLMs: the excessive memory consumption of large-scale training. Existing Parameter Efficient Fine-Tuning (PEFT) techniques, most notably Low-Rank Adaptation (LoRA), have made strides in addressing this issue, but they do not consistently match full parameter training across settings. LISA emerges as a strategic alternative: it builds on the layerwise properties observed during LoRA fine-tuning to reduce memory usage while preserving or improving training performance.

Motivation and Key Observations

The motivation behind LISA stems from an analysis of LoRA's behavior across the layers of an LLM. When LoRA is used for fine-tuning, the weight norms show a consistent skew across layers rather than an even distribution. This unevenness suggests that layers differ in importance during training, the foundational observation that inspired LISA. By applying importance sampling to the layers of an LLM, LISA updates only a small subset of layers at a time, significantly reducing memory consumption while maintaining or improving training effectiveness. The sketch below illustrates how such layerwise norms can be measured.
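To make the observation concrete, here is a minimal sketch (not the authors' code) of how per-layer weight norms can be tracked during fine-tuning. It assumes a PyTorch transformer whose parameter names contain a `layers.<index>.` pattern, as in LLaMA-style Hugging Face checkpoints; the regular expression would need adapting for other architectures.

```python
# Minimal sketch: mean L2 weight norm per transformer layer, to inspect the
# layerwise skew described above. Assumes parameter names of the form
# "...layers.<idx>...." (LLaMA-style); adjust the regex for other models.
import re
from collections import defaultdict

import torch


def layerwise_weight_norms(model: torch.nn.Module) -> dict[int, float]:
    """Return the mean L2 norm of trainable weights, grouped by layer index."""
    norms, counts = defaultdict(float), defaultdict(int)
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        match = re.search(r"layers\.(\d+)\.", name)
        if match is None:
            continue  # embeddings, LM head, final norm, etc.
        idx = int(match.group(1))
        norms[idx] += param.detach().norm().item()
        counts[idx] += 1
    return {idx: norms[idx] / counts[idx] for idx in sorted(norms)}
```

Logging these values every few hundred steps for a LoRA run versus a full-parameter run is one simple way to visualize the kind of layerwise skew the paper reports.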

The LISA Algorithm

LISA applies AdamW updates to only a subset of layers at a time, sampled according to predetermined per-layer probabilities, and freezes the remaining layers (mostly the middle ones) during optimization. This selective updating is designed to emulate the skewed update pattern observed under LoRA, but without confining updates to LoRA's low-rank subspace. Experimental results support the design, showing that LISA can outperform both LoRA and full parameter training across various settings at similar or lower memory cost. A simplified sketch of the training loop follows.
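Below is a minimal sketch of this training scheme, not the authors' implementation. It assumes a LLaMA-style Hugging Face causal LM exposing its decoder blocks at `model.model.layers`, samples the active layers uniformly (the simplest choice of sampling probabilities), keeps the embeddings and LM head trainable throughout (an assumption about the setup), and rebuilds the AdamW optimizer whenever the active set changes. All hyperparameters are illustrative only.

```python
# Sketch of LISA-style training: periodically resample a few trainable layers,
# freeze the rest, and run AdamW on the active parameters only.
import random

import torch
from torch.optim import AdamW


def resample_active_layers(model, n_active: int):
    layers = model.model.layers  # decoder blocks of a LLaMA-style model
    active = set(random.sample(range(len(layers)), n_active))
    for i, layer in enumerate(layers):
        for p in layer.parameters():
            p.requires_grad = i in active
    # Assumption: embeddings and the output head stay trainable throughout.
    for p in model.get_input_embeddings().parameters():
        p.requires_grad = True
    for p in model.get_output_embeddings().parameters():
        p.requires_grad = True
    return active


def lisa_train(model, data_loader, steps=1000, n_active=2, sample_period=50, lr=1e-5):
    step = 0
    optimizer = None
    for batch in data_loader:  # each batch: input_ids, attention_mask, labels
        if step % sample_period == 0:
            resample_active_layers(model, n_active)
            # Rebuild the optimizer so its state covers only active parameters.
            optimizer = AdamW(
                [p for p in model.parameters() if p.requires_grad], lr=lr
            )
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        step += 1
        if step >= steps:
            break
```

Because the optimizer only ever holds state for the currently active layers, the AdamW moment buffers for frozen layers never exist, which is where the memory savings relative to full-parameter AdamW come from.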

Experimental Evaluation and Results

Extensive evaluations show strong performance in fine-tuning tasks for modern LLMs. LISA consistently outperformed LoRA by over 10%-35% in MT-Bench scores and performed well on large models such as LLaMA-2-70B across domains including instruction following, medical QA, and math problems. LISA also showed notable memory efficiency, enabling training of models up to 70B parameters with GPU memory consumption similar to or lower than LoRA.

Implications and Future Directions

The introduction of LISA marks a meaningful advance in LLM fine-tuning. Its memory-efficient training strategy offers a practical answer to the challenges of large-scale LLM training, and its strong numerical results against existing PEFT techniques underscore its potential for future research and applications. Looking ahead, refining the layerwise importance sampling strategy and applying LISA to even larger models are promising directions for extending its utility.

In summary, LISA's innovative approach to layerwise importance sampling represents a notable leap forward in the efficient and effective fine-tuning of LLMs. Its ability to conserve memory while delivering improved performance metrics opens new avenues for research and practical applications of LLMs across various domains.

Authors (7)
  1. Rui Pan (67 papers)
  2. Xiang Liu (475 papers)
  3. Shizhe Diao (47 papers)
  4. Renjie Pi (37 papers)
  5. Jipeng Zhang (46 papers)
  6. Chi Han (30 papers)
  7. Tong Zhang (569 papers)
Citations (24)