
HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy (2401.15207v3)

Published 26 Jan 2024 in cs.LG and cs.CL

Abstract: Full-parameter fine-tuning has become the go-to choice for adapting language models (LMs) to downstream tasks due to its excellent performance. As LMs grow in size, fine-tuning the full parameters of an LM requires a prohibitively large amount of GPU memory. Existing approaches use zeroth-order optimizers to conserve GPU memory, which can compromise the performance of LMs, since non-zeroth-order optimizers tend to converge more readily on most downstream tasks. In this paper, we propose a novel optimizer-independent, end-to-end hierarchical fine-tuning strategy, HiFT, which updates only a subset of parameters at each training step. HiFT significantly reduces the number of gradient and optimizer-state parameters residing in GPU memory at any one time, thereby reducing GPU memory usage. Our results demonstrate that: (1) HiFT achieves performance comparable to parameter-efficient fine-tuning and standard full-parameter fine-tuning; (2) HiFT supports various optimizers, including AdamW, AdaGrad, and SGD; (3) HiFT saves more than 60% of GPU memory compared with standard full-parameter fine-tuning for a 7B model; and (4) HiFT enables full-parameter fine-tuning of a 7B model on a single 48GB A6000 GPU at 32-bit precision with the AdamW optimizer, without any additional memory-saving techniques.

Introduction

Full-Parameter Fine-Tuning (FPFT) of LLMs has been the predominant approach for achieving superior performance across various downstream tasks. However, FPFT comes at the cost of exorbitant GPU memory consumption, posing a substantial barrier for research involving increasingly large models. Efforts to mitigate these memory constraints without compromising performance have produced several techniques, but these often involve complex trade-offs. This paper presents Hierarchical Fine-Tuning (HiFT), a strategy that challenges the status quo by offering a significant reduction in GPU memory usage while maintaining the quality of FPFT.

Related Work

Prior attempts to address the memory challenge include strategies such as heterogeneous memory management and parallelism techniques, which often introduce a substantial communication burden. Parameter-Efficient Fine-Tuning (PEFT) methods, including addition-based, selection-based, and reparametrization-based approaches, offer an alternative but typically exhibit a performance gap relative to FPFT. Concurrently, research on Memory-Efficient Fine-Tuning (MEFT) has produced techniques such as zeroth-order optimization and LOMO, which avoid storing optimizer state parameters but preclude momentum-based optimizers such as AdamW, which are otherwise known to be effective.

HiFT Approach

HiFT moves past the limitations of previous approaches through a hierarchical parameter-update mechanism. It divides the model's layers into groups and updates only one group at a time, reducing the number of parameters whose gradients and optimizer states must be kept in GPU memory at each training step. Unlike layer-wise training, which may accumulate error, HiFT's end-to-end strategy updates parameters in a manner that respects the established network structure. The proposed algorithm drastically reduces the memory footprint of trainable parameters, gradients, and optimizer states, enabling the fine-tuning of very large models on modestly equipped hardware.

Furthermore, HiFT supports a variety of optimizers, offering flexibility in optimizer choice, a significant advantage over earlier MEFT methods. HiFT also introduces three update strategies (bottom2up, top2bottom, and random) that determine the order in which parameter groups are updated, further reinforcing its adaptability; a simplified sketch of the grouped-update loop is given below.
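
To make the grouped-update loop concrete, here is a minimal PyTorch-style sketch, not the authors' implementation. It assumes a model whose top-level children are its layers; the helper names (group_layers, pick_group, hift_style_step), the contiguous grouping heuristic, and the toy nn.Sequential model are illustrative assumptions. In particular, the sketch rebuilds the AdamW optimizer for the active group at every step, which discards momentum between visits; the paper manages optimizer state and learning-rate updates (e.g., via its delayed update strategy) more carefully.

```python
import torch
import torch.nn as nn

def group_layers(model: nn.Module, num_groups: int):
    """Split the model's top-level modules into contiguous groups
    (a simplified stand-in for HiFT's layer grouping)."""
    layers = list(model.children())
    size = max(1, (len(layers) + num_groups - 1) // num_groups)
    return [layers[i:i + size] for i in range(0, len(layers), size)]

def pick_group(step: int, num_groups: int, strategy: str = "bottom2up") -> int:
    """Choose the group to update at this step (bottom2up, top2bottom, or random)."""
    if strategy == "bottom2up":
        return step % num_groups
    if strategy == "top2bottom":
        return num_groups - 1 - (step % num_groups)
    return int(torch.randint(num_groups, (1,)))  # "random" strategy

def hift_style_step(model, groups, step, batch, loss_fn, lr=1e-5, strategy="bottom2up"):
    """One training step that updates a single group of layers, so gradients and
    AdamW states are only ever materialized for that group's parameters."""
    active = pick_group(step, len(groups), strategy)

    for p in model.parameters():                 # freeze everything ...
        p.requires_grad_(False)
    active_params = [p for layer in groups[active] for p in layer.parameters()]
    for p in active_params:                      # ... then unfreeze the active group
        p.requires_grad_(True)

    # NOTE: rebuilding the optimizer each step discards momentum between visits;
    # the paper handles optimizer state and learning-rate updates more carefully.
    optimizer = torch.optim.AdamW(active_params, lr=lr)

    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)
    loss.backward()                              # grads exist only for the active group
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()

# Usage with a toy stack of layers (stand-in for a transformer's blocks):
model = nn.Sequential(*[nn.Linear(16, 16) for _ in range(8)], nn.Linear(16, 2))
groups = group_layers(model, num_groups=3)
x, y = torch.randn(4, 16), torch.randint(0, 2, (4,))
for step in range(6):
    hift_style_step(model, groups, step, (x, y), nn.CrossEntropyLoss())
```

The point of the sketch is the memory behavior: because only one group has requires_grad set, gradients and AdamW moment buffers are allocated for that group alone, while the frozen weights of the rest of the model stay resident but carry no per-step training state.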

Experimental Results

Experimental validation on benchmarks such as GLUE and SuperGLUE shows that HiFT matches or outperforms both standard FPFT and PEFT methods in terms of model quality. Impressively, it enables FPFT of a 7B model on a single 48GB A6000 GPU without resorting to additional memory-saving techniques. In terms of memory profiling, HiFT achieves up to 60% memory savings compared with standard FPFT across various model scales. Performance also remains notably stable regardless of the update order, hinting at the robustness of HiFT's structure. Finally, the paper addresses potential concerns about learning-rate scheduling with a delayed update strategy that keeps parameter updates consistent across groups.
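
To see why the 48GB figure is plausible, here is a back-of-envelope estimate, not a measurement from the paper: it counts only FP32 weights, gradients, and AdamW moment states, ignores activations, buffers, and framework overhead, and the one-eighth active fraction is an illustrative assumption rather than HiFT's actual layer grouping.

```python
def fp32_adamw_memory_gb(num_params: float, active_fraction: float = 1.0) -> float:
    """Back-of-envelope GPU memory (GB) for FP32 weights plus gradients and
    AdamW first/second-moment states for the currently trainable fraction."""
    bytes_per_value = 4                                   # FP32
    weights = num_params * bytes_per_value                # full model stays resident
    per_trainable = 3 * num_params * active_fraction * bytes_per_value  # grad + m + v
    return (weights + per_trainable) / 1e9

print(fp32_adamw_memory_gb(7e9))                         # ~112 GB: standard FPFT, far beyond 48 GB
print(fp32_adamw_memory_gb(7e9, active_fraction=1 / 8))  # ~38.5 GB with 1/8 of layers active
```

Even this crude estimate shows that standard FP32 AdamW fine-tuning of a 7B model cannot fit in 48GB, while keeping training state for only a fraction of the layers at a time brings the total into range.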

Conclusion

The proposed HiFT meets the challenge of fine-tuning large models under memory constraints by delivering an asynchronous, hierarchical pattern of model updates. It not only promises substantial GPU memory savings but also offers scalable performance, flexibility in the choice of optimizer, and potential for future work on large-scale model parallelism. This research marks a solid step forward in the fine-tuning landscape, easing the task of adapting LLMs to specific domains while conserving computational resources.

References (53)
  1. Composable sparse fine-tuning for cross-lingual transfer. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1778–1796, 2022.
  2. Fine-Tuning LLMs: LoRA or Full-Parameter? An in-depth Analysis with Llama-2. 2023. URL https://lightning.ai/pages/community/lora-insights.
  3. Greedy layer-wise training of deep networks. Advances in neural information processing systems, 19, 2006.
  4. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, 2015.
  5. Attention fusion: a light yet efficient late fusion mechanism for task adaptation in nlu. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 857–866, 2022.
  6. Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, 2017.
  7. Parameter-efficient fine-tuning design spaces. arXiv preprint arXiv:2301.01821, 2023.
  8. Training deep nets with sublinear memory cost. ArXiv, abs/1604.06174, 2016. URL https://api.semanticscholar.org/CorpusID:15865278.
  9. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, 2019.
  10. Krona: Parameter efficient tuning with kronecker adapter. arXiv preprint arXiv:2212.10650, 2022.
  11. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
  12. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022a.
  13. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022b. URL https://openreview.net/forum?id=nZeVKeeFYf9.
  14. Compacter: Efficient low-rank hypercomplex adapter layers. Advances in Neural Information Processing Systems, 34:1022–1035, 2021.
  15. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–262, 2018.
  16. Bpipe: Memory-balanced pipeline parallelism for training large language models. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 16639–16653. PMLR, 2023.
  17. Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring Models to Unique Applications. 2023. URL https://lightning.ai/pages/community/lora-insights.
  18. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, 2021.
  19. The winograd schema challenge. In Thirteenth international conference on the principles of knowledge representation and reasoning, 2012.
  20. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7871–7880. Association for Computational Linguistics, 2020.
  21. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 4582–4597. Association for Computational Linguistics, 2021a.
  22. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, 2021b.
  23. Scaling down to scale up: A guide to parameter-efficient fine-tuning. arXiv preprint arXiv:2303.15647, 2023.
  24. Multilingual denoising pre-training for neural machine translation. Trans. Assoc. Comput. Linguistics, 8:726–742, 2020.
  25. Decoupled weight decay regularization. In International Conference on Learning Representations, 2017. URL https://api.semanticscholar.org/CorpusID:53592270.
  26. Full parameter fine-tuning for large language models with limited resources. CoRR, abs/2306.09782, 2023.
  27. Fine-tuning language models with just forward passes. CoRR, abs/2305.17333, 2023.
  28. Efficient large-scale language model training on GPU clusters using megatron-lm. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2021, St. Louis, Missouri, USA, November 14-19, 2021, page 58. ACM, 2021.
  29. The e2e dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201–206, 2017.
  30. Wic: the word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1267–1273, 2019.
  31. Training large neural networks with constant memory using a new execution algorithm. CoRR, abs/2002.05645, 2020.
  32. Zero: memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020, Virtual Event / Atlanta, Georgia, USA, November 9-19, 2020, page 20. IEEE/ACM, 2020a.
  33. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020b.
  34. Zero-infinity: breaking the GPU memory wall for extreme scale deep learning. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2021, St. Louis, Missouri, USA, November 14-19, 2021, page 59. ACM, 2021.
  35. Know what you don’t know: Unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, 2018.
  36. Sebastian Raschka. Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments. 2023. URL https://lightning.ai/pages/community/lora-insights.
  37. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
  38. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011.
  39. Mesh-tensorflow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 10435–10444, 2018.
  40. Staged training for transformer language models. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 19893–19908. PMLR, 2022.
  41. Megatron-lm: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053, 2019.
  42. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642, 2013.
  43. A comparative study between full-parameter and lora-based fine-tuning on chinese instruction data for instruction following large language model. CoRR, abs/2304.08109, 2023.
  44. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  45. Building a question answering test collection. In Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000. URL https://api.semanticscholar.org/CorpusID:11465263.
  46. Efficient fine-tuning of bert models on the edge. In 2022 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1838–1842. IEEE, 2022.
  47. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018. URL https://api.semanticscholar.org/CorpusID:5034059.
  48. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32, 2019.
  49. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641, 2019.
  50. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL-HLT, pages 1112–1122, 2018.
  51. YUAN 2.0: A large language model with localized filtering-based attention. CoRR, abs/2311.15786, 2023.
  52. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–9, 2022.
  53. Mpipemoe: Memory efficient moe for pre-trained models with adaptive pipeline parallelism. In IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023, St. Petersburg, FL, USA, May 15-19, 2023, pages 167–177. IEEE.
Authors (8)
  1. Yongkang Liu
  2. Yiqun Zhang
  3. Qian Li
  4. Shi Feng
  5. Daling Wang
  6. Yifei Zhang
  7. Hinrich Schütze
  8. Tong Liu