Introduction
Full-Parameter Fine-Tuning (FPFT) of LLMs has been the predominant approach for achieving strong performance across downstream tasks. However, FPFT comes at the cost of exorbitant GPU memory consumption, posing a substantial barrier to research on increasingly large models. Efforts to ease these memory constraints without compromising performance have produced several techniques, but these often involve complex trade-offs. This paper presents Hierarchical Fine-Tuning (HiFT), a strategy that challenges the status quo by substantially reducing GPU memory usage while maintaining the quality of FPFT.
Related Work
Prior attempts to address the memory challenge include strategies such as heterogeneous memory and parallelism techniques, which often introduce a communication burden. Parameter-Efficient Fine-Tuning (PEFT) solutions, spanning addition-based, selection-based, and reparametrization-based methods, offer an alternative but typically leave a performance gap relative to FPFT. Concurrently, research on Memory-Efficient Fine-Tuning (MEFT) has produced approaches such as zeroth-order optimizers and LOMO, which challenge the traditional need to store optimizer state but preclude momentum-based optimizers such as AdamW, which are otherwise known to be effective.
HiFT Approach
HiFT moves past the limitations of previous approaches with a hierarchical parameter-update mechanism. It divides the model's layers into groups and updates only one group at a time, so that only the active group's parameters, gradients, and optimizer states must be held in GPU memory during each training step. Unlike layer-wise training, which can suffer from accumulated error, HiFT's end-to-end strategy updates parameters in a way that respects the established network structure. The resulting reduction in the memory footprint of trainable parameters, gradients, and optimizer states enables fine-tuning of very large models on modestly equipped hardware.
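The group-wise update can be sketched as follows. This is a minimal illustration in PyTorch, not the authors' implementation: the toy model, the helper names `group_layers` and `hift_step`, and the group size are all assumptions. The key point it demonstrates is that the full network participates in the forward and backward pass, while gradients and optimizer state are materialized only for the active group.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

def group_layers(layers, group_size):
    """Partition a list of layers into consecutive groups (illustrative helper)."""
    return [layers[i:i + group_size] for i in range(0, len(layers), group_size)]

def hift_step(model_groups, loss_fn, data, group_idx, lr=0.1):
    """Update only the parameters of one group; all others stay frozen.

    The optimizer (and thus its state) is constructed over the active
    group's parameters alone, which is where the memory saving comes from.
    """
    # Freeze everything, then unfreeze the active group.
    for group in model_groups:
        for layer in group:
            for p in layer.parameters():
                p.requires_grad_(False)
    active_params = [p for layer in model_groups[group_idx]
                     for p in layer.parameters()]
    for p in active_params:
        p.requires_grad_(True)

    # Momentum optimizers like AdamW remain usable, per the paper's claim.
    opt = torch.optim.AdamW(active_params, lr=lr)
    x, y = data
    loss = loss_fn(x, y)   # full forward pass through the whole model
    loss.backward()        # gradients stored only for active parameters
    opt.step()
    opt.zero_grad(set_to_none=True)
    return loss.item()

# Toy model: six linear layers split into three groups of two.
layers = [nn.Linear(4, 4) for _ in range(6)]
model = nn.Sequential(*layers)
groups = group_layers(layers, group_size=2)

x, y = torch.randn(8, 4), torch.randn(8, 4)
loss_fn = lambda x, y: nn.functional.mse_loss(model(x), y)

before = [l.weight.detach().clone() for l in layers]
hift_step(groups, loss_fn, (x, y), group_idx=0)
# Only the layers in group 0 (layers 0 and 1) change; the rest are untouched.
changed = [not torch.equal(before[i], layers[i].weight.detach())
           for i in range(6)]
print(changed)
```

In a real training loop the step would also move the inactive groups' optimizer state off the GPU; this sketch only shows the selective-update logic.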
Furthermore, HiFT accommodates a variety of optimizers, a significant advantage over earlier MEFT methods that restrict optimizer choice. HiFT also introduces three update strategies, bottom2up, top2bottom, and random, providing different orders in which the grouped parameters are visited and further reinforcing its adaptability.
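The three strategies amount to choosing the order in which group indices are visited during a sweep. A minimal sketch, assuming bottom2up means starting from the groups nearest the input and that the random order is reshuffled per sweep (the function name and seed handling are illustrative):

```python
import random

def update_order(num_groups, strategy, seed=None):
    """Return the order in which layer groups are visited in one sweep.

    bottom2up: input-side groups first; top2bottom: output-side first;
    random: a fresh shuffle (seedable here for reproducibility).
    """
    order = list(range(num_groups))
    if strategy == "bottom2up":
        return order
    if strategy == "top2bottom":
        return order[::-1]
    if strategy == "random":
        rng = random.Random(seed)
        rng.shuffle(order)
        return order
    raise ValueError(f"unknown strategy: {strategy!r}")

print(update_order(4, "bottom2up"))   # [0, 1, 2, 3]
print(update_order(4, "top2bottom"))  # [3, 2, 1, 0]
```

Since the paper reports stable performance across all three orders, the choice is largely a matter of convenience.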
Experimental Results
Experimental validation on benchmarks such as GLUE and SuperGLUE shows that HiFT matches or outperforms both standard FPFT and other PEFT methods. Impressively, it enables full-parameter fine-tuning of a 7B model on a single 48 GB A6000 GPU without resorting to additional memory-saving mechanisms. Memory profiling shows that HiFT saves up to 60% of GPU memory compared to standard FPFT across various model scales. Performance also remains stable irrespective of the update order, hinting at the robustness of HiFT's structure. Importantly, the paper addresses prospective concerns about learning-rate updates through a delayed update strategy that helps maintain consistency across group updates.
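One plausible reading of the delayed update strategy, sketched below under that assumption (the function name and decay schedule are hypothetical, not from the paper): the learning-rate schedule advances only once per full sweep over all groups, so every group sees the same learning rate within a sweep rather than a rate that drifts between group updates.

```python
def delayed_lr_schedule(base_lr, decay, num_groups, total_steps):
    """Learning rate for each group-update step, advancing the schedule
    only after every group has taken its turn (one full sweep).

    A hedged illustration of a delayed update: within a sweep all groups
    share one rate, keeping their updates mutually consistent.
    """
    lrs, lr = [], base_lr
    for step in range(total_steps):
        lrs.append(lr)
        # Decay only at sweep boundaries, not after every group update.
        if (step + 1) % num_groups == 0:
            lr *= decay
    return lrs

print(delayed_lr_schedule(0.1, 0.5, num_groups=3, total_steps=6))
# three steps at one rate, then three at the decayed rate
```

Without the delay, a per-step decay would give later groups in each sweep systematically smaller updates than earlier ones.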
Conclusion
The proposed HiFT meets the challenge of fine-tuning large models under memory constraints by delivering a hierarchical, group-by-group pattern of model updates. It promises substantial GPU memory savings alongside scalable performance, operational flexibility with different optimizers, and potential for future large-scale model-parallelism development. This research marks a notable step forward in the fine-tuning landscape, easing the adaptation of LLMs to specific domains while conserving computational resources.