
HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy (2401.15207v3)

Published 26 Jan 2024 in cs.LG and cs.CL

Abstract: Full-parameter fine-tuning has become the go-to choice for adapting language models (LMs) to downstream tasks due to its excellent performance. As LMs grow in size, fine-tuning the full parameters of an LM requires a prohibitively large amount of GPU memory. Existing approaches use zeroth-order optimizers to conserve GPU memory, which can compromise the performance of LMs, since non-zeroth-order optimizers tend to converge more readily on most downstream tasks. In this paper, we propose a novel optimizer-independent, end-to-end hierarchical fine-tuning strategy, HiFT, which updates only a subset of parameters at each training step. HiFT significantly reduces the number of gradient and optimizer-state parameters residing in GPU memory at any one time, thereby reducing GPU memory usage. Our results demonstrate that: (1) HiFT achieves performance comparable to parameter-efficient fine-tuning and standard full-parameter fine-tuning; (2) HiFT supports various optimizers, including AdamW, AdaGrad, and SGD; (3) HiFT saves more than 60% of GPU memory compared with standard full-parameter fine-tuning for a 7B model; and (4) HiFT enables full-parameter fine-tuning of a 7B model on a single 48GB A6000 GPU at 32-bit precision with the AdamW optimizer, without any additional memory-saving techniques.

Introduction

Full-Parameter Fine-Tuning (FPFT) of LLMs has been the predominant approach for achieving superior performance across various downstream tasks. However, FPFT comes at the cost of exorbitant GPU memory consumption, posing a substantial barrier for research involving increasingly large models. Efforts to mitigate these memory constraints without compromising performance have produced several techniques, but these often involve complex trade-offs. This paper presents Hierarchical Fine-Tuning (HiFT), a strategy that challenges the status quo by offering a significant reduction in GPU memory usage while maintaining the quality of FPFT.

Related Work

Prior attempts to address the memory challenge include strategies such as heterogeneous memory management and parallelism techniques, which often introduce a substantial communication burden. Parameter-Efficient Fine-Tuning (PEFT) methods, including addition-based, selection-based, and reparametrization-based approaches, offer an alternative but typically exhibit a performance gap relative to FPFT. Concurrently, research on Memory-Efficient Fine-Tuning (MEFT) has produced techniques such as zeroth-order optimization and LOMO, which avoid storing optimizer state parameters but preclude momentum-based optimizers such as AdamW, which are otherwise known to be effective.

HiFT Approach

HiFT moves past the limitations of previous approaches through a hierarchical parameter-update mechanism. It divides the model's layers into groups and updates only one group at a time, reducing the number of parameters whose gradients and optimizer states must be kept in GPU memory at each training step. Unlike layer-wise training, which may accumulate error, HiFT's end-to-end strategy updates parameters in a manner that respects the established network structure. The proposed algorithm drastically reduces the memory footprint of trainable parameters, gradients, and optimizer states, enabling the fine-tuning of very large models on modestly equipped hardware.

Furthermore, HiFT supports a variety of optimizers, offering flexibility in optimizer choice, a significant advantage over earlier MEFT methods. HiFT also introduces three update strategies (bottom2up, top2bottom, and random) that determine the order in which parameter groups are updated, further reinforcing its adaptability; a simplified sketch of the grouped-update loop is given below.
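
To make the grouped-update loop concrete, here is a minimal PyTorch-style sketch, not the authors' implementation. It assumes a model whose top-level children are its layers; the helper names (group_layers, pick_group, hift_style_step), the contiguous grouping heuristic, and the toy nn.Sequential model are illustrative assumptions. In particular, the sketch rebuilds the AdamW optimizer for the active group at every step, which discards momentum between visits; the paper manages optimizer state and learning-rate updates (e.g., via its delayed update strategy) more carefully.

```python
import torch
import torch.nn as nn

def group_layers(model: nn.Module, num_groups: int):
    """Split the model's top-level modules into contiguous groups
    (a simplified stand-in for HiFT's layer grouping)."""
    layers = list(model.children())
    size = max(1, (len(layers) + num_groups - 1) // num_groups)
    return [layers[i:i + size] for i in range(0, len(layers), size)]

def pick_group(step: int, num_groups: int, strategy: str = "bottom2up") -> int:
    """Choose the group to update at this step (bottom2up, top2bottom, or random)."""
    if strategy == "bottom2up":
        return step % num_groups
    if strategy == "top2bottom":
        return num_groups - 1 - (step % num_groups)
    return int(torch.randint(num_groups, (1,)))  # "random" strategy

def hift_style_step(model, groups, step, batch, loss_fn, lr=1e-5, strategy="bottom2up"):
    """One training step that updates a single group of layers, so gradients and
    AdamW states are only ever materialized for that group's parameters."""
    active = pick_group(step, len(groups), strategy)

    for p in model.parameters():                 # freeze everything ...
        p.requires_grad_(False)
    active_params = [p for layer in groups[active] for p in layer.parameters()]
    for p in active_params:                      # ... then unfreeze the active group
        p.requires_grad_(True)

    # NOTE: rebuilding the optimizer each step discards momentum between visits;
    # the paper handles optimizer state and learning-rate updates more carefully.
    optimizer = torch.optim.AdamW(active_params, lr=lr)

    inputs, targets = batch
    loss = loss_fn(model(inputs), targets)
    loss.backward()                              # grads exist only for the active group
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()

# Usage with a toy stack of layers (stand-in for a transformer's blocks):
model = nn.Sequential(*[nn.Linear(16, 16) for _ in range(8)], nn.Linear(16, 2))
groups = group_layers(model, num_groups=3)
x, y = torch.randn(4, 16), torch.randint(0, 2, (4,))
for step in range(6):
    hift_style_step(model, groups, step, (x, y), nn.CrossEntropyLoss())
```

The point of the sketch is the memory behavior: because only one group has requires_grad set, gradients and AdamW moment buffers are allocated for that group alone, while the frozen weights of the rest of the model stay resident but carry no per-step training state.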

Experimental Results

Experimental validation on benchmarks such as GLUE and SuperGLUE shows that HiFT matches or outperforms both standard FPFT and PEFT methods in terms of model quality. Impressively, it enables FPFT of a 7B model on a single 48GB A6000 GPU without resorting to additional memory-saving techniques. In terms of memory profiling, HiFT achieves up to 60% memory savings compared with standard FPFT across various model scales. Performance also remains notably stable regardless of the update order, hinting at the robustness of HiFT's structure. Finally, the paper addresses potential concerns about learning-rate scheduling with a delayed update strategy that keeps parameter updates consistent across groups.
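
To see why the 48GB figure is plausible, here is a back-of-envelope estimate, not a measurement from the paper: it counts only FP32 weights, gradients, and AdamW moment states, ignores activations, buffers, and framework overhead, and the one-eighth active fraction is an illustrative assumption rather than HiFT's actual layer grouping.

```python
def fp32_adamw_memory_gb(num_params: float, active_fraction: float = 1.0) -> float:
    """Back-of-envelope GPU memory (GB) for FP32 weights plus gradients and
    AdamW first/second-moment states for the currently trainable fraction."""
    bytes_per_value = 4                                   # FP32
    weights = num_params * bytes_per_value                # full model stays resident
    per_trainable = 3 * num_params * active_fraction * bytes_per_value  # grad + m + v
    return (weights + per_trainable) / 1e9

print(fp32_adamw_memory_gb(7e9))                         # ~112 GB: standard FPFT, far beyond 48 GB
print(fp32_adamw_memory_gb(7e9, active_fraction=1 / 8))  # ~38.5 GB with 1/8 of layers active
```

Even this crude estimate shows that standard FP32 AdamW fine-tuning of a 7B model cannot fit in 48GB, while keeping training state for only a fraction of the layers at a time brings the total into range.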

Conclusion

The proposed HiFT meets the challenge of fine-tuning large models under memory constraints by delivering an asynchronous, hierarchical pattern of model updates. It not only promises substantial GPU memory savings but also offers scalable performance, flexibility in the choice of optimizer, and potential for future work on large-scale model parallelism. This research marks a solid step forward in the fine-tuning landscape, easing the task of adapting LLMs to specific domains while conserving computational resources.

References (53)
  1. Composable sparse fine-tuning for cross-lingual transfer. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1778–1796, 2022.
  2. Fine-Tuning LLMs: LoRA or Full-Parameter? An in-depth Analysis with Llama-2. 2023. URL https://lightning.ai/pages/community/lora-insights.
  3. Greedy layer-wise training of deep networks. Advances in neural information processing systems, 19, 2006.
  4. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, 2015.
  5. Attention fusion: a light yet efficient late fusion mechanism for task adaptation in nlu. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 857–866, 2022.
  6. Semeval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pages 1–14, 2017.
  7. Parameter-efficient fine-tuning design spaces. arXiv preprint arXiv:2301.01821, 2023.
  8. Training deep nets with sublinear memory cost. ArXiv, abs/1604.06174, 2016. URL https://api.semanticscholar.org/CorpusID:15865278.
  9. Boolq: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2924–2936, 2019.
  10. Krona: Parameter efficient tuning with kronecker adapter. arXiv preprint arXiv:2212.10650, 2022.
  11. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790–2799. PMLR, 2019.
  12. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022a.
  13. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022b. URL https://openreview.net/forum?id=nZeVKeeFYf9.
  14. Compacter: Efficient low-rank hypercomplex adapter layers. Advances in Neural Information Processing Systems, 34:1022–1035, 2021.
  15. Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 252–262, 2018.
  16. Bpipe: Memory-balanced pipeline parallelism for training large language models. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 16639–16653. PMLR, 2023.
  17. Fine-Tuning Llama-2: A Comprehensive Case Study for Tailoring Models to Unique Applications. 2023. URL https://lightning.ai/pages/community/lora-insights.
  18. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, 2021.
  19. The winograd schema challenge. In Thirteenth international conference on the principles of knowledge representation and reasoning, 2012.
  20. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 7871–7880. Association for Computational Linguistics, 2020.
  21. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 4582–4597. Association for Computational Linguistics, 2021a.
  22. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, 2021b.
  23. Scaling down to scale up: A guide to parameter-efficient fine-tuning. arXiv preprint arXiv:2303.15647, 2023.
  24. Multilingual denoising pre-training for neural machine translation. Trans. Assoc. Comput. Linguistics, 8:726–742, 2020.
  25. Decoupled weight decay regularization. In International Conference on Learning Representations, 2017. URL https://api.semanticscholar.org/CorpusID:53592270.
  26. Full parameter fine-tuning for large language models with limited resources. CoRR, abs/2306.09782, 2023.
  27. Fine-tuning language models with just forward passes. CoRR, abs/2305.17333, 2023.
  28. Efficient large-scale language model training on GPU clusters using megatron-lm. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2021, St. Louis, Missouri, USA, November 14-19, 2021, page 58. ACM, 2021.
  29. The e2e dataset: New challenges for end-to-end generation. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 201–206, 2017.
  30. Wic: the word-in-context dataset for evaluating context-sensitive meaning representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1267–1273, 2019.
  31. Training large neural networks with constant memory using a new execution algorithm. CoRR, abs/2002.05645, 2020.
  32. Zero: memory optimizations toward training trillion parameter models. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2020, Virtual Event / Atlanta, Georgia, USA, November 9-19, 2020, page 20. IEEE/ACM, 2020a.
  33. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020b.
  34. Zero-infinity: breaking the GPU memory wall for extreme scale deep learning. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2021, St. Louis, Missouri, USA, November 14-19, 2021, page 59. ACM, 2021.
  35. Know what you don’t know: Unanswerable questions for squad. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 784–789, 2018.
  36. Sebastian Raschka. Finetuning LLMs with LoRA and QLoRA: Insights from Hundreds of Experiments. 2023. URL https://lightning.ai/pages/community/lora-insights.
  37. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
  38. Choice of plausible alternatives: An evaluation of commonsense causal reasoning. In 2011 AAAI Spring Symposium Series, 2011.
  39. Mesh-tensorflow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pages 10435–10444, 2018.
  40. Staged training for transformer language models. In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pages 19893–19908. PMLR, 2022.
  41. Megatron-lm: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053, 2019.
  42. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631–1642, 2013.
  43. A comparative study between full-parameter and lora-based fine-tuning on chinese instruction data for instruction following large language model. CoRR, abs/2304.08109, 2023.
  44. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
  45. Building a question answering test collection. In Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000. URL https://api.semanticscholar.org/CorpusID:11465263.
  46. Efficient fine-tuning of bert models on the edge. In 2022 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1838–1842. IEEE, 2022.
  47. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, 2018. URL https://api.semanticscholar.org/CorpusID:5034059.
  48. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32, 2019.
  49. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641, 2019.
  50. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of NAACL-HLT, pages 1112–1122, 2018.
  51. YUAN 2.0: A large language model with localized filtering-based attention. CoRR, abs/2311.15786, 2023.
  52. Bitfit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–9, 2022.
  53. Mpipemoe: Memory efficient moe for pre-trained models with adaptive pipeline parallelism. In IEEE International Parallel and Distributed Processing Symposium, IPDPS 2023, St. Petersburg, FL, USA, May 15-19, 2023, pages 167–177. IEEE.
Authors (8)
  1. Yongkang Liu
  2. Yiqun Zhang
  3. Qian Li
  4. Shi Feng
  5. Daling Wang
  6. Yifei Zhang
  7. Hinrich Schütze
  8. Tong Liu