Full Parameter Fine-tuning for Large Language Models with Limited Resources (2306.09782v2)

Published 16 Jun 2023 in cs.CL

Abstract: LLMs have revolutionized NLP but demand massive GPU resources for training. Lowering the threshold for LLMs training would encourage greater participation from researchers, benefiting both academia and society. While existing approaches have focused on parameter-efficient fine-tuning, which tunes or adds a small number of parameters, few have addressed the challenge of tuning the full parameters of LLMs with limited resources. In this work, we propose a new optimizer, LOw-Memory Optimization (LOMO), which fuses the gradient computation and the parameter update in one step to reduce memory usage. By integrating LOMO with existing memory saving techniques, we reduce memory usage to 10.8% compared to the standard approach (DeepSpeed solution). Consequently, our approach enables the full parameter fine-tuning of a 65B model on a single machine with 8 RTX 3090 GPUs, each with 24GB memory. Code and data are available at https://github.com/OpenLMLab/LOMO.

Full Parameter Fine-tuning for LLMs with Limited Resources

Introduction

The paper "Full Parameter Fine-tuning for LLMs with Limited Resources" by Kai Lv, Yuqing Yang, Tengxiao Liu, Qinghui Gao, Qipeng Guo, and Xipeng Qiu addresses a critical bottleneck in the field of NLP: the significant hardware requirements necessary to fine-tune LLMs. The authors introduce a new optimization technique, LOw-Memory Optimization (LOMO), which aims to facilitate full parameter fine-tuning of LLMs without necessitating extensive hardware resources.

Background

Current LLMs, spanning roughly 30 billion to 175 billion parameters, require extensive GPU memory for fine-tuning, often demanding multiple high-capacity GPUs and thereby limiting accessibility for smaller research labs. Parameter-efficient fine-tuning methods such as LoRA and Prefix-tuning reduce resource requirements, but they do not offer the comprehensive adaptability and performance that full parameter fine-tuning provides.

Contributions

The primary contributions of this work are as follows:

  1. Theoretical Analysis: The authors provide a theoretical argument that Stochastic Gradient Descent (SGD) is sufficient for fine-tuning LLMs. In particular, the classic pitfalls of SGD, such as large-curvature regions of the loss surface, local optima, and saddle points, are argued to be less troublesome in the relatively smooth loss landscape encountered when fine-tuning LLMs (a brief illustration of the resulting memory savings follows this list).
  2. LOMO Optimizer: LOMO, the proposed optimizer, merges gradient computation with parameter updates, thus minimizing gradient tensor memory usage to the scale of the largest single gradient tensor rather than the cumulative size of all gradients.
  3. Memory Efficiency: A thorough empirical evaluation demonstrated that LOMO reduces memory consumption dramatically, enabling the fine-tuning of a 65 billion parameter model on a single machine with 8 RTX 3090 GPUs.
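To make the memory argument concrete, the snippet below is a minimal illustration (our own sketch, not the authors' code) of the per-parameter state involved: a plain SGD step touches only the parameter and its gradient, whereas an Adam-style optimizer also keeps two persistent moment tensors per parameter.

```python
import torch

p = torch.randn(4096, 4096, requires_grad=True)  # one weight matrix
p.grad = torch.randn_like(p)                      # pretend backward() has already run
lr = 1e-3

# Plain SGD: the update reads only the parameter and its gradient;
# no optimizer state survives the step.
with torch.no_grad():
    p.add_(p.grad, alpha=-lr)

# Adam / AdamW: two persistent moment tensors per parameter (and, under mixed
# precision, an fp32 master copy as well) -- the memory LOMO avoids by using SGD.
exp_avg = torch.zeros_like(p)      # first moment estimate
exp_avg_sq = torch.zeros_like(p)   # second moment estimate
```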

Methodology

LOMO's effectiveness is based on three components:

  1. SGD Utilization: By leveraging the simplicity of SGD, the authors eliminate the need for storing intermediate optimizer states, which are typically substantial for advanced optimizers like Adam.
  2. Fusion Update: LOMO integrates gradient computation directly with the parameter update during backpropagation, so full gradient tensors never need to be stored, which reduces memory overhead significantly; a hedged code sketch of this fused update appears after this list.
  3. Precision Stabilization: To combat issues arising from mixed-precision training, such as precision degradation, LOMO incorporates techniques like gradient normalization, loss scaling, and selective full-precision computations.
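The sketch below illustrates the fused gradient-computation-and-update idea with PyTorch per-parameter hooks. It is a simplified approximation under our own assumptions (plain SGD, a fixed learning rate; the helper name attach_fused_sgd_hooks is ours), not the authors' implementation, which lives in the linked repository.

```python
import torch

def attach_fused_sgd_hooks(model: torch.nn.Module, lr: float = 1e-3) -> None:
    """Update each parameter as soon as its gradient has been accumulated
    during backward(), then drop the gradient so that at most one full
    gradient tensor is alive at any time."""

    def make_hook(param: torch.nn.Parameter):
        def hook(*_):
            if param.grad is None:
                return
            # LOMO additionally applies loss scaling / gradient clipping here
            # for mixed-precision stability; omitted in this sketch.
            # .data is used deliberately so the in-place update is not tracked
            # by autograd while backward() is still running.
            param.data.add_(param.grad, alpha=-lr)  # in-place SGD step
            param.grad = None                       # free the gradient immediately
        return hook

    for p in model.parameters():
        if p.requires_grad:
            # Fires once this parameter's gradient is fully accumulated
            # (PyTorch >= 2.1); older versions can approximate this with
            # p.register_hook on the gradient itself.
            p.register_post_accumulate_grad_hook(make_hook(p))

# Illustrative usage: attach the hooks once, then train without an optimizer object.
#   model = torch.nn.Linear(16, 16)
#   attach_fused_sgd_hooks(model, lr=1e-3)
#   model(torch.randn(4, 16)).sum().backward()  # parameters update inside backward()
```

Because each gradient is consumed and freed inside its own hook, peak gradient memory stays near the size of the largest single parameter tensor rather than the sum of all gradients.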

Experimental Results

The experimental setup demonstrates the substantial benefits of LOMO in terms of memory allocation, throughput, and downstream task performance:

  • Memory Profiling: Replacing AdamW with LOMO reduces the memory needed to fine-tune a 7B-parameter model from 102.20 GB to 14.58 GB, thanks to the elimination of optimizer states and stored gradient tensors (a back-of-envelope estimate follows this list).
  • Throughput: The throughput performance of LOMO significantly surpasses that of traditional optimizers. For instance, LOMO achieves an 11-fold improvement in training throughput for a 7B parameter model compared to AdamW.
  • Downstream Performance: Evaluations on SuperGLUE datasets across model scales (7B, 13B, 30B, 65B) show that full parameter fine-tuning with LOMO matches, and often exceeds, the downstream performance of parameter-efficient methods such as LoRA, particularly as model scale increases.
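As a rough sanity check on these figures, the back-of-envelope accounting below (our own estimate, ignoring activations, not the paper's exact breakdown) lands in the right ballpark for a LLaMA-7B-sized model.

```python
GiB = 2 ** 30
n_params = 6.7e9          # LLaMA-7B has roughly 6.7 billion parameters
fp16, fp32 = 2, 4         # bytes per value

# Mixed-precision AdamW: fp16 weights + fp16 gradients, plus fp32 master
# weights, momentum, and variance held as optimizer state.
adamw = n_params * (fp16 + fp16 + 3 * fp32)
print(f"AdamW weights + grads + states: ~{adamw / GiB:.0f} GiB")   # ~100 GiB

# LOMO with SGD: fp16 weights plus, at any instant, only the largest single
# gradient tensor, which is negligible next to the weights themselves.
lomo = n_params * fp16
print(f"LOMO weights + fused grad/update: ~{lomo / GiB:.0f} GiB")  # ~12 GiB
```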

Implications and Future Directions

The implications of LOMO are profound for both practical and theoretical aspects of NLP:

  • Practical Implications: LOMO drastically lowers the hardware barrier for engaging in high-quality NLP research, democratizing access to advanced LLM fine-tuning capabilities. Providing a robust mechanism to fine-tune models with billions of parameters on consumer-grade hardware can accelerate the pace of innovation and adoption in smaller research and industry settings.
  • Theoretical Implications: The performance of LOMO suggests that the smoothness of loss surfaces in large models may be a pivotal factor in the feasibility of SGD-based optimization for fine-tuning. This warrants further exploration into the optimization landscapes of LLMs and the potential for more sophisticated, yet efficient, optimization algorithms.

Conclusion

The paper "Full Parameter Fine-tuning for LLMs with Limited Resources" robustly demonstrates that full parameter fine-tuning of LLMs can be achieved within the constraints of modest hardware configurations using the LOMO optimizer. This innovation not only broadens accessibility to LLM fine-tuning but also sets a new benchmark in memory-efficient model training. Future research may explore parameter quantization techniques and further theoretical analyses to continue pushing the boundaries of LLM optimization. The LOMO approach marks a significant step towards making high-performance NLP research more inclusive and efficient.

Authors (6)
  1. Kai Lv (20 papers)
  2. Yuqing Yang (83 papers)
  3. Tengxiao Liu (7 papers)
  4. Qinghui Gao (2 papers)
  5. Qipeng Guo (72 papers)
  6. Xipeng Qiu (257 papers)
Citations (89)