Second-Order Fine-Tuning without Pain for LLMs: A Hessian Informed Zeroth-Order Optimizer (2402.15173v2)

Published 23 Feb 2024 in cs.LG

Abstract: Fine-tuning LLMs with classic first-order optimizers entails prohibitive GPU memory because of the backpropagation process. Recent works have therefore turned to zeroth-order optimizers for fine-tuning, which save substantial memory by using only two forward passes per step. However, these optimizers are plagued by the heterogeneity of parameter curvatures across different dimensions. In this work, we propose HiZOO, a diagonal-Hessian-informed zeroth-order optimizer, the first work to leverage the diagonal Hessian to enhance zeroth-order optimizers for fine-tuning LLMs. Moreover, HiZOO avoids the expensive memory cost of second-order information and adds only one extra forward pass per step. Extensive experiments on various models (350M~66B parameters) indicate that HiZOO improves model convergence, significantly reducing the number of training steps and effectively enhancing model accuracy. We also visualize the optimization trajectories of HiZOO on test functions, illustrating its effectiveness in handling heterogeneous curvatures. Lastly, we provide theoretical proofs of convergence for HiZOO. Code is publicly available at https://anonymous.4open.science/r/HiZOO27F8.
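To make the mechanism described in the abstract concrete, the sketch below shows one plausible way to combine a two-point (SPSA/MeZO-style) zeroth-order gradient estimate with a diagonal curvature estimate obtained from a single extra forward pass, and to use that curvature to rescale the update. It is a minimal illustration under assumed formulas and hyperparameters (the function `zo_hessian_informed_step`, the EMA smoothing, and the toy loss are all hypothetical), not the authors' HiZOO implementation; see the linked repository for the actual code.

```python
# A minimal, self-contained sketch: a two-point zeroth-order gradient estimate plus
# one extra forward pass used to form a diagonal curvature estimate that rescales
# the update. The exact HiZOO update rule differs; names and constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)

def loss(theta):
    # Toy objective with strongly heterogeneous curvature across dimensions.
    curv = np.array([100.0, 1.0, 0.01])
    return 0.5 * np.sum(curv * theta**2)

def zo_hessian_informed_step(theta, sigma_diag, lr=1e-2, mu=1e-3, beta=0.99):
    """One zeroth-order step using three forward passes: L(theta+mu*z), L(theta-mu*z), L(theta)."""
    z = rng.standard_normal(theta.shape)
    l_plus = loss(theta + mu * z)
    l_minus = loss(theta - mu * z)
    l_zero = loss(theta)  # the one extra forward pass

    # Two-point (SPSA-style) gradient estimate.
    grad_est = (l_plus - l_minus) / (2.0 * mu) * z
    # Three-point diagonal-curvature estimate, kept positive and smoothed by an EMA.
    hess_est = np.abs(l_plus + l_minus - 2.0 * l_zero) / (mu**2) * z**2
    sigma_diag = beta * sigma_diag + (1.0 - beta) * hess_est

    # Curvature-scaled update: high-curvature coordinates take smaller steps.
    theta = theta - lr * grad_est / (np.sqrt(sigma_diag) + 1e-8)
    return theta, sigma_diag

theta = np.ones(3)
sigma = np.ones(3)
for _ in range(200):
    theta, sigma = zo_hessian_informed_step(theta, sigma)
print("final loss:", loss(theta))  # should fall far below the initial value of 50.5
```

In this toy setting the curvature estimate keeps the step size stable along the stiff coordinate while still making progress, which is the intuition behind correcting for heterogeneous curvature without backpropagation.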

Authors (6)
  1. Yanjun Zhao (9 papers)
  2. Sizhe Dang (4 papers)
  3. Haishan Ye (41 papers)
  4. Guang Dai (38 papers)
  5. Yi Qian (23 papers)
  6. Ivor W. Tsang (109 papers)
Citations (4)