
ZO-AdaMU Optimizer: Adapting Perturbation by the Momentum and Uncertainty in Zeroth-order Optimization (2312.15184v1)

Published 23 Dec 2023 in cs.LG

Abstract: Lowering the memory requirement of full-parameter training on large models has become a hot research area. MeZO fine-tunes LLMs using only forward passes in a zeroth-order SGD optimizer (ZO-SGD), demonstrating excellent performance with the same GPU memory usage as inference. However, the simulated perturbation stochastic approximation used for gradient estimation in MeZO leads to severe oscillations and incurs a substantial time overhead. Moreover, without momentum regularization, MeZO shows severe over-fitting. Lastly, momentum that is decoupled from the perturbation does not improve the convergence rate of ZO-SGD. This study proposes ZO-AdaMU to resolve these problems by adapting the simulated perturbation with momentum in the stochastic approximation. Unlike existing adaptive momentum methods, we relocate momentum onto the simulated perturbation in the stochastic gradient approximation. Our convergence analysis and experiments show that this is a better way to improve convergence stability and rate in ZO-SGD. Extensive experiments demonstrate that ZO-AdaMU yields better generalization than MeZO and its momentum variants when fine-tuning LLMs across various NLP tasks.
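
To make the mechanism concrete, below is a minimal PyTorch sketch of a MeZO-style ZO-SGD step: the gradient is estimated from two forward passes at θ + εz and θ − εz, and the optional `momentum` dictionary blends a running average into the simulated perturbation z, loosely illustrating ZO-AdaMU's idea of placing momentum on the perturbation itself. The function name, hyperparameters, and the exact momentum blend are illustrative assumptions, not the paper's update rule; `loss_fn(model, batch)` is assumed to return a scalar loss tensor.

```python
import torch

def zo_sgd_step(model, loss_fn, batch, eps=1e-3, lr=1e-6, momentum=None, beta=0.9):
    """One zeroth-order (forward-pass-only) update step.

    `momentum`, if provided, is a dict of running perturbation averages keyed
    by parameter name; folding it into the fresh Gaussian noise is a rough
    illustration of momentum-on-perturbation (hypothetical, not ZO-AdaMU's
    exact rule).
    """
    params = {n: p for n, p in model.named_parameters() if p.requires_grad}

    # Simulated perturbation z: one Gaussian tensor per trainable parameter.
    z = {n: torch.randn_like(p) for n, p in params.items()}
    if momentum is not None:
        for n in z:
            prev = momentum.get(n, torch.zeros_like(z[n]))
            momentum[n] = beta * prev + (1.0 - beta) * z[n]  # running average of perturbations
            z[n] = momentum[n]                               # perturb along the smoothed direction

    with torch.no_grad():
        # Forward pass at theta + eps * z.
        for n, p in params.items():
            p.add_(eps * z[n])
        loss_plus = loss_fn(model, batch)

        # Forward pass at theta - eps * z (shift by -2*eps from the + side).
        for n, p in params.items():
            p.sub_(2.0 * eps * z[n])
        loss_minus = loss_fn(model, batch)

        # Restore the original parameters.
        for n, p in params.items():
            p.add_(eps * z[n])

        # Projected-gradient estimate and SGD-style update along z.
        g = (loss_plus - loss_minus) / (2.0 * eps)
        for n, p in params.items():
            p.sub_(lr * g * z[n])

    return loss_plus.item(), loss_minus.item()
```

Note that MeZO itself avoids materializing z by regenerating it from a shared random seed in each of the parameter sweeps; the sketch above stores z explicitly for clarity, which costs extra memory.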

Authors (8)
  1. Shuoran Jiang (5 papers)
  2. Qingcai Chen (36 papers)
  3. Youchen Pan (1 paper)
  4. Yang Xiang (187 papers)
  5. Yukang Lin (9 papers)
  6. Xiangping Wu (22 papers)
  7. Chuanyi Liu (12 papers)
  8. Xiaobao Song (2 papers)