
AdaPlus: Integrating Nesterov Momentum and Precise Stepsize Adjustment on AdamW Basis (2309.01966v2)

Published 5 Sep 2023 in cs.LG

Abstract: This paper proposes an efficient optimizer called AdaPlus, which integrates Nesterov momentum and precise stepsize adjustment on the basis of AdamW. AdaPlus combines the advantages of AdamW, Nadam, and AdaBelief and, in particular, introduces no extra hyper-parameters. We perform extensive experimental evaluations on three machine learning tasks to validate the effectiveness of AdaPlus. The results show that AdaPlus (i) performs most comparably to (and even slightly better than) SGD with momentum among all the evaluated adaptive methods on image classification tasks, and (ii) outperforms other state-of-the-art optimizers on language modeling tasks while exhibiting high stability when training GANs. The code for AdaPlus will be made available at: https://github.com/guanleics/AdaPlus.
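The abstract does not spell out the update rule, but its named ingredients suggest the general shape of the algorithm. Below is a minimal, hypothetical sketch of an AdaPlus-style step assembled from the standard formulations of AdamW (decoupled weight decay), Nadam (a Nesterov look-ahead on the first moment), and AdaBelief (a second moment that tracks the deviation of the gradient from its running mean). The function name and exact arithmetic are assumptions for illustration; consult the paper and the linked repository for the authoritative algorithm.

# Hypothetical sketch of an AdaPlus-style update, assembled from the standard
# formulations of its stated ingredients; the real AdaPlus update may differ.
import numpy as np

def adaplus_step(theta, grad, m, s, t,
                 lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-2):
    """One parameter update (t is the 1-based step count). Uses only Adam's
    usual hyper-parameters, mirroring the paper's no-extra-hyper-parameters claim."""
    beta1, beta2 = betas

    # AdamW: decoupled weight decay applied directly to the weights,
    # not folded into the gradient.
    theta = theta - lr * weight_decay * theta

    # First moment (momentum) and an AdaBelief-style second moment, which
    # tracks the variance of the gradient around its running mean.
    m = beta1 * m + (1.0 - beta1) * grad
    s = beta2 * s + (1.0 - beta2) * (grad - m) ** 2 + eps

    # Bias correction, as in Adam.
    m_hat = m / (1.0 - beta1 ** t)
    s_hat = s / (1.0 - beta2 ** t)

    # Nadam: Nesterov look-ahead on the bias-corrected first moment.
    m_nesterov = beta1 * m_hat + (1.0 - beta1) * grad / (1.0 - beta1 ** t)

    theta = theta - lr * m_nesterov / (np.sqrt(s_hat) + eps)
    return theta, m, s

# Toy usage: minimize f(x) = ||x||^2 (gradient 2x) with the sketch above.
x = np.array([1.0, -2.0]); m = np.zeros_like(x); s = np.zeros_like(x)
for t in range(1, 201):
    x, m, s = adaplus_step(x, 2.0 * x, m, s, t, lr=0.1, weight_decay=0.0)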
