AdamL: A fast adaptive gradient method incorporating loss function (2312.15295v1)

Published 23 Dec 2023 in stat.ML, cs.LG, and math.OC

Abstract: Adaptive first-order optimizers are fundamental tools in deep learning, although they may suffer from poor generalization due to nonuniform gradient scaling. In this work, we propose AdamL, a novel variant of the Adam optimizer that takes the loss function information into account to attain better generalization. We provide sufficient conditions that, together with the Polyak-Lojasiewicz inequality, ensure the linear convergence of AdamL. As a byproduct of our analysis, we prove similar convergence properties for the EAdam and AdaBelief optimizers. Experimental results on benchmark functions show that AdamL typically achieves either the fastest convergence or the lowest objective function values when compared to Adam, EAdam, and AdaBelief. This superior performance is confirmed on deep learning tasks such as training convolutional neural networks, training generative adversarial networks based on vanilla convolutional neural networks, and training long short-term memory networks. Finally, in the case of vanilla convolutional neural networks, AdamL stands out from the other Adam variants in that it does not require manual adjustment of the learning rate during the later stage of training.
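
For context on the convergence claim: the Polyak-Lojasiewicz (PL) inequality requires that, for some μ > 0, the objective f with minimum value f* satisfies ½‖∇f(θ)‖² ≥ μ(f(θ) − f*) for all θ, and linear convergence means the optimality gap contracts geometrically. The sketch below is a minimal, standard Adam update loop in Python with a hypothetical loss-dependent factor (`loss_factor`) marking one place where loss-function information could modulate the step. This is an illustrative assumption only, not the AdamL update rule from the paper; the function name `adam_like_step`, the `loss_factor` form, and the toy quadratic are invented for the example.

```python
import numpy as np

def adam_like_step(theta, grad, loss, state, lr=1e-3,
                   beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update with a hypothetical loss-dependent factor.

    The `loss_factor` below is a placeholder showing *where* loss-function
    information could enter the update; it is not the AdamL rule from the paper.
    """
    t = state["t"] + 1

    # Standard Adam moment estimates.
    m = beta1 * state["m"] + (1.0 - beta1) * grad
    v = beta2 * state["v"] + (1.0 - beta2) * grad ** 2

    # Bias correction, as in Adam.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)

    # Hypothetical hook (assumption): shrink the effective step as the loss
    # approaches zero, so no manual learning-rate decay is needed late on.
    loss_factor = loss / (loss + 1.0)

    theta = theta - lr * loss_factor * m_hat / (np.sqrt(v_hat) + eps)

    state.update(m=m, v=v, t=t)
    return theta, state


# Toy usage: minimize f(theta) = 0.5 * ||theta||^2.
theta = np.array([3.0, -2.0])
state = {"m": np.zeros_like(theta), "v": np.zeros_like(theta), "t": 0}
for _ in range(3000):
    loss = 0.5 * float(theta @ theta)
    grad = theta  # gradient of this quadratic is theta itself
    theta, state = adam_like_step(theta, grad, loss, state, lr=5e-2)
print(theta, 0.5 * float(theta @ theta))  # both far smaller than at the start
```

The intent of this hedged hook mirrors the abstract's remark about late-stage training: as the loss decreases, the effective step shrinks with it, so no manual learning-rate adjustment is needed. The paper's actual AdamL update and its PL-based analysis should be consulted for the precise formulation.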

References (26)
  1. Understanding the unstable convergence of gradient descent. In International Conference on Machine Learning, pages 247–257. PMLR, 2022.
  2. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, Aug 2017.
  3. On the convergence of a class of adam-type algorithms for non-convex optimization. In International Conference on Learning Representations, 2020.
  4. S. F. Chen and J. Goodman. An empirical study of smoothing techniques for language modeling. Computer Speech & Language, 13(4):359–394, 1999. ISSN 0885-2308. doi: https://doi.org/10.1006/csla.1999.0128.
  5. Adaptive subgradient methods for online learning and stochastic optimization. Journal of machine learning research, 12(7), 2011.
  6. S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  7. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization. Mathematical Programming, 155(1-2):267–305, 2016.
  8. Deep residual learning for image recognition. In the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  9. Gans trained by a two time-scale update rule converge to a local nash equilibrium. Advances in neural information processing systems, 30, 2017.
  10. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. Advances in neural information processing systems, 30, 2017.
  11. Densely connected convolutional networks. In the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  12. N. S. Keskar and R. Socher. Improving generalization performance by switching from adam to sgd. arXiv preprint arXiv:1712.07628, 2017.
  13. D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  14. Adaptive gradient methods with dynamic bound of learning rate. In International Conference on Learning Representations, New Orleans, Louisiana, May 2019.
  15. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.
  16. Tight dimension independent lower bound on the expected convergence rate for diminishing step sizes in sgd. Advances in Neural Information Processing Systems, 32, 2019.
  17. N. Qian. On the momentum term in gradient descent learning algorithms. Neural networks, 12(1):145–151, 1999.
  18. On the convergence of adam and beyond. In International Conference on Learning Representations, 2018.
  19. H. Robbins and S. Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
  20. K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In 3rd International Conference on Learning Representations, 2015.
  21. T. Tieleman and G. Hinton. Rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
  22. The marginal value of adaptive gradient methods in machine learning. Advances in neural information processing systems, 30, 2017.
  23. On the convergence of the gradient descent method with stochastic fixed-point rounding errors under the Polyak-Lojasiewicz inequality. arXiv preprint arXiv:2301.09511, 2023.
  24. W. Yuan and K.-X. Gao. EAdam Optimizer: How ε Impact Adam. arXiv preprint arXiv:2011.02150, 2020.
  25. Adaptive methods for nonconvex optimization. Advances in neural information processing systems, 31, 2018.
  26. Adabelief optimizer: Adapting stepsizes by the belief in observed gradients. Advances in neural information processing systems, 33:18795–18806, 2020.
Authors (2)
  1. Lu Xia (15 papers)
  2. Stefano Massei (30 papers)
Citations (3)
