Improving the Adaptive Moment Estimation (ADAM) stochastic optimizer through an Implicit-Explicit (IMEX) time-stepping approach

Published 20 Mar 2024 in cs.CE, cs.LG, cs.NA, and math.NA | (2403.13704v2)

Abstract: The Adam optimizer, often used in machine learning for neural network training, corresponds to an underlying ordinary differential equation (ODE) in the limit of very small learning rates. This work shows that the classical Adam algorithm is a first-order implicit-explicit (IMEX) Euler discretization of the underlying ODE. Employing the time discretization point of view, we propose new extensions of the Adam scheme obtained by using higher-order IMEX methods to solve the ODE. Based on this approach, we derive a new optimization algorithm for neural network training that performs better than classical Adam on several regression and classification problems.
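For reference, the classical Adam update that the abstract reinterprets as a first-order IMEX Euler step is the standard one of Kingma and Ba. The sketch below shows only that baseline update, not the paper's higher-order IMEX variants; the quadratic toy objective and the default hyperparameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One classical Adam update (Kingma & Ba); the paper views this update
    as a first-order implicit-explicit (IMEX) Euler step of a limiting ODE."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (momentum) estimate
    v = beta2 * v + (1 - beta2) * grad**2       # second-moment estimate
    m_hat = m / (1 - beta1**t)                  # bias corrections
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Illustrative toy problem (not from the paper): minimize f(x) = ||x||^2 / 2.
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 1001):
    grad = theta                                # gradient of the toy objective
    theta, m, v = adam_step(theta, m, v, grad, t)
print(theta)                                    # moves toward the minimizer at the origin
```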
