Can We Remove the Square-Root in Adaptive Gradient Methods? A Second-Order Perspective (2402.03496v10)
Abstract: Adaptive gradient optimizers like Adam(W) are the default training algorithms for many deep learning architectures, such as transformers. Their diagonal preconditioner is based on the gradient outer product, which is incorporated into the parameter update via a square root. While these methods are often motivated as approximate second-order methods, the square root represents a fundamental difference. In this work, we investigate how the behavior of adaptive methods changes when we remove the root, i.e., strengthen their second-order motivation. Surprisingly, we find that such square-root-free adaptive methods close the generalization gap to SGD on convolutional architectures, while maintaining the performance of their root-based counterparts on transformers. The second-order perspective also has practical benefits for developing non-diagonal methods that can incorporate arbitrary curvature approximations through the concept of preconditioner invariance. In contrast to root-based methods like Shampoo, root-free counterparts are fast and work well in half precision, since they do not require numerically unstable matrix root decompositions and inversions. Overall, our findings provide new insights into the development of adaptive methods and raise important questions regarding the overlooked role of adaptivity in their success. (Experiment code: https://github.com/yorkerlin/remove-the-square-root; optimizer code: https://github.com/f-dangel/sirfshampoo)
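To make the square-root distinction concrete, below is a minimal NumPy sketch, not the paper's actual algorithms; the function names and hyperparameters are illustrative assumptions. It contrasts an RMSProp/Adam-style step, which divides the gradient by the square root of the second-moment estimate, with a root-free step that divides by the estimate itself, treating it as a diagonal curvature approximation.

```python
# Illustrative sketch only (not the paper's exact update rules).
import numpy as np

def root_based_step(g, v, lr=1e-3, beta2=0.999, eps=1e-8):
    """Adam/RMSProp-style step: precondition the gradient with sqrt(v)."""
    v = beta2 * v + (1 - beta2) * g**2   # EMA of the squared gradient
    return -lr * g / (np.sqrt(v) + eps), v

def root_free_step(g, v, lr=1e-6, beta2=0.999, eps=1e-8):
    """Square-root-free step: divide by v itself, viewing the EMA of the
    gradient outer product (here, its diagonal) as a curvature estimate.
    The much smaller default learning rate is a reminder that removing the
    root changes the scale of the update, so hyperparameters do not
    transfer directly."""
    v = beta2 * v + (1 - beta2) * g**2
    return -lr * g / (v + eps), v

rng = np.random.default_rng(0)
g = rng.normal(size=5)   # a stand-in for a mini-batch gradient
v = np.ones(5)           # running second-moment estimate

step_root, _ = root_based_step(g, v)
step_free, _ = root_free_step(g, v)
print("root-based update:", step_root)
print("root-free update: ", step_free)
```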
Authors: Wu Lin, Felix Dangel, Runa Eschenhagen, Juhan Bae, Richard E. Turner, Alireza Makhzani