Flatter, faster: scaling momentum for optimal speedup of SGD (2210.16400v2)
Abstract: Commonly used optimization algorithms often show a trade-off between good generalization and fast training times. For instance, stochastic gradient descent (SGD) tends to have good generalization; however, adaptive gradient methods have superior training times. Momentum can help accelerate training with SGD, but so far there has been no principled way to select the momentum hyperparameter. Here we study training dynamics arising from the interplay between SGD with label noise and momentum in the training of overparametrized neural networks. We find that scaling the momentum hyperparameter $1-\beta$ with the learning rate to the power of $2/3$ maximally accelerates training, without sacrificing generalization. To analytically derive this result we develop an architecture-independent framework, where the main assumption is the existence of a degenerate manifold of global minimizers, as is natural in overparametrized models. Training dynamics display the emergence of two characteristic timescales that are well-separated for generic values of the hyperparameters. The maximum acceleration of training is reached when these two timescales meet, which in turn determines the scaling limit we propose. We confirm our scaling rule for synthetic regression problems (matrix sensing and the teacher-student paradigm) and classification on realistic datasets (ResNet-18 on CIFAR10, 6-layer MLP on FashionMNIST), suggesting that our scaling rule is robust to variations in architecture and dataset.
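As a concrete illustration of the proposed coupling $1-\beta \propto \eta^{2/3}$, the sketch below sets the momentum of a standard PyTorch SGD optimizer from the learning rate. The proportionality constant `c`, the clipping bounds, and the placeholder model are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of the proposed momentum/learning-rate coupling:
# 1 - beta is scaled as c * lr**(2/3). The constant `c` is an
# illustrative assumption, not a value prescribed by the paper.
import torch
from torch import nn


def momentum_from_lr(lr: float, c: float = 1.0) -> float:
    """Return beta such that 1 - beta = c * lr**(2/3), clipped to [0, 0.999]."""
    one_minus_beta = c * lr ** (2.0 / 3.0)
    return max(0.0, min(1.0 - one_minus_beta, 0.999))


model = nn.Linear(10, 1)        # placeholder model for illustration
lr = 0.05
beta = momentum_from_lr(lr)     # lr = 0.05 -> 1 - beta ~ 0.136, beta ~ 0.864
optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=beta)
```

With `c = 1` and `lr = 0.05` this gives $1-\beta \approx 0.136$, i.e. $\beta \approx 0.864$; the clipping only keeps the momentum in the valid range expected by `torch.optim.SGD`.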
Authors: Aditya Cowsik, Tankut Can, Paolo Glorioso