Leveraging Continuous Time to Understand Momentum When Training Diagonal Linear Networks (2403.05293v1)
Abstract: In this work, we investigate the effect of momentum on the optimisation trajectory of gradient descent. We leverage a continuous-time approach in the analysis of momentum gradient descent with step size $\gamma$ and momentum parameter $\beta$ that allows us to identify an intrinsic quantity $\lambda = \frac{\gamma}{(1 - \beta)^2}$ which uniquely defines the optimisation path and provides a simple acceleration rule. When training a $2$-layer diagonal linear network in an overparametrised regression setting, we characterise the recovered solution through an implicit regularisation problem. We then prove that small values of $\lambda$ help to recover sparse solutions. Finally, we give similar but weaker results for stochastic momentum gradient descent. We provide numerical experiments which support our claims.
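As a complement to the abstract, the sketch below illustrates the acceleration rule in a toy setting: heavy-ball momentum gradient descent on a $2$-layer diagonal linear network, where two pairs $(\gamma, \beta)$ sharing the same $\lambda = \gamma / (1 - \beta)^2$ are expected to trace (nearly) the same optimisation path and recover solutions of comparable sparsity. This is a minimal illustration and not the authors' code; the parametrisation $w = u \odot u - v \odot v$, the initialisation scale, the synthetic data, and all hyperparameters below are assumptions made for the example.

```python
import numpy as np

# Minimal sketch (not the paper's code): heavy-ball momentum GD on a
# 2-layer diagonal linear network in an overparametrised sparse regression.
# Parametrisation w = u*u - v*v, data, and hyperparameters are assumptions.

rng = np.random.default_rng(0)
n, d, s = 20, 50, 3                      # n samples, d features, s-sparse signal
X = rng.standard_normal((n, d))
w_star = np.zeros(d); w_star[:s] = 1.0   # sparse ground truth
y = X @ w_star                           # noiseless overparametrised regression

def loss_grad(u, v):
    """Squared loss of w = u*u - v*v and its gradients w.r.t. (u, v)."""
    w = u * u - v * v
    r = X @ w - y                        # residuals
    g_w = X.T @ r / n                    # gradient w.r.t. the effective weights w
    return 0.5 * np.mean(r ** 2), 2 * u * g_w, -2 * v * g_w

def momentum_gd(gamma, beta, alpha=0.1, steps=50_000):
    """Heavy-ball updates: theta_{k+1} = theta_k - gamma*grad + beta*(theta_k - theta_{k-1})."""
    u = np.full(d, alpha); v = np.full(d, alpha)     # small balanced initialisation
    u_prev, v_prev = u.copy(), v.copy()
    for _ in range(steps):
        _, gu, gv = loss_grad(u, v)
        u_next = u - gamma * gu + beta * (u - u_prev)
        v_next = v - gamma * gv + beta * (v - v_prev)
        u_prev, v_prev, u, v = u, v, u_next, v_next
    return u * u - v * v

# Two (gamma, beta) pairs with the same lambda = gamma / (1 - beta)**2 should
# follow nearly the same trajectory and recover similarly sparse solutions.
lam = 0.01
for beta in (0.0, 0.5):
    gamma = lam * (1 - beta) ** 2
    w_hat = momentum_gd(gamma, beta)
    support = int(np.sum(np.abs(w_hat) > 1e-3))
    print(f"beta={beta}, gamma={gamma:.4f}, lambda={gamma / (1 - beta) ** 2:.4f}, "
          f"approx support size={support}")
```

Under these assumptions, shrinking $\lambda$ (e.g. by reducing $\gamma$ at fixed $\beta$) should, per the paper's claim, bias the recovered $\hat{w}$ towards sparser solutions, while rescaling $\gamma$ and $\beta$ jointly so that $\lambda$ is unchanged should leave the recovered solution essentially the same.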