Rethinking Gauss-Newton for learning over-parameterized models (2302.02904v3)
Abstract: This work studies the global convergence and implicit bias of the Gauss-Newton (GN) method when optimizing over-parameterized one-hidden-layer networks in the mean-field regime. We first establish a global convergence result for GN in the continuous-time limit, which exhibits a faster convergence rate than gradient descent (GD) due to improved conditioning. We then perform an empirical study on a synthetic regression task to investigate the implicit bias of GN. While GN is consistently faster than GD at finding a global optimum, the learned model generalizes well on test data only when starting from random initial weights with small variance and using a small step size to slow down convergence. Specifically, our study shows that such a setting results in a hidden learning phenomenon, where the dynamics recover features with good generalization properties even though the model has sub-optimal training and test performance due to an under-optimized linear layer. This study reveals a trade-off between the convergence speed of GN and the generalization ability of the learned solution.
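For concreteness, the snippet below is a minimal sketch (not the paper's exact algorithm) of one damped Gauss-Newton step for a one-hidden-layer regression network under squared loss. The network sizes, the tanh nonlinearity, the damping `lam`, and the step size `lr` are illustrative assumptions; only the small-variance initialization and the idea of shrinking the step size to slow convergence follow the setting described in the abstract.

```python
# A minimal sketch, assuming a one-hidden-layer network f(x) = a^T tanh(W x)
# trained with squared loss on a synthetic regression task. Not the paper's
# exact algorithm; sizes, nonlinearity, damping and step size are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 50, 5, 200                       # samples, input dim, hidden width (over-parameterized)
X = rng.standard_normal((n, d))
y = np.sin(X @ rng.standard_normal(d))     # synthetic regression targets

W = 0.1 * rng.standard_normal((m, d))      # random initial weights with small variance
a = rng.standard_normal(m) / m

def forward(W, a, X):
    H = np.tanh(X @ W.T)                   # hidden activations, shape (n, m)
    return H @ a, H

def jacobian(a, X, H):
    """Jacobian of the n predictions w.r.t. all m*d + m parameters."""
    dH = 1.0 - H**2                        # tanh' at the pre-activations
    JW = (dH * a)[:, :, None] * X[:, None, :]   # d f_i / d W_{jk}, shape (n, m, d)
    return np.concatenate([JW.reshape(len(X), -1), H], axis=1)

lam, lr = 1e-3, 1.0                        # damping and step size (lr < 1 slows convergence)

pred, H = forward(W, a, X)
r = pred - y                               # residuals
J = jacobian(a, X, H)

# Dual (kernel) form of the damped GN step: delta = J^T (J J^T + lam I)^{-1} r.
# With more parameters than samples this only requires solving an n x n system.
delta = J.T @ np.linalg.solve(J @ J.T + lam * np.eye(n), r)
W -= lr * delta[:m * d].reshape(m, d)
a -= lr * delta[m * d:]

new_pred, _ = forward(W, a, X)
print("train loss after one GN step:", 0.5 * np.mean((new_pred - y) ** 2))
```

The dual form used above, J^T (J J^T + lam I)^{-1} r, equals the more familiar damped update (J^T J + lam I)^{-1} J^T r by a standard matrix identity; it is simply the convenient parameterization when the number of parameters greatly exceeds the number of samples.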