Rethinking Gauss-Newton for learning over-parameterized models (2302.02904v3)

Published 6 Feb 2023 in cs.LG and math.OC

Abstract: This work studies the global convergence and implicit bias of the Gauss-Newton (GN) method when optimizing over-parameterized one-hidden-layer networks in the mean-field regime. We first establish a global convergence result for GN in the continuous-time limit, exhibiting a faster convergence rate than gradient descent (GD) owing to improved conditioning. We then perform an empirical study on a synthetic regression task to investigate the implicit bias of the GN method. While GN is consistently faster than GD at finding a global optimum, the learned model generalizes well on test data when starting from random initial weights with small variance and using a small step size to slow down convergence. Specifically, our study shows that such a setting results in a hidden learning phenomenon, where the dynamics recover features with good generalization properties even though the model has sub-optimal training and test performance due to an under-optimized linear layer. This study exhibits a trade-off between the convergence speed of GN and the generalization ability of the learned solution.
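Below is a minimal, illustrative sketch of a damped (Levenberg-Marquardt-style) Gauss-Newton update for a one-hidden-layer network with mean-field 1/m scaling on a synthetic regression problem. It is not the paper's exact continuous-time analysis or experimental setup: the activation, damping level `lam`, step size `lr`, problem sizes, and small-variance initialization are assumptions chosen only to mirror the setting described in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative over-parameterized regression setup (all sizes are assumptions).
n, d, m = 50, 5, 200                       # samples, input dim, hidden width (m*d + m >> n)
X = rng.standard_normal((n, d))
y = np.sin(X @ rng.standard_normal(d))     # synthetic targets

sigma  = np.tanh                           # activation
dsigma = lambda z: 1.0 - np.tanh(z) ** 2   # its derivative

# Small-variance random initialization, as in the abstract's empirical study.
W = 0.1 * rng.standard_normal((m, d))      # hidden-layer weights
a = 0.1 * rng.standard_normal(m)           # linear (output) layer weights

def forward(W, a):
    Z = X @ W.T                            # (n, m) pre-activations
    return Z, (sigma(Z) @ a) / m           # mean-field 1/m scaling

def jacobian(W, a, Z):
    """Jacobian of the n predictions w.r.t. (a, W), shape (n, m + m*d)."""
    J_a = sigma(Z) / m                     # d f_i / d a_j
    G = (dsigma(Z) * a) / m                # (n, m): d f_i / d (w_j . x_i)
    J_W = (G[:, :, None] * X[:, None, :]).reshape(n, m * d)
    return np.hstack([J_a, J_W])

lam, lr = 1e-3, 0.1                        # damping and step size (assumed values)
for step in range(200):
    Z, f = forward(W, a)
    r = f - y                              # residuals
    J = jacobian(W, a, Z)
    # Over-parameterized case: solve the small n x n system (J J^T + lam I) u = r,
    # then take the damped Gauss-Newton step delta = -J^T u.
    u = np.linalg.solve(J @ J.T + lam * np.eye(n), r)
    delta = -J.T @ u
    a += lr * delta[:m]
    W += lr * delta[m:].reshape(m, d)

print("final training MSE:", float(np.mean((forward(W, a)[1] - y) ** 2)))
```

Because the model has far more parameters than samples, the sketch computes the step through the n x n system (J J^T + lam I) u = r rather than the much larger parameter-space normal equations, using the identity J^T (J J^T + lam I)^{-1} = (J^T J + lam I)^{-1} J^T; a small `lr` and small-variance initialization correspond to the regime the abstract associates with better generalization.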
