Quadratic models for understanding catapult dynamics of neural networks (2205.11787v3)

Published 24 May 2022 in cs.LG, math.OC, and stat.ML

Abstract: While neural networks can be approximated by linear models as their width increases, certain properties of wide neural networks cannot be captured by linear models. In this work we show that recently proposed Neural Quadratic Models can exhibit the "catapult phase" [Lewkowycz et al. 2020] that arises when training such models with large learning rates. We then empirically show that the behaviour of neural quadratic models parallels that of neural networks in generalization, especially in the catapult phase regime. Our analysis further demonstrates that quadratic models can be an effective tool for analysis of neural networks.
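
The abstract invokes the catapult phase without spelling out the mechanics, so a minimal sketch may help; it is not taken from the paper. A two-layer linear model f(u, v) = u*v trained by gradient descent on a single example with target 0 and squared loss is exactly quadratic in its parameters, and it already separates the three learning-rate regimes associated with the catapult mechanism: for lr < 2/lambda_0 the loss decreases monotonically (the lazy regime), for 2/lambda_0 < lr < 4/lambda_0 the loss first grows while the kernel analogue u^2 + v^2 shrinks before training converges (the catapult), and for lr > 4/lambda_0 training diverges. The initialization, step budget, and learning-rate multiples below are illustrative choices, not values from the paper.

    # Minimal sketch (not the paper's experiments): gradient descent on the
    # two-layer linear model f(u, v) = u * v with one target-zero example.
    # The model is exactly quadratic in its parameters, and L = 0.5 * f**2
    # already shows the lazy, catapult, and divergent learning-rate regimes.
    import math

    def run_gd(lr, u0=2.0, v0=0.5, steps=200):
        u, v = u0, v0
        losses, kernels = [], []
        for _ in range(steps):
            f = u * v                        # model output on the single example
            losses.append(0.5 * f ** 2)
            kernels.append(u ** 2 + v ** 2)  # tangent-kernel analogue: lambda = u^2 + v^2
            if not math.isfinite(f) or abs(f) > 1e12:
                break                        # stop once the divergent run blows up
            u, v = u - lr * v * f, v - lr * u * f   # gradient step on 0.5 * f^2
        return losses, kernels

    if __name__ == "__main__":
        lam0 = 2.0 ** 2 + 0.5 ** 2           # initial "kernel" lambda_0 for (u0, v0) = (2.0, 0.5)
        for scale, label in [(1.0, "lazy      (lr = 1.0/lambda_0)"),
                             (3.0, "catapult  (lr = 3.0/lambda_0)"),
                             (4.5, "divergent (lr = 4.5/lambda_0)")]:
            losses, kernels = run_gd(lr=scale / lam0)
            print(f"{label}: max loss {max(losses):.3e}, "
                  f"final loss {losses[-1]:.3e}, final kernel {kernels[-1]:.3f}")

Running the script prints the maximum loss, final loss, and final kernel value for each regime; the catapult run is the one whose loss overshoots its initial value and still converges, ending with a markedly smaller kernel than at initialization.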

References (30)
  1. Yu Bai and Jason D. Lee. Beyond linearization: On quadratic and higher-order approximation of wide neural networks. In International Conference on Learning Representations, 2019.
  2. Lyapunov theory for discrete time systems. arXiv preprint arXiv:1809.05289, 2018.
  3. Lénaïc Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 32, 2019.
  4. Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pp. 1675–1685, 2019.
  5. Simon S. Du, Xiyu Zhai, Barnabás Póczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
  6. Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M. Roy, and Surya Ganguli. Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the neural tangent kernel. Advances in Neural Information Processing Systems, 33:5850–5861, 2020.
  7. Antonio Gulli. AG’s corpus of news articles. http://groups.di.unipi.it/~gulli/AG_corpus_of_news_articles.html.
  8. Jiaoyang Huang and Horng-Tzer Yau. Dynamics of deep neural networks and neural tangent hierarchy. In International Conference on Machine Learning, pp. 4542–4551. PMLR, 2020.
  9. Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, pp. 8571–8580, 2018.
  10. Jakobovski. Free-Spoken-Digit-Dataset. https://github.com/Jakobovski/free-spoken-digit-dataset.
  11. Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks. In International Conference on Learning Representations, 2019.
  12. Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
  13. Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  14. Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In Advances in Neural Information Processing Systems, pp. 8570–8581, 2019.
  15. Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, and Guy Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism. arXiv preprint arXiv:2003.02218, 2020.
  16. Chaoyue Liu, Libin Zhu, and Mikhail Belkin. On the linearity of large non-linear models: when and why the tangent kernel is constant. Advances in Neural Information Processing Systems, 33, 2020.
  17. Philip M. Long. Properties of the after kernel. arXiv preprint arXiv:2105.10585, 2021.
  18. David Meltzer and Junyu Liu. Catapult dynamics and phase transitions in quadratic nets. arXiv preprint arXiv:2301.07737, 2023.
  19. Andrea Montanari and Yiqiao Zhong. The interpolation phase transition in neural networks: Memorization and generalization under lazy training. arXiv preprint arXiv:2007.12826, 2020.
  20. Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.
  21. Increasing depth leads to U-shaped test risk in over-parameterized convolutional networks. In International Conference on Machine Learning Workshop on Over-parameterization: Pitfalls and Opportunities, 2021.
  22. Guillermo Ortiz-Jiménez, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. What can linearized neural networks actually say about generalization? Advances in Neural Information Processing Systems, 34, 2021.
  23. Boris T. Polyak. Introduction to optimization. Optimization Software, Inc., New York, 1987.
  24. Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.
  25. Daniel A. Roberts, Sho Yaida, and Boris Hanin. The principles of deep learning theory. Cambridge University Press, 2022.
  26. Pedro Savarese, Itay Evron, Daniel Soudry, and Nathan Srebro. How do infinite width bounded norm networks look in function space? In Conference on Learning Theory, pp. 2667–2690. PMLR, 2019.
  27. Francis Williams, Matthew Trager, Daniele Panozzo, Claudio Silva, Denis Zorin, and Joan Bruna. Gradient dynamics of shallow univariate ReLU networks. Advances in Neural Information Processing Systems, 32, 2019.
  28. Greg Yang and Edward J. Hu. Feature learning in infinite-width neural networks. arXiv preprint arXiv:2011.14522, 2020.
  29. Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George E. Dahl, Christopher J. Shallue, and Roger Grosse. Which algorithmic choices matter at which batch sizes? Insights from a noisy quadratic model. Advances in Neural Information Processing Systems, 32, 2019.
  30. Difan Zou and Quanquan Gu. An improved analysis of training over-parameterized deep neural networks. In Advances in Neural Information Processing Systems, pp. 2053–2062, 2019.
Authors (4)
  1. Libin Zhu (11 papers)
  2. Chaoyue Liu (23 papers)
  3. Adityanarayanan Radhakrishnan (22 papers)
  4. Mikhail Belkin (76 papers)
Citations (12)
