Super Consistency of Neural Network Landscapes and Learning Rate Transfer (2402.17457v2)
Abstract: Recently, there has been growing evidence that if the width and depth of a neural network are scaled toward the so-called rich feature learning limit ($\mu$P and its depth extension), then some hyperparameters -- such as the learning rate -- exhibit transfer from small to very large models. From an optimization perspective, this phenomenon is puzzling, as it implies that the loss landscape is consistently similar across very different model sizes. In this work, we study the landscape through the lens of the loss Hessian, with a focus on its largest eigenvalue (i.e., the sharpness), and find that certain spectral properties under $\mu$P are largely independent of the size of the network and remain consistent as training progresses. We name this property Super Consistency of the landscape. On the other hand, we show that in the Neural Tangent Kernel (NTK) and other scaling regimes, the sharpness exhibits very different dynamics at different scales. But what causes these differences in the sharpness dynamics? Through a connection between the Hessian's and the NTK's spectra, we argue that the cause lies in the presence (for $\mu$P) or progressive absence (for the NTK scaling) of feature learning. We corroborate our claims with a substantial suite of experiments covering a wide range of datasets and architectures: from ResNets and Vision Transformers trained on benchmark vision datasets to Transformer-based LLMs trained on WikiText.
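The central quantity in the abstract is the sharpness, the largest eigenvalue of the training-loss Hessian, tracked as the model is scaled up. As a rough illustration of how such a quantity can be estimated without ever forming the Hessian, the sketch below runs power iteration on Hessian-vector products in PyTorch. This is not the paper's measurement code; the function name `top_hessian_eigenvalue` and the `model`, `loss_fn`, `x`, `y` arguments are placeholders, and libraries such as PyHessian implement the same idea with more care.

```python
import torch


def top_hessian_eigenvalue(model, loss_fn, x, y, iters=20, tol=1e-4):
    """Estimate the sharpness (largest loss-Hessian eigenvalue) by power iteration.

    Assumes a single-input model whose gradient graph is differentiable
    (the usual setting for Hessian-vector products).
    """
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), y)
    # create_graph=True keeps the graph of the gradient so it can be differentiated again.
    grads = torch.autograd.grad(loss, params, create_graph=True)

    # Random unit starting direction, one block per parameter tensor.
    v = [torch.randn_like(p) for p in params]
    v_norm = torch.sqrt(sum((b * b).sum() for b in v))
    v = [b / v_norm for b in v]

    eigval = None
    for _ in range(iters):
        # Hessian-vector product: differentiate (grad . v) with respect to the parameters.
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        rayleigh = sum((a * b).sum() for a, b in zip(hv, v)).item()
        hv_norm = torch.sqrt(sum((b * b).sum() for b in hv))
        v = [b / hv_norm for b in hv]
        if eigval is not None and abs(rayleigh - eigval) <= tol * (abs(rayleigh) + 1e-12):
            eigval = rayleigh
            break
        eigval = rayleigh
    return eigval
```

Tracking this estimate over training for models of increasing width (and depth) is, loosely, the kind of measurement the abstract describes: under $\mu$P the sharpness curves are claimed to stay close across scales, whereas under NTK-style scaling they diverge.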
- Lorenzo Noci
- Alexandru Meterez
- Thomas Hofmann
- Antonio Orvieto