Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit (2309.16620v2)
Abstract: The cost of hyperparameter tuning in deep learning has been rising with model size, prompting practitioners to seek tuning methods that use smaller networks as proxies. One such proposal uses $\mu$P-parameterized networks, where the optimal hyperparameters of small-width networks transfer to networks of arbitrarily large width. However, in this scheme, hyperparameters do not transfer across depth. As a remedy, we study residual networks with a residual branch scale of $1/\sqrt{\text{depth}}$ in combination with the $\mu$P parameterization. We provide experiments demonstrating that residual architectures, including convolutional ResNets and Vision Transformers, trained with this parameterization exhibit transfer of optimal hyperparameters across width and depth on CIFAR-10 and ImageNet. Furthermore, our empirical findings are supported and motivated by theory. Using recent developments in the dynamical mean field theory (DMFT) description of neural network learning dynamics, we show that this parameterization of ResNets admits a well-defined feature-learning joint infinite-width and infinite-depth limit, and we show convergence of the finite-size network dynamics towards this limit.
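To make the parameterization concrete, below is a minimal sketch of a residual MLP whose residual branches are scaled by $1/\sqrt{\text{depth}}$, combined with a simplified $\mu$P-style width scaling. This is an illustrative reading of the abstract, not the authors' code: the class and attribute names are invented, and the full $\mu$P recipe additionally prescribes width-dependent readout multipliers and per-layer learning rates, which are omitted here.

```python
# Illustrative sketch (assumed names, simplified muP): a residual MLP whose
# residual branches carry a 1/sqrt(depth) multiplier, so that increasing the
# depth does not blow up the forward or backward pass.
import math
import torch
import torch.nn as nn

class ResidualMLP(nn.Module):
    def __init__(self, width: int, depth: int, in_dim: int, out_dim: int):
        super().__init__()
        self.read_in = nn.Linear(in_dim, width, bias=False)
        self.blocks = nn.ModuleList(
            [nn.Linear(width, width, bias=False) for _ in range(depth)]
        )
        self.read_out = nn.Linear(width, out_dim, bias=False)
        # Simplified muP-style initialization: hidden weights at fan-in scale,
        # readout initialized at zero (one common muP convention).
        for block in self.blocks:
            nn.init.normal_(block.weight, std=1.0 / math.sqrt(width))
        nn.init.zeros_(self.read_out.weight)
        # The 1/sqrt(depth) residual branch scale studied in the paper.
        self.branch_scale = 1.0 / math.sqrt(depth)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.read_in(x)
        for block in self.blocks:
            h = h + self.branch_scale * torch.relu(block(h))
        return self.read_out(h)
```

Under this scaling, the intended tuning protocol is the one the abstract describes: sweep hyperparameters (e.g. the learning rate) on a small-width, small-depth proxy, then reuse the optimum when both width and depth are scaled up.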
Authors: Blake Bordelon, Lorenzo Noci, Mufan Bill Li, Boris Hanin, Cengiz Pehlevan