Differential Equation Scaling Limits of Shaped and Unshaped Neural Networks (2310.12079v2)
Abstract: Recent analyses of neural networks with shaped activations (i.e., the activation function is scaled as the network size grows) have led to scaling limits described by differential equations. However, these results do not a priori tell us anything about "ordinary" unshaped networks, where the activation is unchanged as the network size grows. In this article, we find a similar differential-equation-based asymptotic characterization for two types of unshaped networks. Firstly, we show that the following two architectures converge to the same infinite-depth-and-width limit at initialization: (i) a fully connected ResNet with a $d^{-1/2}$ factor on the residual branch, where $d$ is the network depth, and (ii) a multilayer perceptron (MLP) with depth $d \ll$ width $n$ and shaped ReLU activation at rate $d^{-1/2}$. Secondly, for an unshaped MLP at initialization, we derive the first-order asymptotic correction to the layerwise correlation. In particular, if $\rho_\ell$ is the correlation at layer $\ell$, then $q_t = \ell^2 (1 - \rho_\ell)$ with $t = \frac{\ell}{n}$ converges to an SDE with a singularity at $t=0$. Together, these results provide a connection between shaped and unshaped network architectures, and open up the possibility of studying the effect of normalization methods and how it connects with shaping activation functions.
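A minimal sketch (not the paper's code) of the second result's setup: it tracks the layerwise correlation $\rho_\ell$ of an unshaped ReLU MLP at initialization and forms the rescaled quantity $q_t = \ell^2 (1 - \rho_\ell)$ with $t = \ell/n$, which the abstract states converges to an SDE with a singularity at $t = 0$. The specific width and depth, He-initialized Gaussian weights, and cosine similarity of post-activations as the notion of "correlation" are illustrative assumptions, not choices taken from the paper.

```python
# Hypothetical illustration: empirical layerwise correlation of an
# unshaped ReLU MLP at initialization, and the rescaling q_t = ell^2 (1 - rho_ell).
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 50  # width n, depth d (d << n); illustrative values

# Two inputs with a prescribed initial correlation of 0.5.
x = rng.standard_normal(n)
y = 0.5 * x + np.sqrt(1 - 0.5**2) * rng.standard_normal(n)

def correlation(u, v):
    """Cosine similarity between two layer representations."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

q = []
for ell in range(1, d + 1):
    W = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)  # He init for ReLU
    x = np.maximum(W @ x, 0.0)                          # unshaped ReLU layer
    y = np.maximum(W @ y, 0.0)
    rho = correlation(x, y)                             # rho_ell
    t = ell / n
    q.append((t, ell**2 * (1.0 - rho)))                 # q_t = ell^2 (1 - rho_ell)

for t, q_t in q[:5]:
    print(f"t = {t:.3f}, q_t = {q_t:.4f}")
```

A single run gives one sample path; averaging over many random initializations (and letting $n, d$ grow with $d \ll n$) is how one would probe the limiting SDE behaviour described above.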