Commutative Width and Depth Scaling in Deep Neural Networks (2310.01683v1)
Abstract: This paper is the second in the series Commutative Scaling of Width and Depth (WD), which studies the commutativity of infinite-width and infinite-depth limits in deep neural networks. Our aim is to understand the behaviour of neural functions (functions that depend on a neural network model) as width and depth go to infinity (in some sense), and eventually to identify settings under which commutativity holds, i.e. the neural function tends to the same limit no matter how the width and depth limits are taken. In this paper, we formally introduce and define the commutativity framework and discuss its implications for neural network design and scaling. We study commutativity for the neural covariance kernel, which reflects how network layers separate data. Our findings extend previous results established in [55] by showing that taking the width and depth to infinity in a deep neural network with skip connections, when the branches are suitably scaled to avoid exploding behaviour, results in the same covariance structure no matter how that limit is taken. This has a number of theoretical and practical implications that we discuss in the paper. The proof techniques in this paper are novel and rely on tools that are more accessible to readers who are not familiar with stochastic calculus (used in the proofs of WD(I)).
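To make the commutativity claim concrete, the following sketch (not taken from the paper) gives a Monte-Carlo estimate of the neural covariance kernel ⟨y_L(x), y_L(x′)⟩ / n of a residual network at initialization. The ReLU activation, Gaussian initialization, 1/√depth branch scaling, the example inputs, and the function names `resnet_covariance` and `avg_covariance` are all illustrative assumptions, not the paper's exact setup; comparing a width-dominated, a depth-dominated, and a joint scaling regime merely illustrates the kind of limit-order independence the paper studies.

```python
import numpy as np

def resnet_covariance(x1, x2, width, depth, seed=0):
    """One Monte-Carlo sample of the covariance kernel <y_L(x1), y_L(x2)> / width
    for a ReLU residual network at initialization,
        y_{l+1} = y_l + (1/sqrt(depth)) * W_{l+1} relu(y_l),
    with i.i.d. Gaussian weights W_ij ~ N(0, 1/width) (an assumed, illustrative setup)."""
    rng = np.random.default_rng(seed)
    d_in = len(x1)
    # Shared input projection: both inputs go through the same random network.
    W0 = rng.normal(0.0, 1.0 / np.sqrt(d_in), size=(width, d_in))
    y1, y2 = W0 @ x1, W0 @ x2
    scale = 1.0 / np.sqrt(depth)  # branch scaling used here to keep the kernel bounded
    for _ in range(depth):
        W = rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width))
        y1 = y1 + scale * W @ np.maximum(y1, 0.0)
        y2 = y2 + scale * W @ np.maximum(y2, 0.0)
    return float(y1 @ y2) / width

def avg_covariance(x1, x2, width, depth, n_samples=10):
    """Average over independent weight draws; the estimate also concentrates as width grows."""
    return float(np.mean([resnet_covariance(x1, x2, width, depth, seed=s)
                          for s in range(n_samples)]))

if __name__ == "__main__":
    x1 = np.array([1.0, 0.0])
    x2 = np.array([0.6, 0.8])
    # Width-dominated, depth-dominated, and joint regimes: if the limits commute,
    # the estimates should approach a common value as both parameters grow.
    for width, depth in [(1024, 16), (16, 1024), (256, 256)]:
        print(width, depth, avg_covariance(x1, x2, width, depth))
```

In this sketch the only design choice that matters for boundedness is the 1/√depth scaling of the residual branch; dropping it makes the kernel grow with depth, which is the "exploding behaviour" the abstract refers to.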
- E. Fehlberg “Classical Fifth-, Sixth-, Seventh-, and Eighth-Order Runge-Kutta Formulas with Stepsize Control” In NASA Technical Report, 1968
- R.M. Neal “Bayesian Learning for Neural Networks” Springer Science & Business Media, 1995
- John Butcher “Numerical Methods for Ordinary Differential Equations”, 2003
- “Kernel Methods for Deep Learning” In Advances in Neural Information Processing Systems, 2009
- “Exponential expressivity in deep neural networks through transient chaos” In 30th Conference on Neural Information Processing Systems, 2016
- “Deep Information Propagation” In International Conference on Learning Representations, 2017
- “Mean field residual networks: On the edge of chaos” In Advances in Neural Information Processing Systems, 2017, pp. 7103–7114
- “Out-of-equilibrium dynamical mean-field equations for the perceptron model” In Journal of Physics A: Mathematical and Theoretical 51.8 IOP Publishing, 2018, pp. 085002
- “Deep Neural Networks as Gaussian Processes” In International Conference on Learning Representations, 2018
- “Gaussian Process Behaviour in Wide Deep Neural Networks” In International Conference on Learning Representations, 2018
- Dyego Araújo, Roberto I Oliveira and Daniel Yukimura “A mean-field limit for certain deep neural networks” In arXiv preprint arXiv:1906.00193, 2019
- “Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks” In Proceedings of the 36th International Conference on Machine Learning 97, Proceedings of Machine Learning Research PMLR, 2019, pp. 322–332
- Boris Hanin “Universal Function Approximation by Deep Neural Nets with Bounded Width and ReLU Activations” In Mathematics 7.10, 2019
- “Products of Many Large Random Matrices and Gradients in Deep Neural Networks” In Communications in Mathematical Physics 376.1 Springer Science+Business Media LLC, 2019, pp. 287–322
- S. Hayou, A. Doucet and J. Rousseau “On the Impact of the Activation Function on Deep Neural Networks Training” In International Conference on Machine Learning, 2019
- Soufiane Hayou, Arnaud Doucet and Judith Rousseau “Training dynamics of deep networks using stochastic gradient descent via neural tangent kernel” arXiv, 2019
- Song Mei, Theodor Misiakiewicz and Andrea Montanari “Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit” In Conference on Learning Theory, 2019, pp. 2388–2464 PMLR
- “Understanding Priors in Bayesian Neural Networks at the Unit Level” In Proceedings of the 36th International Conference on Machine Learning 97, Proceedings of Machine Learning Research PMLR, 2019, pp. 6458–6467
- G. Yang “Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation” In arXiv preprint arXiv:1902.04760, 2019
- G. Yang “Tensor Programs I: Wide feedforward or recurrent neural networks of any architecture are Gaussian processes” In arXiv preprint arXiv:1910.12478, 2019
- “A fine-grained spectral perspective on neural networks” In arXiv preprint arXiv:1907.10599, 2019
- “Finite Depth and Width Corrections to the Neural Tangent Kernel” In International Conference on Learning Representations, 2020
- S. Hayou, A. Doucet and J. Rousseau “Mean-field Behaviour of Neural Tangent Kernel for Deep Neural Networks” In arXiv preprint arXiv:1905.13654, 2020
- Bobby He, Balaji Lakshminarayanan and Yee Whye Teh “Bayesian Deep Ensembles via the Neural Tangent Kernel” In Advances in Neural Information Processing Systems 33 Curran Associates, Inc., 2020, pp. 1010–1022
- “Infinite attention: NNGP and NTK for deep attention networks” In Proceedings of the 37th International Conference on Machine Learning 119, Proceedings of Machine Learning Research PMLR, 2020, pp. 4376–4386
- Phan-Minh Nguyen and Huy Tuan Pham “A rigorous framework for the mean field limit of multilayer neural networks” In arXiv preprint arXiv:2001.11443, 2020
- “Infinitely deep neural networks as diffusion processes” In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics 108, Proceedings of Machine Learning Research PMLR, 2020, pp. 1126–1136
- Lechao Xiao, Jeffrey Pennington and Samuel Schoenholz “Disentangling Trainability and Generalization in Deep Neural Networks” In Proceedings of the 37th International Conference on Machine Learning 119, Proceedings of Machine Learning Research PMLR, 2020, pp. 10462–10472
- G. Yang “Tensor Programs III: Neural Matrix Laws” In arXiv preprint arXiv:2009.10685, 2020
- “Modeling from features: a mean-field framework for over-parameterized deep neural networks” In Conference on Learning Theory, 2021, pp. 1887–1936 PMLR
- “Robust Pruning at Initialization” In International Conference on Learning Representations, 2021
- “Regularization in ResNet with Stochastic Depth” In Proceedings of the Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS), 2021
- “Stable ResNet” In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics 130, Proceedings of Machine Learning Research PMLR, 2021, pp. 1324–1332
- Mufan Li, Mihai Nica and Dan Roy “The future is log-Gaussian: ResNets and their infinite-depth-and-width limit at initialization” In Advances in Neural Information Processing Systems 34 Curran Associates, Inc., 2021, pp. 7852–7864
- “Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping” In arXiv preprint arXiv:2110.01765, 2021
- “Precise characterization of the prior predictive distribution of deep ReLU networks” In Advances in Neural Information Processing Systems, 2021 URL: https://openreview.net/forum?id=DTA7Bgrai-Q
- “Tensor Programs IV: Feature Learning in Infinite-Width Neural Networks” In ICML 2021, 2021
- “Exact marginal prior distributions of finite Bayesian neural networks” In Advances in Neural Information Processing Systems 34 Curran Associates, Inc., 2021, pp. 3364–3375 URL: https://proceedings.neurips.cc/paper_files/paper/2021/file/1baff70e2669e8376347efd3a874a341-Paper.pdf
- “Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks”, 2022 arXiv:2205.09653 [stat.ML]
- Boris Hanin “Correlation Functions in Random Fully Connected Neural Networks at Finite Width” arXiv, 2022
- Soufiane Hayou “On the infinite-depth limit of finite-width neural networks” In Transactions on Machine Learning Research, 2022
- Soufiane Hayou, Arnaud Doucet and Judith Rousseau “The Curse of Depth in Kernel Regime” In Proceedings of the “I (Still) Can’t Believe It’s Not Better!” Workshop at NeurIPS 2021 163, Proceedings of Machine Learning Research PMLR, 2022, pp. 41–47
- Arthur Jacot “Theory of Deep Learning: Neural Tangent Kernel and Beyond” In PhD Thesis, École Polytechnique Fédérale de Lausanne, 2022 URL: https://infoscience.epfl.ch/record/295831/files/EPFL_TH9825.pdf
- “Freeze and Chaos: NTK views on DNN Normalization, Checkerboard and Boundary Artifacts” In Proceedings of Mathematical and Scientific Machine Learning 190, Proceedings of Machine Learning Research PMLR, 2022, pp. 257–270
- Mufan Bill Li, Mihai Nica and Daniel M. Roy “The Neural Covariance SDE: Shaped Infinite Depth-and-Width Networks at Initialization” arXiv, 2022
- “Connecting Optimization and Generalization via Gradient Flow Path Length” arXiv, 2022
- Yizhang Lou, Chris E Mingard and Soufiane Hayou “Feature Learning and Signal Propagation in Deep Neural Networks” In Proceedings of the 39th International Conference on Machine Learning, 2022, pp. 14248–14282
- “Analyzing Finite Neural Networks: Can We Trust Neural Tangent Kernel Theory?” In Proceedings of the 2nd Mathematical and Scientific Machine Learning Conference 145, Proceedings of Machine Learning Research PMLR, 2022, pp. 868–895
- “Mean field analysis of deep neural networks” In Mathematics of Operations Research 47.1 INFORMS, 2022, pp. 120–152
- “Gaussian Pre-Activations in Neural Networks: Myth or Reality?” arXiv, 2022
- Greg Yang, Michael Santacroce and Edward J Hu “Efficient Computation of Deep Nonlinear Infinite-Width Neural Networks that Learn Features” In International Conference on Learning Representations, 2022
- Guodong Zhang, Aleksandar Botev and James Martens “Deep Learning without Shortcuts: Shaping the Kernel with Tailored Rectifiers” In International Conference on Learning Representations, 2022
- “The Influence of Learning Rule on Representation Dynamics in Wide Neural Networks”, 2023 arXiv:2210.02157 [stat.ML]
- Nicola Muca Cirone, Maud Lemercier and Cristopher Salvi “Neural signature kernels as infinite-width-depth-limits of controlled ResNets”, 2023 arXiv:2303.17671 [math.DS]
- “Width and Depth Limits Commute in Residual Networks” In International Conference on Machine Learning, 2023 URL: https://api.semanticscholar.org/CorpusID:256459595
- “Depth Degeneracy in Neural Networks: Vanishing Angles in Fully Connected ReLU Networks on Initialization”, 2023 arXiv:2302.09712 [stat.ML]
- “The Shaped Transformer: Attention Models in the Infinite Depth-and-Width Limit”, 2023