
Generalization of Scaled Deep ResNets in the Mean-Field Regime (2403.09889v1)

Published 14 Mar 2024 in cs.LG

Abstract: Despite the widespread empirical success of ResNet, the generalization properties of deep ResNet are rarely explored beyond the lazy training regime. In this work, we investigate \emph{scaled} ResNet in the limit of infinitely deep and wide neural networks, whose gradient flow is described by a partial differential equation in the large-network limit, i.e., the \emph{mean-field} regime. To derive generalization bounds in this setting, our analysis requires a shift from the conventional time-invariant Gram matrix used in the lazy training regime to a time-variant, distribution-dependent version. To this end, we provide a global lower bound on the minimum eigenvalue of the Gram matrix in the mean-field regime. In addition, to track the dynamics of the Kullback-Leibler (KL) divergence, we establish linear convergence of the empirical error and derive an upper bound on the KL divergence over the parameter distribution. Finally, we obtain uniform convergence for the generalization bound via Rademacher complexity. Our results offer new insights into the generalization ability of deep ResNet beyond the lazy training regime and contribute to advancing the understanding of the fundamental properties of deep neural networks.
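
To make the setup concrete, the sketch below shows one common way a depth-scaled, mean-field-parameterized ResNet is written in PyTorch: each residual block averages over its width-many neurons (a 1/m factor) and each update is damped by the depth (a 1/L factor), so the forward pass approaches a continuous flow as depth and width grow. The class names (MeanFieldBlock, ScaledResNet) and the exact scaling are illustrative assumptions for exposition, not the authors' implementation.

import torch
import torch.nn as nn

class MeanFieldBlock(nn.Module):
    """One residual block in mean-field parameterization:
    f(x) = (1/m) * sum_i a_i * relu(w_i . x). (Assumed form, not the paper's code.)"""
    def __init__(self, dim: int, width: int):
        super().__init__()
        self.w = nn.Parameter(torch.randn(width, dim))  # first-layer weights
        self.a = nn.Parameter(torch.randn(width, dim))  # second-layer weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(x @ self.w.t())          # (batch, width)
        return h @ self.a / self.a.shape[0]     # 1/m average over neurons

class ScaledResNet(nn.Module):
    """Residual stack with 1/L depth scaling: x_{l+1} = x_l + (1/L) * f_l(x_l)."""
    def __init__(self, dim: int, width: int, depth: int):
        super().__init__()
        self.depth = depth
        self.blocks = nn.ModuleList([MeanFieldBlock(dim, width) for _ in range(depth)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # O(1/L) updates make the forward pass an Euler-like discretization of a
            # continuous-depth flow; letting width grow as well yields the
            # distribution-over-parameters (mean-field) description.
            x = x + block(x) / self.depth
        return x

# Toy usage:
# net = ScaledResNet(dim=16, width=256, depth=32)
# out = net(torch.randn(8, 16))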
