A Survey on Statistical Theory of Deep Learning: Approximation, Training Dynamics, and Generative Models (2401.07187v3)

Published 14 Jan 2024 in stat.ML, cs.LG, math.ST, and stat.TH

Abstract: In this article, we review the literature on statistical theories of neural networks from three perspectives: approximation, training dynamics, and generative models. In the first part, results on excess risks for neural networks are reviewed in the nonparametric framework of regression (and classification in Appendix B). These results rely on explicit constructions of neural networks, leading to fast convergence rates of excess risks. Nonetheless, their underlying analysis only applies to the global minimizer in the highly non-convex landscape of deep neural networks. This motivates us to review the training dynamics of neural networks in the second part. Specifically, we review papers that attempt to answer "how the neural network trained via gradient-based methods finds the solution that can generalize well on unseen data." In particular, two well-known paradigms are reviewed: the Neural Tangent Kernel (NTK) paradigm and the Mean-Field (MF) paradigm. Last but not least, we review the most recent theoretical advancements in generative models, including Generative Adversarial Networks (GANs), diffusion models, and in-context learning (ICL) in LLMs, from the two perspectives reviewed previously, i.e., approximation and training dynamics.


Summary

  • The survey organizes the statistical foundations of deep learning around three themes: function approximation, gradient-based training dynamics, and generative models.
  • It explains how hierarchical compositional structure and low intrinsic dimension allow neural networks to mitigate the curse of dimensionality.
  • It contrasts the Neural Tangent Kernel (NTK) and Mean-Field (MF) paradigms as explanations of generalization and feature learning in trained networks.

A Survey on Statistical Theory of Deep Learning: Approximation, Training Dynamics, and Generative Models

The paper "A Survey on Statistical Theory of Deep Learning: Approximation, Training Dynamics, and Generative Models" by Namjoon Suh and Guang Cheng provides a comprehensive review of the statistical foundations underpinning deep learning. It unifies theoretical insights from approximation theory, training dynamics, and generative models, which are central to understanding the capabilities and limitations of neural networks.

Approximation Theory

The first part of the paper concentrates on how neural networks estimate functions in structured classes, with a focus on nonparametric regression (classification is treated in an appendix). It emphasizes that explicit network constructions yield fast convergence rates for the excess risk, with the width and depth of the network chosen as functions of the sample size, data dimension, and smoothness of the target. The survey argues that neural networks enjoy statistical advantages over traditional methods, such as wavelet and kernel estimators, particularly when the target functions have a compositional structure.
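As a concrete reference point (standard in the literature the survey reviews, not a formula quoted from the paper): for a β-Hölder regression function on [0,1]^d, suitably sparse ReLU networks with depth of order log n attain, up to logarithmic factors, the minimax rate

$$\mathbb{E}\,\|\hat f_n - f_0\|_{L^2}^2 \;\asymp\; n^{-\frac{2\beta}{2\beta + d}},$$

which degrades quickly as the ambient dimension d grows; this is the baseline against which the structured results described next should be read.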

The authors then discuss how the curse of dimensionality can be circumvented by exploiting hierarchical compositional structure. Under such assumptions, neural networks achieve minimax-optimal rates that depend on the intrinsic rather than the ambient dimension, giving them a genuine advantage in high-dimensional settings.
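A representative result of this type, stated here for illustration in the notation of Schmidt-Hieber (2020) rather than the survey's own: if the target factorizes as f_0 = g_q ∘ ⋯ ∘ g_0, where each component of g_i depends on only t_i of its inputs and is β_i-Hölder, then deep ReLU estimators attain, up to logarithmic factors, the rate

$$\phi_n \;=\; \max_{0 \le i \le q} n^{-\frac{2\beta_i^*}{2\beta_i^* + t_i}}, \qquad \beta_i^* \;=\; \beta_i \prod_{\ell = i+1}^{q} (\beta_\ell \wedge 1),$$

which is governed by the effective dimensions t_i instead of the ambient dimension d.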

Training Dynamics

The review then turns to training dynamics, specifically how gradient-based methods find solutions that generalize well. Two paradigms are highlighted: the Neural Tangent Kernel (NTK) and the Mean-Field (MF) regime. In the NTK regime, sufficiently wide networks have been shown to behave like kernel methods, with training dynamics well described by a linearization around initialization. The MF regime, by contrast, allows parameters to move far from their initialization, capturing genuine feature learning.
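To make the contrast concrete, here is a minimal sketch in standard notation (assuming a network f(x;θ) trained by gradient flow on a squared loss; this is the generic formulation, not the survey's specific setup). In the NTK regime the prediction evolves approximately linearly in the parameters,

$$f(x;\theta_t) \;\approx\; f(x;\theta_0) + \langle \nabla_\theta f(x;\theta_0),\, \theta_t - \theta_0\rangle, \qquad \Theta(x,x') \;=\; \langle \nabla_\theta f(x;\theta_0),\, \nabla_\theta f(x';\theta_0)\rangle,$$

so training reduces to kernel regression with the fixed tangent kernel Θ. In the MF regime, a two-layer network is instead viewed as an integral over a distribution ρ of neuron parameters, and gradient descent corresponds to a Wasserstein gradient flow of the risk R over that distribution,

$$f(x;\rho) \;=\; \int \sigma(x;\theta)\, \rho(d\theta), \qquad \partial_t \rho_t \;=\; \nabla_\theta \cdot \Big(\rho_t\, \nabla_\theta \frac{\delta R(\rho_t)}{\delta \rho}\Big),$$

which permits large parameter movement and hence feature learning.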

Understanding these dynamics is pivotal because they offer insight into why overparameterized networks trained by gradient descent generalize well despite being able to fit noisy or even random labels. The paper emphasizes that while the NTK provides a tractable theoretical framework, it does not fully capture the generalization power of neural networks, which often exceeds what kernel methods can achieve.

Generative Models

In the domain of generative models, the paper discusses advances in Generative Adversarial Networks (GANs), diffusion models, and In-Context Learning (ICL) in LLMs. Theoretical analyses of GANs center on the approximation power of the generator and discriminator classes and show how well-chosen network architectures lead to distribution-estimation guarantees.
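For orientation, the GAN analyses reviewed here are typically phrased through the integral probability metric (IPM) induced by the discriminator class F; a schematic version of the population objective (standard in this literature, not a formula quoted from the paper) is

$$\min_{g \in \mathcal{G}} \; \max_{f \in \mathcal{F}} \; \Big\{ \mathbb{E}_{x \sim \mu}[f(x)] - \mathbb{E}_{z \sim \gamma}[f(g(z))] \Big\} \;=\; \min_{g \in \mathcal{G}} \; d_{\mathcal{F}}\big(\mu,\; g_{\#}\gamma\big),$$

and the error of the learned generator is then decomposed into discriminator approximation error, generator approximation error, and a statistical term that scales with the complexity of F relative to the sample size.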

Moreover, diffusion models, particularly score-based variants, are highlighted for producing high-quality synthetic data, and the survey reviews how errors in estimating the score function translate into guarantees on the distribution of generated samples. The paper emphasizes that a sharper theoretical understanding of the underlying diffusion processes is needed to exploit their full potential. ICL in LLMs is examined as an example of few-shot adaptation, again analyzed through the two lenses used earlier in the survey: approximation and training dynamics.
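To fix ideas, score-based diffusion models pair a forward noising SDE with a learned reverse-time SDE; in the common Ornstein-Uhlenbeck (variance-preserving) parameterization, written here in standard notation rather than the survey's,

$$dX_t = -\tfrac{1}{2}X_t\,dt + dW_t, \qquad d\bar X_t = \Big[\tfrac{1}{2}\bar X_t + \nabla_x \log p_{T-t}(\bar X_t)\Big]dt + d\bar W_t,$$

where the unknown score ∇_x log p_t is replaced by a neural network s_φ(x, t) trained via denoising score matching; the reviewed theory quantifies how the score-estimation error controls the distance between the generated and target distributions. For ICL, the basic abstraction is a transformer T_θ that maps a prompt (x_1, y_1, …, x_k, y_k, x_{k+1}) to a prediction ŷ_{k+1}, and the question is which learning algorithm (e.g., least squares or a step of gradient descent) this mapping implicitly implements on the in-context examples.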

Implications and Future Directions

The authors conclude by identifying promising directions for future research in the statistical theory of deep learning, emphasizing the role of synthetic data, the handling of distribution shift, and the robustness of AI systems. Theoretical investigation of these areas is important given their broad applicability and their potential to address challenges related to fairness, privacy, and robustness in AI applications.

The paper offers a consolidated view of the complexity and adaptability of deep learning models. It encourages future work to connect theoretical advances with practical implementations, with the aim of developing more efficient, better-generalizing, and more reliable AI systems.
