A statistical mechanics framework for Bayesian deep neural networks beyond the infinite-width limit (2209.04882v5)

Published 11 Sep 2022 in cond-mat.dis-nn

Abstract: Despite the practical success of deep neural networks, a comprehensive theoretical framework that can predict practically relevant scores, such as the test accuracy, from knowledge of the training data is currently lacking. Huge simplifications arise in the infinite-width limit, where the number of units $N_\ell$ in each hidden layer ($\ell=1,\dots, L$, where $L$ is the depth of the network) far exceeds the number $P$ of training examples. This idealisation, however, blatantly departs from the reality of deep learning practice. Here, we use the toolset of statistical mechanics to overcome these limitations and derive an approximate partition function for fully-connected deep neural architectures, which encodes information about the trained models. The computation holds in the "thermodynamic limit" where both $N_\ell$ and $P$ are large and their ratio $\alpha_\ell = P/N_\ell$ is finite. This advance allows us to obtain (i) a closed formula for the generalisation error associated with a regression task in a one-hidden-layer network with finite $\alpha_1$; (ii) an approximate expression of the partition function for deep architectures (via an "effective action" that depends on a finite number of "order parameters"); (iii) a link between deep neural networks in the proportional asymptotic limit and Student's $t$ processes.
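The connection to Student's $t$ processes mentioned in point (iii) can be pictured through the standard scale-mixture construction of a $t$ process: a Gaussian process whose overall scale is randomised by a chi-squared (equivalently, inverse-gamma) variable. The sketch below is a minimal numerical illustration of that construction only, not the paper's derivation; the kernel, grid, and degrees-of-freedom value are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary 1-D inputs and a generic squared-exponential kernel
# (illustrative choices, not quantities from the paper).
x = np.linspace(-1.0, 1.0, 50)
K = np.exp(-0.5 * (x[:, None] - x[None, :]) ** 2 / 0.3**2) + 1e-8 * np.eye(x.size)
L = np.linalg.cholesky(K)

nu = 5.0  # degrees of freedom of the t process (assumed value)

def sample_gp():
    """One draw from the zero-mean Gaussian process with kernel K."""
    return L @ rng.standard_normal(x.size)

def sample_tp(nu):
    """One draw from the corresponding Student's t process:
    a GP draw divided by sqrt(w / nu), w ~ chi^2_nu
    (the usual Gaussian scale-mixture construction)."""
    w = rng.chisquare(nu)
    return sample_gp() / np.sqrt(w / nu)

gp_draws = np.array([sample_gp() for _ in range(1000)])
tp_draws = np.array([sample_tp(nu) for _ in range(1000)])

# The t-process marginals have heavier tails (positive excess kurtosis),
# while the Gaussian marginals have excess kurtosis close to zero.
def excess_kurtosis(a):
    a = a - a.mean()
    return (a**4).mean() / (a**2).mean() ** 2 - 3.0

print("GP marginal excess kurtosis:", excess_kurtosis(gp_draws[:, 0]))
print("tP marginal excess kurtosis:", excess_kurtosis(tp_draws[:, 0]))
```

In the paper's setting the heavy-tailed behaviour arises from integrating out finite-width fluctuations at finite $\alpha_\ell$; the sketch above only mimics the resulting prior family, not that calculation.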
