Bayesian Inference with Deep Weakly Nonlinear Networks (2405.16630v1)

Published 26 May 2024 in stat.ML, cs.AI, cs.LG, math.PR, and physics.data-an

Abstract: We show at a physics level of rigor that Bayesian inference with a fully connected neural network and a shaped nonlinearity of the form $\phi(t) = t + \psi t^3/L$ is (perturbatively) solvable in the regime where the number of training datapoints $P$, the input dimension $N_0$, the network layer widths $N$, and the network depth $L$ are simultaneously large. Our results hold with weak assumptions on the data; the main constraint is that $P < N_0$. We provide techniques to compute the model evidence and posterior to arbitrary order in $1/N$ and at arbitrary temperature. We report the following results from the first-order computation: 1. When the width $N$ is much larger than the depth $L$ and training set size $P$, neural network Bayesian inference coincides with Bayesian inference using a kernel. The value of $\psi$ determines the curvature of a sphere, hyperbola, or plane into which the training data is implicitly embedded under the feature map. 2. When $LP/N$ is a small constant, neural network Bayesian inference departs from the kernel regime. At zero temperature, neural network Bayesian inference is equivalent to Bayesian inference using a data-dependent kernel, and $LP/N$ serves as an effective depth that controls the extent of feature learning. 3. In the restricted case of deep linear networks ($\psi=0$) and noisy data, we show a simple data model for which evidence and generalization error are optimal at zero temperature. As $LP/N$ increases, both evidence and generalization further improve, demonstrating the benefit of depth in benign overfitting.
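
To make the setup concrete, here is a minimal sketch of a single prior draw from a depth-$L$, width-$N$ fully connected network with the shaped activation $\phi(t) = t + \psi t^3/L$ described in the abstract. The $1/\sqrt{\text{fan-in}}$ Gaussian weight scaling, the scalar readout layer, and the numerical values are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def shaped_phi(t, psi, L):
    """Shaped nonlinearity phi(t) = t + psi * t^3 / L: the cubic term is
    suppressed by 1/L, so each layer is only weakly nonlinear at large depth."""
    return t + psi * t**3 / L

def prior_draw(X, L=64, N=512, psi=1.0, seed=0):
    """One sample from the prior of a depth-L, width-N fully connected network
    with the shaped activation. The 1/sqrt(fan_in) Gaussian weight scaling and
    the scalar readout are illustrative assumptions.
    X: (P, N_0) array of P inputs of dimension N_0."""
    rng = np.random.default_rng(seed)
    h = X
    for _ in range(L):
        W = rng.normal(0.0, 1.0 / np.sqrt(h.shape[1]), size=(h.shape[1], N))
        h = shaped_phi(h @ W, psi, L)
    w_out = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, 1))
    return h @ w_out

# P = 8 training points in N_0 = 32 dimensions, so P < N_0 as the paper requires
X = np.random.default_rng(1).normal(size=(8, 32)) / np.sqrt(32)
print(prior_draw(X).shape)  # (8, 1)
```

Because the cubic correction carries an explicit $1/L$ factor, each individual layer stays close to linear as the depth grows, which is what makes the network "weakly nonlinear" in the joint large-$N$, large-$L$ limit.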

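Result 1 above states that when $N$ is much larger than $L$ and $P$, neural network Bayesian inference coincides with Bayesian inference using a kernel. For reference, the snippet below sketches kernel-based Bayesian inference in its standard Gaussian-process-regression form; the particular kernel, the noise variance playing a temperature-like role, and the toy data are illustrative assumptions rather than the paper's construction.

```python
import numpy as np

def kernel_posterior(K_train, K_cross, K_test, y, noise_var=1e-2):
    """Posterior mean and covariance for Bayesian inference with a fixed kernel
    (standard Gaussian-process regression). noise_var acting as a
    temperature-like parameter is an illustrative assumption.
    K_train: (P, P) kernel on training inputs
    K_cross: (P, M) kernel between training and test inputs
    K_test:  (M, M) kernel on test inputs
    y:       (P,)   training targets"""
    A = K_train + noise_var * np.eye(K_train.shape[0])
    alpha = np.linalg.solve(A, y)
    mean = K_cross.T @ alpha
    cov = K_test - K_cross.T @ np.linalg.solve(A, K_cross)
    return mean, cov

# Toy example with a linear kernel K(x, x') = x . x' / N_0 (an assumption)
rng = np.random.default_rng(0)
P, M, N0 = 8, 4, 32                      # P < N_0, as required in the paper
X, Xs = rng.normal(size=(P, N0)), rng.normal(size=(M, N0))
y = rng.normal(size=P)
K = lambda A, B: A @ B.T / N0
mean, cov = kernel_posterior(K(X, X), K(X, Xs), K(Xs, Xs), y)
print(mean.shape, cov.shape)             # (4,) (4, 4)
```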