Bayesian Inference with Deep Weakly Nonlinear Networks (2405.16630v1)
Abstract: We show at a physics level of rigor that Bayesian inference with a fully connected neural network and a shaped nonlinearity of the form $\phi(t) = t + \psi t^3/L$ is (perturbatively) solvable in the regime where the number of training datapoints $P$, the input dimension $N_0$, the network layer widths $N$, and the network depth $L$ are simultaneously large. Our results hold with weak assumptions on the data; the main constraint is that $P < N_0$. We provide techniques to compute the model evidence and posterior to arbitrary order in $1/N$ and at arbitrary temperature. We report the following results from the first-order computation:
1. When the width $N$ is much larger than the depth $L$ and the training set size $P$, neural network Bayesian inference coincides with Bayesian inference using a kernel. The value of $\psi$ determines the curvature of a sphere, hyperbola, or plane into which the training data is implicitly embedded under the feature map.
2. When $LP/N$ is a small constant, neural network Bayesian inference departs from the kernel regime. At zero temperature, it is equivalent to Bayesian inference using a data-dependent kernel, and $LP/N$ serves as an effective depth that controls the extent of feature learning.
3. In the restricted case of deep linear networks ($\psi = 0$) and noisy data, we exhibit a simple data model for which evidence and generalization error are optimal at zero temperature. As $LP/N$ increases, both evidence and generalization further improve, demonstrating the benefit of depth in benign overfitting.
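As a concrete illustration of the kernel regime described in point 1 above, the sketch below iterates the standard infinite-width (NNGP) kernel recursion for the shaped nonlinearity $\phi(t) = t + \psi t^3/L$ and then forms the usual kernel posterior mean, with temperature playing the role of observation noise. Assuming unit weight variance and no biases (an assumption of this sketch, not necessarily the paper's normalization), Gaussian moments give the per-layer update of a pair kernel entry as
$$K_{12} \;\mapsto\; K_{12} + \frac{3\psi}{L}\,(K_{11} + K_{22})\,K_{12} + \frac{\psi^2}{L^2}\,\bigl(9\,K_{11}K_{22}K_{12} + 6\,K_{12}^3\bigr).$$
The code is a minimal sketch of this recursion, not the paper's computation; the function names (`shaped_kernel`, `gp_posterior_mean`) and the toy data are illustrative.

```python
# Minimal sketch: kernel-regime (N >> L, P) Bayesian inference with the
# shaped nonlinearity phi(t) = t + psi * t^3 / L, via the standard NNGP
# kernel recursion.  Weight/bias scalings here are illustrative assumptions.
import numpy as np

def shaped_kernel(X, L, psi):
    """Depth-L NNGP kernel for phi(t) = t + psi * t^3 / L.

    X : (P, N0) array of inputs; L : depth; psi : shaping parameter.
    Returns the (P, P) prior covariance of the network outputs.
    """
    K = X @ X.T / X.shape[1]      # layer-0 kernel: input Gram matrix
    for _ in range(L):
        d = np.diag(K)            # K(x, x) for each datapoint
        K11 = d[:, None]          # variance of the first input in each pair
        K22 = d[None, :]          # variance of the second input in each pair
        # E[phi(u) phi(v)] for (u, v) ~ N(0, [[K11, K12], [K12, K22]])
        K = (K
             + (3.0 * psi / L) * (K11 + K22) * K
             + (psi / L) ** 2 * (9.0 * K11 * K22 * K + 6.0 * K ** 3))
    return K

def gp_posterior_mean(K_train, K_test_train, y_train, temperature):
    """Kernel (GP) posterior mean; `temperature` plays the role of noise."""
    P = K_train.shape[0]
    return K_test_train @ np.linalg.solve(K_train + temperature * np.eye(P), y_train)

# Toy usage with P < N0, as assumed in the abstract.
rng = np.random.default_rng(0)
P, N0, L, psi = 20, 50, 10, 0.3
X = rng.standard_normal((P, N0))
y = X @ rng.standard_normal(N0) / np.sqrt(N0)
X_test = rng.standard_normal((5, N0))

K = shaped_kernel(np.vstack([X_test, X]), L, psi)
mean = gp_posterior_mean(K[5:, 5:], K[:5, 5:], y, temperature=1e-3)
```

Setting $\psi = 0$ in this sketch recovers the deep linear case of point 3, for which the kernel stays equal to the input Gram matrix at every depth.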