
Infinite-Width Bayesian Neural Networks

Updated 19 December 2025
  • Infinite-width Bayesian neural networks are a limiting regime where neural outputs converge to Gaussian or stable processes, enabling tractable Bayesian inference.
  • Posterior inference leverages closed-form GP regression or Student-t process limits, enhanced by hierarchical and heavy-tailed prior models.
  • These models offer improved uncertainty quantification and out-of-distribution calibration, though practical generalization may vary from finite-width counterparts.

Infinite-width Bayesian neural networks (BNNs) are a fundamental limiting regime in probabilistic deep learning, where the hidden-layer widths of neural architectures tend to infinity. In this setting, the random-function prior over outputs typically converges to a Gaussian process (GP), allowing closed-form Bayesian inference for regression and classification tasks. This correspondence relies on the central limit theorem under suitable scaling of weight variances, and has been extended to various architectures including convolutional networks, tensor networks, and deep linear networks. Recent work explores non-Gaussian generalizations, especially via heavy-tailed priors or more sophisticated hierarchical variance models, yielding richer stable or Student-t process limits. Uncertainty quantification, posterior inference, and MCMC sampling techniques all benefit from the mathematical tractability of infinite-width BNNs, but practical generalization differences remain between finite-width and infinite-width models, especially under model misspecification.

1. Mathematical Foundations: GP and Stable Process Limits

For fully-connected feedforward networks with $L$ layers, input dimension $d_{in}$, output dimension $d_{out}$, and layer widths $n_1,\dots,n_{L-1}$, the pre-activations at layer $\ell$ take the form:

$$f^{(0)}(x)=x,\qquad f^{(\ell)}(x) = W^{(\ell)}\,\varphi_\ell\bigl(f^{(\ell-1)}(x)\bigr) + b^{(\ell)},\quad \ell=1,\dots,L,$$

with Gaussian priors on weights $W^{(\ell)}_{ij}\sim\mathcal{N}(0,\sigma_W^2/n_{\ell-1})$ and biases $b^{(\ell)}_i\sim\mathcal{N}(0,\sigma_b^2)$ (Caporali et al., 6 Feb 2025, Novak et al., 2019, Juengermann et al., 2022). As $n_1,\dots,n_{L-1}\to\infty$, the finite-dimensional distributions of $f(x)$ converge to those of a GP with kernel $K$ constructed recursively via integrals over (jointly) Gaussian random variables, for instance:

$$K^{(\ell)}(x,x') = \sigma_W^2\,\mathbb{E}_{(u,v)\sim\mathcal{N}_2(0,\,K^{(\ell-1)})}\bigl[\varphi_\ell(u)\varphi_\ell(v)\bigr] + \sigma_b^2.$$
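As a concrete illustration, here is a minimal NumPy sketch of this recursion for ReLU activations, for which the Gaussian expectation has the closed-form arc-cosine expression of Cho & Saul (2009); the depth, variances, and the first-layer normalization by $d_{in}$ are illustrative conventions, not values from any of the cited papers.

```python
import numpy as np

def nngp_kernel_relu(X1, X2, depth=3, sigma_w2=1.5, sigma_b2=0.1):
    """NNGP kernel of a fully connected ReLU network of the given depth.

    The Gaussian expectation in the recursion has the closed-form
    arc-cosine expression (Cho & Saul, 2009):
        E[relu(u) relu(v)] = sqrt(k11*k22) * (sin t + (pi - t)*cos t) / (2*pi),
    with cos t = k12 / sqrt(k11*k22).
    """
    d_in = X1.shape[1]
    # First-layer pre-activation kernel (linear in the inputs).
    K12 = sigma_w2 * (X1 @ X2.T) / d_in + sigma_b2
    K11 = sigma_w2 * np.sum(X1 * X1, axis=1) / d_in + sigma_b2  # K(x, x)
    K22 = sigma_w2 * np.sum(X2 * X2, axis=1) / d_in + sigma_b2

    for _ in range(depth - 1):
        norm = np.sqrt(np.outer(K11, K22))
        cos_t = np.clip(K12 / norm, -1.0, 1.0)
        theta = np.arccos(cos_t)
        # E[relu(u) relu(v)] under (u, v) ~ N(0, K^{(l-1)}).
        expect = norm * (np.sin(theta) + (np.pi - theta) * cos_t) / (2 * np.pi)
        K12 = sigma_w2 * expect + sigma_b2
        # Diagonal case: E[relu(u)^2] = K(x, x) / 2.
        K11 = sigma_w2 * K11 / 2 + sigma_b2
        K22 = sigma_w2 * K22 / 2 + sigma_b2
    return K12


if __name__ == "__main__":
    X = np.random.randn(5, 3)
    print(nngp_kernel_relu(X, X).shape)  # (5, 5) deterministic kernel matrix
```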

The construction extends to deep architectures, convolutional networks, and tensor-network models, and can be implemented at scale (with GPU or TPU acceleration) using libraries such as Neural Tangents (Novak et al., 2019).

However, when the prior weights are instead drawn from symmetric $\alpha$-stable laws with infinite variance, the limiting function is a multivariate stable process rather than a Gaussian process. In this case the process is specified via its characteristic function (it has no covariance), and conditionally Gaussian representations, in which the stable law is realized as a scale mixture of Gaussians, enable tractable inference and feature learning (Loría et al., 2023, Loría et al., 2 Oct 2024).
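A brief numerical illustration of this scale-mixture representation: a symmetric $\alpha$-stable draw can be generated as a Gaussian whose variance is $2A$, with $A$ a one-sided $(\alpha/2)$-stable scale (standardized so the characteristic function is $\exp(-|t|^\alpha)$). The Kanter-type sampler and parameter values below are illustrative, and scale conventions differ across papers.

```python
import numpy as np

def positive_stable(a, size, rng):
    """One-sided a-stable samples (0 < a < 1) with Laplace transform
    exp(-s**a), via Kanter's construction: U ~ Uniform(0, pi), E ~ Exp(1)."""
    U = rng.uniform(0.0, np.pi, size)
    E = rng.exponential(1.0, size)
    return (np.sin(a * U) / np.sin(U) ** (1.0 / a)
            * (np.sin((1.0 - a) * U) / E) ** ((1.0 - a) / a))

rng = np.random.default_rng(0)
alpha = 1.5                                    # stability index of the weight prior
n = 200_000
A = positive_stable(alpha / 2, n, rng)         # positive (alpha/2)-stable scales
X = np.sqrt(2.0 * A) * rng.standard_normal(n)  # conditionally Gaussian draws
# X is symmetric alpha-stable: infinite variance and power-law tails,
# unlike the Gaussian comparison sample below.
print(np.mean(np.abs(X) > 10.0), np.mean(np.abs(rng.standard_normal(n)) > 10.0))
```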

2. Posterior Inference and Hierarchical Variance Models

BNNs with Gaussian weight priors allow posterior inference via closed-form GP regression. When a hierarchical prior is placed on the last-layer weights and the likelihood variance, for instance a shared random variance $\tau\sim\mathrm{InvGamma}(\alpha,\beta)$, the infinite-width limit of the posterior over outputs is a Student-t process (Caporali et al., 6 Feb 2025). Marginalizing $\tau$ yields a multivariate Student-t density:

$$p(f(X)\mid y_D) = \mathrm{StudentT}\!\left(\nu=2\alpha',\ \mu_n,\ \frac{\beta'}{\alpha'}\Sigma_n\right),$$

with polynomial tails (heavier than those of a GP), more robust uncertainty quantification, and tunable degrees of freedom $\nu$.
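A minimal sketch of the structure of this result, assuming the hierarchical scale $\tau$ multiplies both the kernel and the observation noise so that standard normal-inverse-gamma conjugacy applies; the updates $\alpha'=\alpha+n/2$ and $\beta'=\beta+\tfrac12 y^\top K_y^{-1}y$ below are the usual conjugate ones and are meant as an illustration rather than a reproduction of the exact expressions in Caporali et al. Setting $\tau$ to a fixed constant recovers the standard GP posterior mean $\mu_n$ and covariance $\Sigma_n$.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def student_t_posterior(K, Ks, Kss, y, noise_var, a0, b0):
    """Predictive Student-t from GP regression with tau ~ InvGamma(a0, b0)
    scaling both kernel and noise: f ~ GP(0, tau*K), y | f ~ N(f, tau*noise_var*I).

    Returns (nu, mean, scale_matrix) of the Student-t predictive.
    """
    n = y.shape[0]
    Ky = K + noise_var * np.eye(n)
    c, low = cho_factor(Ky)
    alpha_vec = cho_solve((c, low), y)

    mu_n = Ks @ alpha_vec                           # GP posterior mean
    Sigma_n = Kss - Ks @ cho_solve((c, low), Ks.T)  # GP posterior covariance

    a_post = a0 + n / 2.0                           # conjugate updates for tau
    b_post = b0 + 0.5 * y @ alpha_vec
    nu = 2.0 * a_post
    return nu, mu_n, (b_post / a_post) * Sigma_n


# Example with a squared-exponential kernel on synthetic 1-D data.
rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 25)[:, None]
xs = np.linspace(-1.5, 1.5, 5)[:, None]
k = lambda A, B: np.exp(-0.5 * (A - B.T) ** 2)
y = rng.multivariate_normal(np.zeros(25), k(x, x) + 0.01 * np.eye(25))
nu, m, S = student_t_posterior(k(x, x), k(xs, x), k(xs, xs), y, 0.01, a0=2.0, b0=2.0)
print(nu, m.shape, S.shape)
```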

For networks with infinite-variance ($\alpha$-stable) priors, posterior inference leverages a conditionally Gaussian representation via a positive stable mixing variable:

$$z_j^{(\ell)}(x) \sim \mathcal{N}\bigl(0,\ s_+^{(\ell)}\Sigma^{(\ell)}\bigr), \qquad s_+^{(\ell)}\sim S^+_{\alpha/2},$$

and inference proceeds by sampling the latent scales $s_+$ via Metropolis-Hastings, updating layer-wise random kernels recursively (Loría et al., 2 Oct 2024).
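One simple way to realize such an update, sketched below for a single layer-wise scale: because the positive stable density has no closed form, the latent scale can be refreshed with an independence Metropolis-Hastings move that proposes from the prior, so that only the conditionally Gaussian likelihood enters the acceptance ratio. The full scheme of Loría et al. updates scales at every layer and recomputes the kernels recursively; everything below (the Kanter-type sampler, the synthetic kernel and data) is an illustrative stand-in.

```python
import numpy as np
from scipy.stats import multivariate_normal

def positive_stable(a, rng):
    """One-sided a-stable draw (0 < a < 1); Kanter's construction,
    Laplace transform exp(-s**a)."""
    U = rng.uniform(0.0, np.pi)
    E = rng.exponential(1.0)
    return (np.sin(a * U) / np.sin(U) ** (1.0 / a)
            * (np.sin((1.0 - a) * U) / E) ** ((1.0 - a) / a))

def mh_scale_chain(y, Sigma, alpha, n_iter, rng):
    """Independence MH for a latent scale s_+ with prior S^+_{alpha/2},
    under the conditionally Gaussian model y ~ N(0, s_+ * Sigma).
    Proposing from the prior makes the (intractable) prior density cancel."""
    zero = np.zeros(len(y))
    loglik = lambda s: multivariate_normal.logpdf(y, zero, s * Sigma)
    s = positive_stable(alpha / 2, rng)
    ll = loglik(s)
    chain = []
    for _ in range(n_iter):
        s_prop = positive_stable(alpha / 2, rng)   # proposal = fresh prior draw
        ll_prop = loglik(s_prop)
        if np.log(rng.uniform()) < ll_prop - ll:   # likelihood ratio only
            s, ll = s_prop, ll_prop
        chain.append(s)
    return np.array(chain)

rng = np.random.default_rng(1)
n = 20
x = np.linspace(-1.0, 1.0, n)[:, None]
Sigma = np.exp(-0.5 * (x - x.T) ** 2) + 1e-6 * np.eye(n)   # any fixed PSD kernel
y = rng.multivariate_normal(np.zeros(n), 2.0 * Sigma)       # synthetic data
print(mh_scale_chain(y, Sigma, alpha=1.5, n_iter=2000, rng=rng).mean())
```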

3. Uncertainty Quantification and Calibration

Infinite-width BNNs yield GPs with deterministic kernels, affording exact closed-form uncertainty quantification via the GP formulas for posterior mean and covariance (Juengermann et al., 2022, Novak et al., 2019). The NNGP view treats network outputs as draws from a GP specified by the infinite-width kernel; the NTK construction, which applies to gradient-descent-trained networks, yields an alternative GP kernel that captures the training dynamics.

With Student-t or stable process limits, uncertainty bands become heavier-tailed, offering greater robustness to outliers and small-sample effects. In practice, marginalizing hierarchical variances or stable mixing variables yields uncertainty intervals with closer-to-nominal coverage in challenging non-smooth or discontinuous regression tasks (Caporali et al., 6 Feb 2025, Loría et al., 2023, Loría et al., 2 Oct 2024).

Overconfidence in standard finite-width ReLU BNNs, manifested as output variance growing only quadratically with distance from the training set, is corrected in the infinite-feature limit: cubic variance growth drives the softmax probabilities toward $1/C$ far from the data, i.e., maximal uncertainty (Kristiadi et al., 2020).
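A quick Monte Carlo illustration of the last point: when the (isotropic) variance of Gaussian logits dominates their means, the expected softmax collapses toward the uniform vector $1/C$. All values below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 5
mean_logits = np.array([3.0, 1.0, 0.0, -1.0, -2.0])   # a confident point estimate

for sigma in [0.1, 3.0, 30.0, 300.0]:
    z = mean_logits + sigma * rng.standard_normal((100_000, C))
    z = z - z.max(axis=1, keepdims=True)               # numerical stability
    p = np.exp(z)
    p /= p.sum(axis=1, keepdims=True)
    # As sigma grows, the averaged softmax tends to 1/C = 0.2 in every class.
    print(sigma, np.round(p.mean(axis=0), 3))
```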

Calibration metrics such as negative log-likelihood (NLL), Brier score, and Expected Calibration Error (ECE) confirm the improved calibration of infinite-width GP-based BNNs in both regression and classification, especially under out-of-distribution data and distributional shift (Adlam et al., 2020).
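For reference, a standard equal-width-binning recipe for ECE on top of predicted class probabilities; this follows the usual definition rather than any particular library's implementation.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE with equal-width confidence bins.

    probs:  (N, C) predicted class probabilities
    labels: (N,)   integer ground-truth classes
    """
    conf = probs.max(axis=1)                 # top-class confidence
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)

    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap         # weight gap by bin frequency
    return ece
```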

4. Comparison to Finite-Width Bayesian Neural Networks

Finite-width BNNs define non-Gaussian, heavier-tailed priors and exhibit subtle dependence effects between units that vanish only in the infinite-width limit (Yao et al., 2022, Vladimirova et al., 2021, Lu, 2023). Empirical studies demonstrate that finite-width BNNs have more flexible frequency spectra and may generalize better than NNGP models under model mismatch, especially for processes with significant high-frequency structure. However, inference in finite-width BNNs is intractable for large architectures, requiring expensive MCMC or variational techniques.

Depth in finite linear BNNs induces data-dependent scale averaging—each output channel is a scale mixture of GPs—which collapses to a single GP as width increases (Zavatone-Veth et al., 2021). Bottleneck architectures (finite intermediate layer widths) retain discriminative, input-dependent posteriors and output dependence even in deep limits, contrasting with total independence in fully infinite-width networks (Agrawal et al., 2020).

5. Stochastic Sampling: Function-space MCMC and Efficient Scaling

High-dimensional parameter-space MCMC methods often degenerate as network width increases; acceptance rates and mixing collapse unless proposal step sizes shrink with dimension. Function-space MCMC algorithms such as preconditioned Crank-Nicolson (pCN/pCNL)—formulated to preserve Gaussian measures in Hilbert space—circumvent this issue: as layer widths $\to\infty$, acceptance rates approach unity and effective sample sizes per step scale favorably (Pezzetti et al., 26 Aug 2024). Empirical results show that for wide networks, pCN/pCNL samplers offer dramatically improved mixing over standard Langevin MCMC, and are well posed in both function and parameter space.
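A minimal sketch of one simple parameter-space instance of pCN, assuming a standard Gaussian prior $\mathcal{N}(0, I)$ over the stacked network weights and writing $\Phi$ for the negative log-likelihood; the dimension, step size $\beta$, and potential below are placeholders. The proposal leaves the Gaussian prior invariant, so only the likelihood enters the acceptance ratio, which is why acceptance need not collapse as the parameter dimension grows.

```python
import numpy as np

def pcn_sampler(potential, dim, n_iter, beta=0.1, rng=None):
    """Preconditioned Crank-Nicolson MCMC for a standard Gaussian prior.

    potential(theta) = Phi(theta), the negative log-likelihood; the prior
    N(0, I) is handled exactly by the proposal and never enters the ratio.
    """
    rng = rng or np.random.default_rng()
    theta = rng.standard_normal(dim)
    phi = potential(theta)
    samples, accepts = [], 0
    for _ in range(n_iter):
        xi = rng.standard_normal(dim)                        # draw from the prior
        prop = np.sqrt(1.0 - beta ** 2) * theta + beta * xi  # pCN proposal
        phi_prop = potential(prop)
        if np.log(rng.uniform()) < phi - phi_prop:           # likelihood-only ratio
            theta, phi = prop, phi_prop
            accepts += 1
        samples.append(theta.copy())
    return np.array(samples), accepts / n_iter

# Toy potential: Gaussian likelihood centered at 1 in a 1000-dimensional space.
samples, acc_rate = pcn_sampler(lambda t: 0.5 * np.sum((t - 1.0) ** 2),
                                dim=1000, n_iter=2000)
print(acc_rate, samples.shape)
```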

Libraries such as Neural Tangents automate the practical implementation of infinite-width inference, supporting both GPU/TPU acceleration and batched computation for posterior mean and covariance extraction (Novak et al., 2019).
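In practice the workflow looks roughly like the following, adapted from the library's published quickstart pattern; the architecture and hyperparameters are illustrative, and exact keyword arguments may differ across Neural Tangents versions.

```python
import jax.numpy as jnp
import neural_tangents as nt
from neural_tangents import stax

# Infinite-width fully-connected ReLU architecture.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512, W_std=1.5, b_std=0.05), stax.Relu(),
    stax.Dense(512, W_std=1.5, b_std=0.05), stax.Relu(),
    stax.Dense(1,   W_std=1.5, b_std=0.05),
)

x_train = jnp.linspace(-1.0, 1.0, 20).reshape(-1, 1)
y_train = jnp.sin(3.0 * x_train)
x_test = jnp.linspace(-1.5, 1.5, 50).reshape(-1, 1)

# Closed-form posterior mean and covariance under the NNGP / NTK kernels.
predict_fn = nt.predict.gradient_descent_mse_ensemble(
    kernel_fn, x_train, y_train, diag_reg=1e-4)
mean, cov = predict_fn(x_test=x_test, get='nngp', compute_cov=True)
print(mean.shape, cov.shape)
```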

6. Non-Gaussianity and Advanced Kernel Processes

Recent advances extend the classical infinite-width Gaussian regime to non-Gaussian processes driven by heavy-tailed priors, nontrivial variance hierarchies, or finite-feature corrections. Edgeworth expansions yield analytic quantification of kurtosis and non-Gaussian corrections at order $1/N$ in finite-width networks (Lu, 2023), while scale-mixture representations (via stable laws) generalize standard kernel learning to stochastic kernel processes that encode representation learning in the infinite-width limit (Loría et al., 2 Oct 2024).
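The $1/N$ behaviour is easy to see in simulation: for a one-hidden-layer ReLU network evaluated at a fixed input, the excess kurtosis of the output prior decays roughly like $1/N$ in the width $N$. The widths, input, and sample sizes in the crude Monte Carlo check below are illustrative.

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
x = rng.standard_normal(4)                          # a single fixed input

for width in [10, 100, 1000]:
    outs = []
    for _ in range(20_000):
        W1 = rng.standard_normal((width, 4)) / np.sqrt(4)      # input-to-hidden
        w2 = rng.standard_normal(width) / np.sqrt(width)       # hidden-to-output
        outs.append(w2 @ np.maximum(W1 @ x, 0.0))              # ReLU network output
    # Excess kurtosis of the output prior shrinks roughly like 1 / width.
    print(width, kurtosis(outs))
```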

Deep kernel processes under infinite variance priors admit recursive kernel construction analogous to Cho & Saul (2009), but with layer-wise conditional random covariance matrices linked by stable mixing variables and learned directly from data, enabling scalable and adaptive posterior inference well beyond conventional GP regression.
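Schematically, such a stochastic kernel process can be obtained by inserting a random positive stable scale at each layer of the ReLU recursion from Section 1; each draw of the scales produces one random kernel, and inference averages over these draws. The sketch below (with hard-coded stand-in scales in place of stable samples) only illustrates this structure and is not the full construction of Loría et al. (2 Oct 2024).

```python
import numpy as np

def random_relu_kernel(X, scales, sigma_b2=0.1):
    """One draw of a layer-wise randomly scaled ReLU kernel.

    `scales` holds positive stable draws s_+^{(l)}, one per layer; each
    replaces the deterministic sigma_W^2 in that layer's recursion step."""
    d_in = X.shape[1]
    K = scales[0] * (X @ X.T) / d_in + sigma_b2
    for s in scales[1:]:
        diag = np.sqrt(np.diag(K))
        cos_t = np.clip(K / np.outer(diag, diag), -1.0, 1.0)
        theta = np.arccos(cos_t)
        expect = np.outer(diag, diag) * (
            np.sin(theta) + (np.pi - theta) * cos_t) / (2 * np.pi)
        K = s * expect + sigma_b2
    return K

X = np.random.randn(6, 3)
scales = [1.2, 0.8, 3.5]   # stand-ins for positive stable draws (see Section 2 sketch)
print(random_relu_kernel(X, scales).shape)   # (6, 6); a new kernel per scale draw
```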

7. Practical Applications and Limitations

Infinite-width Bayesian neural networks underpin tractable uncertainty quantification methods, favor scalable function-space inference, and serve as a principled basis for transfer learning via GP posteriors on pre-trained embeddings (Adlam et al., 2020). Student-t and stable process generalizations provide heavier-tailed, robust models for discontinuous, outlier-prone data, and bottleneck architectures improve multi-output dependence and discriminative feature learning.

Limitations include the deterministic nature of GP/NNGP kernels in the classical Gaussian regime—precluding representation learning unless heavy-tailed or hierarchical priors are used—and limited scalability of fully Bayesian inference for finite-width architectures. Open directions involve quantifying optimal width-depth trade-offs under data/noise constraints, devising scalable non-Gaussian inference for large networks, and integrating adaptive kernel learning into function-space modeling for high-dimensional tasks.

