Infinite-Width Limit: Theory & Implications

Updated 9 May 2026

The infinite-width limit is the regime in which neural network widths tend to infinity, so that the network at initialization behaves as a Gaussian process and training dynamics reduce to a deterministic kernel.
The framework distinguishes between lazy (kernel) and feature-learning regimes using scaling phase diagrams that guide initialization and architecture design.
Quantitative convergence bounds and extensions to non-Gaussian, deep proportional, and structured models offer practical insights for theoretical and applied research.

The infinite-width limit refers to a mathematical regime in which the width (number of neurons per layer) of a neural network tends to infinity. In this asymptotic limit, both the initialization and training dynamics of deep networks often simplify dramatically, revealing universal behaviors such as Gaussian process (GP) priors, deterministic training kernels, and phase diagrams of scaling regimes. The infinite-width analysis provides a principled foundation for understanding neural networks' expressivity, generalization, implicit regularization, and feature learning capacities.

1. Foundations: Definition, Gaussian Process Limit, and Kernel Recursions

In fully connected feed-forward networks with standard weight and bias scaling ( $\mathcal N(0, \sigma_w^2/n_{l-1})$ for weights, $\mathcal N(0, \sigma_b^2)$ for biases), sending all hidden layer widths $n_1, n_2, \ldots, n_L \to \infty$ at fixed depth and fixed input/output dimensions makes the output functions converge in law to a Gaussian process (Hanin, 2021, Bahri et al., 2023, Giovagnini et al., 4 May 2026). The layerwise covariances are defined recursively: $\begin{aligned} K^{(0)}(x, x') &= \sigma_b^2 + \sigma_w^2 \, \frac{x \cdot x'}{n_0}, \\ K^{(l)}(x, x') &= \sigma_b^2 + \sigma_w^2 \, \mathbb{E}_{(u, v)\sim \mathcal N(0, \Sigma^{(l-1)})}\bigl[ \phi(u) \phi(v) \bigr] \end{aligned}$ for $l = 1,\ldots,L$ , where $\Sigma^{(l-1)}$ is the $2 \times 2$ covariance matrix with entries $K^{(l-1)}(x, x)$ , $K^{(l-1)}(x, x')$ , and $K^{(l-1)}(x', x')$ .
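The recursion above can be evaluated in closed form for some activations. The following is a minimal sketch, assuming $\phi$ is ReLU (the text leaves $\phi$ generic), in which case the Gaussian expectation is given by the degree-1 arc-cosine formula. The function names `relu_expectation` and `nngp_kernel` and the default $\sigma_w^2 = 2$, $\sigma_b^2 = 0$ are illustrative choices, not taken from the cited works.

```python
import numpy as np

def relu_expectation(k11, k12, k22):
    """E[relu(u) relu(v)] for (u, v) ~ N(0, [[k11, k12], [k12, k22]]).

    Uses the degree-1 arc-cosine closed form; valid for ReLU only.
    """
    norm = np.sqrt(k11 * k22)
    cos_theta = np.clip(k12 / norm, -1.0, 1.0)  # guard against rounding
    theta = np.arccos(cos_theta)
    return norm * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2.0 * np.pi)

def nngp_kernel(X, depth, sigma_w2=2.0, sigma_b2=0.0):
    """Iterate K^(0), ..., K^(depth) on the rows of X and return K^(depth)."""
    n0 = X.shape[1]
    K = sigma_b2 + sigma_w2 * (X @ X.T) / n0          # K^(0)
    for _ in range(depth):
        diag = np.diag(K)
        # Sigma^(l-1) for each pair is built from K(x,x), K(x,x'), K(x',x').
        K = sigma_b2 + sigma_w2 * relu_expectation(diag[:, None], K, diag[None, :])
    return K

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 10))   # four inputs of dimension n0 = 10
K = nngp_kernel(X, depth=3)
print(K.shape)
```

With $\sigma_w^2 = 2$ and $\sigma_b^2 = 0$ the diagonal $K^{(l)}(x, x)$ is preserved from layer to layer, which gives a quick sanity check on the implementation.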

At infinite width and for fixed $L$ , the joint law of network outputs on any finite input set is multivariate Gaussian. For Bayesian neural networks, infinitesimal learning rates and quadratic losses, the posterior over functions becomes that of a Gaussian process with the so-called NNGP (Neural Network GP) kernel (Pacelli et al., 2022, Bahri et al., 2023). Similarly, the neural tangent kernel (NTK) (Bahri et al., 2023) can be computed with a coupled recursion, and governs the linearized dynamics of gradient flow around initialization.
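For reference, the coupled NTK recursion mentioned above is commonly written as follows. This is the standard form from the NTK literature, stated here in the indexing of the covariance recursion and assuming the same parameterization; the symbols $\Theta^{(l)}$ and $\dot K^{(l)}$ are notation introduced for this sketch.

```latex
\begin{aligned}
\Theta^{(0)}(x, x') &= K^{(0)}(x, x'), \\
\dot K^{(l)}(x, x') &= \sigma_w^2 \, \mathbb{E}_{(u, v)\sim \mathcal N(0, \Sigma^{(l-1)})}\bigl[ \phi'(u)\, \phi'(v) \bigr], \\
\Theta^{(l)}(x, x') &= K^{(l)}(x, x') + \dot K^{(l)}(x, x')\, \Theta^{(l-1)}(x, x'), \qquad l = 1, \ldots, L.
\end{aligned}
```

The linearization it governs is $f_t(x) \approx f_0(x) + \nabla_\theta f_0(x) \cdot (\theta_t - \theta_0)$, which becomes exact in the kernel regime as the width diverges.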

2. Dynamical Phase Structure: NTK, Mean-Field, and Condensed Regimes

The infinite-width limit supports a taxonomy of learning regimes determined by how initialization and learning rates scale with respect to width (Luo et al., 2020, Golikov, 2020, Golikov, 2020, Yang et al., 2020). For two-layer ReLU networks,

$f_{\theta}^\alpha(x) = \frac{1}{\alpha} \sum_{k=1}^m a_k \sigma(w_k^\top x)$

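where the output weights $a_k$ and input weights $w_k$ are initialized as centered Gaussians with standard deviations $\beta_1$ and $\beta_2$ respectively, scaling the initialization and the output prefactor $\alpha$ with the parameters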

$\kappa = \frac{\beta_1 \beta_2}{\alpha}, \quad \gamma = -\lim_{m\to\infty} \frac{\log \kappa}{\log m}, \quad \gamma' = -\lim_{m\to\infty} \frac{\log(\beta_1/\beta_2)}{\log m}$

yields three regimes as $m \to \infty$ (Luo et al., 2020):

Linear regime ( $\gamma < 1$ or $\gamma' > \gamma - 1$ ): vanishing movement of input weights (their relative change tends to $0$ ). Training is governed by a fixed NTK, and the network behaves as a kernel method with exponentially convergent loss.
Critical regime (the boundary between the linear and condensed regions of the $(\gamma, \gamma')$ plane): $O(1)$ changes in weights. Dynamics reduce to a nonlinear mean-field PDE for neuron distributions, with nontrivial (non-kernel) feature evolution.
Condensed regime ( $\gamma > 1$ and $\gamma' < \gamma - 1$ ): diverging change in weights; neurons cluster at discrete orientations. This is a strongly nonlinear regime with an emergent finite set of active features.

This phase diagram precisely predicts both qualitative and quantitative properties observed in synthetic and real tasks. For deeper or more general networks, a similar diagram arises from the scaling analysis of learning rates and weight variances (Golikov, 2020, Golikov, 2020, Yang et al., 2020).
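To see how the coordinates $(\gamma, \gamma')$ arise from a concrete parameterization, the definitions of $\kappa$, $\gamma$, and $\gamma'$ can be evaluated at a large but finite $m$. The sketch below assumes $\beta_1$ and $\beta_2$ are the initialization standard deviations of $a_k$ and $w_k$, and uses the usual conventions that the NTK parameterization has prefactor $\alpha = \sqrt{m}$ and the mean-field parameterization has $\alpha = m$, both with unit-scale initialization. The helper name `scaling_exponents` is hypothetical.

```python
import math

def scaling_exponents(alpha, beta1, beta2, m):
    """Finite-m estimates of (gamma, gamma') from the definitions in the text."""
    kappa = beta1 * beta2 / alpha
    gamma = -math.log(kappa) / math.log(m)
    gamma_prime = -math.log(beta1 / beta2) / math.log(m)
    return gamma, gamma_prime

m = 10_000
# NTK-style scaling: f = m**-0.5 * sum_k a_k relu(w_k . x), O(1) weights.
ntk = scaling_exponents(alpha=math.sqrt(m), beta1=1.0, beta2=1.0, m=m)
# Mean-field scaling: f = m**-1 * sum_k a_k relu(w_k . x), O(1) weights.
mean_field = scaling_exponents(alpha=float(m), beta1=1.0, beta2=1.0, m=m)
print("NTK:", ntk)
print("mean-field:", mean_field)
```

For pure power-law scalings the finite-$m$ estimate coincides with the limit, so the NTK choice sits at $\gamma = 1/2$ and the mean-field choice at $\gamma = 1$, both with $\gamma' = 0$.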

3. Training Dynamics: Kernel vs. Feature Learning Limits

The NTK and mean-field (MF) limits are distinct in the infinite-width regime:

NTK (Kernel/Lazy Training): For standard or "NTK" parameterizations, the trainable function changes only in the directions set at initialization; internal features remain fixed. Training reduces to kernel gradient descent: $\partial_t f_t(x) = -\sum_{i} \Theta(x, x_i)\, \partial_f \ell\bigl(f_t(x_i), y_i\bigr)$ , where the sum runs over training pairs $(x_i, y_i)$ , $\ell$ is the per-example loss, and $\Theta$ is the deterministic NTK (Bahri et al., 2023, Yang et al., 2020, Seleznova et al., 2022).
Mean-Field and Maximal-Update (Feature Learning): For "mean-field" or "maximal-update" ( $\mu$P) parameterizations, features evolve nontrivially. In the two-layer case, the empirical neuron distribution obeys a Wasserstein gradient flow PDE (Luo et al., 2020, Yang et al., 2020). For deeper nets, maximal feature learning at infinite width requires specific scaling of per-layer initialization and learning rate (Yang et al., 2020, Hajjar et al., 2021).
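In the kernel regime with squared loss, kernel gradient descent restricted to the training inputs is a linear ODE, $\dot f = -\Theta (f - y)$, whose solution decays exponentially along the eigenvectors of the kernel Gram matrix. The sketch below illustrates this under stated assumptions: the matrix `theta` is a random positive-definite stand-in rather than an actual NTK, and the squared loss is an assumption of the example.

```python
import numpy as np

def kernel_flow_residual(theta, f0, y, t):
    """Residual f_t - y under df/dt = -theta @ (f - y), solved in closed form."""
    evals, evecs = np.linalg.eigh(theta)
    return evecs @ (np.exp(-evals * t) * (evecs.T @ (f0 - y)))

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 8))
theta = A @ A.T / 8            # stand-in positive-definite Gram matrix
y = rng.standard_normal(5)     # training targets
f0 = np.zeros(5)               # network outputs at initialization

# Cross-check the closed form against explicit Euler steps.
f = f0.copy()
dt, steps = 1e-3, 2000
for _ in range(steps):
    f -= dt * theta @ (f - y)

closed_form = y + kernel_flow_residual(theta, f0, y, dt * steps)
print(np.max(np.abs(f - closed_form)))
```

Each eigen-direction of `theta` converges at its own rate, which is why the spectrum of the limiting kernel controls training speed in this regime.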

A sharp dichotomy theorem holds: for multi-layer MLPs, either the limit reduces entirely to a kernel regime, or it admits $O(1)$ feature evolution, depending on the scaling exponents (Yang et al., 2020). No intermediate behavior is possible at infinite width in standard architectures.

4. Extensions: Non-Gaussianities, Deep/Proportional Limits, Structured Models

Beyond Gaussianity: For models built with stable laws (heavy-tailed initializations), the infinite-width limit may yield stable processes with layerwise-dependent scaling, not GPs (Bordino et al., 2023).
Infinite-depth with Infinite-width: For architectures where both width $n$ and depth $L$ jointly diverge at constant ratio, the output law is not generically Gaussian, exhibiting either log-Gaussian behavior (for ReLU ResNets) (Li et al., 2021) or mixtures of Gaussians (for linear nets) (Bassetti et al., 2024).
Attention and Transformers: In the infinite-width, fixed-heads limit, the law of multi-head attention layers is hierarchical non-Gaussian—a Gaussian mixture conditioned on random similarity scores—implying nontrivial heavy-tailed feature statistics that differ fundamentally from the MLP case (Sakai et al., 1 Jun 2025).
Sparsity and Structure: Pruning masks in sparse MLPs of diverging width can converge to a limiting graphon, dictating a "Graphon NTK" operator whose spectrum governs trainability and convergence rates in the limit (Pham et al., 20 Oct 2025).
Graph Neural Networks: The infinite-width GCN limit yields a fixed NNGP kernel for node/graph classification. Introducing a regularization "knob" enables interpolation between kernel (fixed representation) and rich feature-learning on heterophilous graphs (Anson et al., 2024).

5. Universality and Quantitative Convergence Rates

The convergence to Gaussian process behavior and/or kernel dynamics in the infinite-width limit is universal under mild moment and regularity assumptions on weight distributions and nonlinearities (Giovagnini et al., 4 May 2026, Hanin, 2021). Recent advances provide explicit non-asymptotic bounds on the distance between the finite-width output law and its Gaussian limit for networks of depth $L$ , and similar rates hold in total variation and Wasserstein distances (Giovagnini et al., 4 May 2026, Hanin, 2021). For Bayesian neural networks with hierarchical priors, the infinite-width posterior converges to a Student-t process, at an explicit rate in the width (Caporali et al., 6 Feb 2025, Pacelli et al., 2022).

These results establish the applicability of infinite-width theory to realistic, moderate-width regimes, with explicit error quantification.
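A simple way to observe the approach to Gaussianity is to track a non-Gaussianity statistic of the output at initialization as the width grows. The sketch below is a toy illustration, not the bound from the cited works: it assumes a one-hidden-layer ReLU network $f(x) = n^{-1/2} \sum_k a_k\, \mathrm{relu}(h_k)$ with standard normal $a_k$ and unit-variance preactivations $h_k$, for which a direct moment calculation gives excess kurtosis $15/n$. The function names are hypothetical.

```python
import numpy as np

def output_samples(width, n_draws, rng):
    """Draw f(x) = width**-0.5 * sum_k a_k * relu(h_k) over random initializations.

    For a fixed input x the preactivations h_k = w_k . x are i.i.d. Gaussian,
    so they are sampled directly (unit variance is assumed here).
    """
    h = rng.standard_normal((n_draws, width))
    a = rng.standard_normal((n_draws, width))
    return (a * np.maximum(h, 0.0)).sum(axis=1) / np.sqrt(width)

def excess_kurtosis(samples):
    """Sample excess kurtosis; zero for an exactly Gaussian distribution."""
    centered = samples - samples.mean()
    return np.mean(centered**4) / np.mean(centered**2) ** 2 - 3.0

rng = np.random.default_rng(0)
for width in (1, 4, 16, 64):
    ek = excess_kurtosis(output_samples(width, 50_000, rng))
    print(f"width={width:3d}  excess kurtosis={ek:.3f}  15/width={15 / width:.3f}")
```

The Monte Carlo estimates fluctuate from run to run, especially at width 1 where the output is heavy-tailed, but the overall decay with width should be visible.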

6. Implications for Architecture Design, Training, and Generalization

Key implications established by these results include:

Architecture and Initialization: To access kernel (NTK/GP) behavior, initialize in the "ordered" (non-chaotic) phase and choose small depth-to-width ratios ( $L/n \ll 1$ ) (Seleznova et al., 2022). For feature learning at infinite width, maximal-update ( $\mu$P) or mean-field scaling must be enforced (Yang et al., 2020, Hajjar et al., 2021, Chizat et al., 2022).
Generalization and Implicit Bias: The phase diagram governs the transition between regimes with explicit kernel bias (lazy/linear), nonlinear mean-field regularization, or strong feature condensation (Luo et al., 2020).
Application to Sparsity and Ensembles: Spectral analysis of limiting NTKs, whether structured (graphon, ensemble) or not, predicts convergence speeds, trainability, and generalization in large sparsified networks (Pham et al., 20 Oct 2025, Velikanov et al., 2022).
Beyond Kernels: In regimes where kernel approximation fails (deep, proportional depth, strong feature coupling), the infinite-width limit reveals nontrivial interactions, non-Gaussian statistics, and label-dependent posterior covariances not seen in kernel approximations (Bassetti et al., 2024, Sakai et al., 1 Jun 2025).
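The "ordered" versus chaotic distinction in the first item can be checked numerically for a given initialization. The sketch below uses the criterion from the mean-field signal-propagation literature, which is an assumption here since the text does not state it: iterate the variance map to its fixed point $q^*$ and compute $\chi = \sigma_w^2\, \mathbb{E}[\phi'(h)^2]$ with $h \sim \mathcal N(0, q^*)$; $\chi < 1$ is taken as ordered and $\chi > 1$ as chaotic. The activation is assumed to be tanh, and `chi_tanh` is a hypothetical helper.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

# Gauss-Hermite rule for the weight exp(-z^2 / 2), normalized to N(0, 1).
_NODES, _WEIGHTS = hermegauss(80)
_WEIGHTS = _WEIGHTS / np.sqrt(2.0 * np.pi)

def gaussian_mean(fn, variance):
    """E[fn(h)] for h ~ N(0, variance), by quadrature."""
    return float(np.sum(_WEIGHTS * fn(np.sqrt(variance) * _NODES)))

def chi_tanh(sigma_w2, sigma_b2, n_iter=500):
    """Return (q_star, chi) for a tanh network with the given variances."""
    q = 1.0
    for _ in range(n_iter):
        # Variance map: q <- sigma_b^2 + sigma_w^2 E[tanh(h)^2].
        q = sigma_b2 + sigma_w2 * gaussian_mean(lambda h: np.tanh(h) ** 2, q)
    # tanh'(h) = 1 - tanh(h)^2
    chi = sigma_w2 * gaussian_mean(lambda h: (1.0 - np.tanh(h) ** 2) ** 2, q)
    return q, chi

for sw2 in (0.5, 4.0):
    q_star, chi = chi_tanh(sw2, 0.0)
    phase = "ordered" if chi < 1.0 else "chaotic"
    print(f"sigma_w^2={sw2}: q*={q_star:.3f}, chi={chi:.3f} -> {phase}")
```

Close to the boundary $\chi = 1$ the fixed-point iteration converges slowly, so `n_iter` may need to be increased there.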

7. Limitations, Open Directions, and Practical Guidance

Breakdown Regimes: Infinite-width theory becomes inaccurate at extreme depths relative to width, as in recurrent architectures where finite-width corrections accumulate over time (Seleznova, 6 May 2026).
Finite-Width Corrections: $O(1/n)$ corrections can be systematically computed using perturbative or statistical mechanics approaches, and in certain cases explain observed feature learning in large but finite networks (Bahri et al., 2023, Pacelli et al., 2022).
Adaptive Optimizers: Classical mean-field theory may fail for deep nets under vanilla GD; adaptive optimizers (e.g., RMSProp) restore nontrivial MF limits (Golikov, 2020).
Non-Gaussianity and Heavy Tails: Stable initialization, deep joint limits, attention models, and hierarchically structured outputs require limit theories beyond classical Gaussian/NTK/GP paradigms (Bordino et al., 2023, Sakai et al., 1 Jun 2025, Li et al., 2021).
Design Recommendations: The correct scaling for initialization and learning rates must be selected according to the desired regime (kernel vs. feature-learning), with explicit procedures now identified for both (Yang et al., 2020, Hajjar et al., 2021).

In sum, the infinite-width limit framework has unified much of neural network theory, revealing dichotomies between lazy and feature-learning regimes, mapping phase diagrams in scaling space, and delivering universal quantitative control under general conditions. Rapidly expanding analytical techniques, including tensor program formalisms and statistical mechanics, now treat structured models, non-Gaussianity, and joint depth-width scaling, providing precise guidance for both theoretical development and large-scale architecture/training design (Luo et al., 2020, Bahri et al., 2023, Yang et al., 2020, Sakai et al., 1 Jun 2025).