Infinite-Width Limit: Theory & Implications

Updated 9 May 2026
  • The infinite-width limit is the regime where neural network widths tend to infinity, simplifying training dynamics to Gaussian process and deterministic kernel behaviors.
  • The framework distinguishes between lazy (kernel) and feature-learning regimes using scaling phase diagrams that guide initialization and architecture design.
  • Quantitative convergence bounds and extensions to non-Gaussian, deep proportional, and structured models offer practical insights for theoretical and applied research.

The infinite-width limit refers to a mathematical regime in which the width (number of neurons per layer) of a neural network tends to infinity. In this asymptotic limit, both the initialization and training dynamics of deep networks often simplify dramatically, revealing universal behaviors such as Gaussian process (GP) priors, deterministic training kernels, and phase diagrams of scaling regimes. The infinite-width analysis provides a principled foundation for understanding neural networks' expressivity, generalization, implicit regularization, and feature learning capacities.

1. Foundations: Definition, Gaussian Process Limit, and Kernel Recursions

In fully connected feed-forward networks, sending all hidden layer widths $n_1, n_2, \ldots, n_L \to \infty$ at fixed depth and fixed input/output dimensions, with standard weight and bias scaling ($\mathcal N(0, \sigma_w^2/n_{l-1})$ for weights, $\mathcal N(0, \sigma_b^2)$ for biases), the output functions converge in law to a Gaussian process (Hanin, 2021, Bahri et al., 2023, Giovagnini et al., 4 May 2026). The layerwise covariances are defined recursively:

\begin{align*}
K^{(0)}(x, x') &= \sigma_b^2 + \sigma_w^2 \, x \cdot x' / n_0 \\
K^{(l)}(x, x') &= \sigma_b^2 + \sigma_w^2 \, \mathbb{E}_{(u, v)\sim \mathcal N(0, \Sigma^{(l-1)})}\bigl[ \phi(u) \phi(v) \bigr]
\end{align*}

for $l = 1,\ldots,L$, with $\Sigma^{(l-1)}$ built from $K^{(l-1)}$.
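The recursion above can be evaluated numerically once the Gaussian expectation has a closed form. The following is a minimal sketch for the special case $\phi = \mathrm{ReLU}$, where the expectation is given by the well-known arc-cosine formula; the function names `relu_expectation` and `nngp_kernel` and the default values $\sigma_w^2 = 2$, $\sigma_b^2 = 0$ are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

def relu_expectation(k11, k12, k22):
    """E[relu(u) relu(v)] for (u, v) ~ N(0, [[k11, k12], [k12, k22]]).

    Uses the closed-form arc-cosine expression; assumes k11, k22 > 0.
    """
    norm = np.sqrt(k11 * k22)
    cos_theta = np.clip(k12 / norm, -1.0, 1.0)  # guard against rounding
    theta = np.arccos(cos_theta)
    return norm * (np.sin(theta) + (np.pi - theta) * cos_theta) / (2.0 * np.pi)

def nngp_kernel(X, depth, sigma_w2=2.0, sigma_b2=0.0):
    """Iterate K^(0) -> K^(depth) on the rows of X (one input per row)."""
    n0 = X.shape[1]
    K = sigma_b2 + sigma_w2 * (X @ X.T) / n0          # K^(0)
    for _ in range(depth):
        diag = np.diag(K)
        # Sigma^(l-1) for the pair (x, x') has entries K(x,x), K(x,x'), K(x',x').
        K = sigma_b2 + sigma_w2 * relu_expectation(diag[:, None], K, diag[None, :])
    return K

if __name__ == "__main__":
    X = np.array([[1.0, 2.0, 0.5], [0.3, -1.0, 2.0]])
    print(nngp_kernel(X, depth=3))
```

With $\sigma_w^2 = 2$ and $\sigma_b^2 = 0$, the diagonal $K^{(l)}(x, x)$ is preserved from layer to layer in this ReLU case, which gives a quick sanity check on any implementation.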

At infinite width and for fixed $L$, the joint law of network outputs on any finite input set is multivariate Gaussian. For Bayesian neural networks, infinitesimal learning rates and quadratic losses, the posterior over functions becomes that of a Gaussian process with the so-called NNGP (Neural Network GP) kernel (Pacelli et al., 2022, Bahri et al., 2023). Similarly, the neural tangent kernel (NTK) (Bahri et al., 2023) can be computed with a coupled recursion, and governs the linearized dynamics of gradient flow around initialization.
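For reference, the coupled NTK recursion is commonly written in the following form. This is a sketch assuming the NTK parameterization with trained biases and the same $\sigma_w^2$, $\sigma_b^2$, $K^{(l)}$, and $\Sigma^{(l-1)}$ as in the NNGP recursion; the symbols $\Theta^{(l)}$ and $\dot K^{(l)}$ are notation introduced here for illustration, and other parameterizations change the prefactors.

```latex
\begin{align*}
\Theta^{(0)}(x, x') &= K^{(0)}(x, x') \\
\dot K^{(l)}(x, x') &= \sigma_w^2 \, \mathbb{E}_{(u, v)\sim \mathcal N(0, \Sigma^{(l-1)})}\bigl[ \phi'(u) \phi'(v) \bigr] \\
\Theta^{(l)}(x, x') &= K^{(l)}(x, x') + \dot K^{(l)}(x, x') \, \Theta^{(l-1)}(x, x'), \qquad l = 1, \ldots, L
\end{align*}
```

The first term in $\Theta^{(l)}$ collects the contribution of the parameters of layer $l$ itself, and the second propagates the kernel of the earlier layers through the derivative of the activation.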

2. Dynamical Phase Structure: NTK, Mean-Field, and Condensed Regimes

The infinite-width limit supports a taxonomy of learning regimes determined by how initialization and learning rates scale with respect to width (Luo et al., 2020, Golikov, 2020, Golikov, 2020, Yang et al., 2020). For two-layer ReLU networks,

$$f_{\theta}^\alpha(x) = \frac{1}{\alpha} \sum_{k=1}^m a_k \sigma(w_k^\top x),$$

with output weights $a_k$ initialized at scale $\beta_1$ and input weights $w_k$ at scale $\beta_2$, the scaling parameters

$$\kappa = \frac{\beta_1 \beta_2}{\alpha}, \quad \gamma = -\lim_{m\to\infty} \frac{\log \kappa}{\log m}, \quad \gamma' = -\lim_{m\to\infty} \frac{\log(\beta_1/\beta_2)}{\log m}$$

yield three regimes as $m \to \infty$ (Luo et al., 2020):

  • Linear regime ($\gamma < 1$ or $\gamma' > \gamma - 1$): vanishing relative change of the input weights. Training is governed by a fixed NTK, and the network behaves as a kernel method with exponentially convergent loss.
  • Critical regime (boundary, $\gamma \ge 1$ and $\gamma' = \gamma - 1$): $O(1)$ relative change in the input weights. Dynamics reduce to a nonlinear mean-field PDE for neuron distributions, with nontrivial (non-kernel) feature evolution.
  • Condensed regime ($\gamma > 1$ and $\gamma' < \gamma - 1$): diverging relative change in the input weights; neurons cluster at discrete orientations. This is a strongly nonlinear regime with an emergent finite set of active features.

This phase diagram precisely predicts both qualitative and quantitative properties observed in synthetic and real tasks. For deeper or more general networks, a similar diagram arises from the scaling analysis of learning rates and weight variances (Golikov, 2020, Golikov, 2020, Yang et al., 2020).
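As a rough illustration of how the exponents locate a given scaling in the phase diagram, the sketch below assumes power-law scalings $\alpha = m^{a}$, $\beta_1 = m^{b_1}$, $\beta_2 = m^{b_2}$, so that $\gamma = a - b_1 - b_2$ and $\gamma' = b_2 - b_1$ follow directly from the definitions of $\kappa$, $\gamma$, and $\gamma'$. The regime boundaries coded here ($\gamma < 1$ or $\gamma' > \gamma - 1$ for linear, $\gamma' = \gamma - 1$ with $\gamma \ge 1$ for critical, $\gamma' < \gamma - 1$ with $\gamma > 1$ for condensed) are stated as an assumption about the diagram of Luo et al. (2020) and should be checked against that paper; the function names are hypothetical.

```python
def scaling_exponents(a, b1, b2):
    """Return (gamma, gamma') for alpha = m**a, beta_1 = m**b1, beta_2 = m**b2."""
    gamma = a - b1 - b2       # -log(kappa)/log(m), with kappa = beta_1*beta_2/alpha
    gamma_prime = b2 - b1     # -log(beta_1/beta_2)/log(m)
    return gamma, gamma_prime

def classify_regime(gamma, gamma_prime, tol=1e-12):
    """Classify a point of the (gamma, gamma') plane under the assumed boundaries."""
    if gamma < 1 - tol or gamma_prime > gamma - 1 + tol:
        return "linear"
    if abs(gamma_prime - (gamma - 1)) <= tol:
        return "critical"
    if gamma > 1 + tol:
        return "condensed"
    return "unclassified"     # point not covered by the three assumed conditions

if __name__ == "__main__":
    # alpha = sqrt(m), unit-scale weights: the usual NTK-style normalization.
    print(classify_regime(*scaling_exponents(0.5, 0.0, 0.0)))
    # alpha = m, unit-scale weights: the usual mean-field normalization.
    print(classify_regime(*scaling_exponents(1.0, 0.0, 0.0)))
```

Under these assumptions, the $1/\sqrt{m}$ normalization with unit-scale weights lands in the linear regime and the $1/m$ normalization lands on the critical boundary, consistent with the NTK and mean-field pictures described in the text.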

3. Training Dynamics: Kernel vs. Feature Learning Limits

The NTK and mean-field (MF) limits are distinct in the infinite-width regime:

  • NTK (Kernel/Lazy Training): For standard or "NTK" parameterizations, the trainable function changes only in the directions set at initialization; internal features remain fixed. Training reduces to kernel gradient descent, [formula missing], with [symbol missing] the deterministic NTK (Bahri et al., 2023, Yang et al., 2020, Seleznova et al., 2022).
  • Mean-Field and Maximal-Update (Feature Learning): For "mean-field" or "maximal-update" (μP) parameterizations, features evolve nontrivially. In the two-layer case, the empirical neuron distribution obeys a Wasserstein gradient flow PDE (Luo et al., 2020, Yang et al., 2020). For deeper nets, maximal feature learning at infinite width requires specific scaling of per-layer initialization and learning rate (Yang et al., 2020, Hajjar et al., 2021).
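To make the kernel (lazy) limit concrete, the sketch below solves kernel gradient flow on the training set in closed form. It assumes a fixed positive semi-definite kernel matrix `theta` on the training inputs, the loss $\tfrac12 \lVert f - y \rVert^2$, and continuous-time dynamics $\dot f_t = -\Theta\,(f_t - y)$, whose solution is $f_t = y + e^{-\Theta t}(f_0 - y)$. The symbol $\Theta$ and the function name are illustrative choices, not notation from the cited papers.

```python
import numpy as np

def kernel_flow_predictions(theta, f0, y, t):
    """Training-set outputs at time t under df/dt = -theta @ (f - y).

    theta: (n, n) symmetric PSD kernel matrix, held fixed during training.
    f0:    (n,) network outputs at initialization.
    y:     (n,) regression targets.
    """
    evals, evecs = np.linalg.eigh(theta)
    decay = np.exp(-evals * t)                 # each eigen-direction decays separately
    return y + evecs @ (decay * (evecs.T @ (f0 - y)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 5))
    theta = A @ A.T + 0.1 * np.eye(5)          # stand-in for a limiting NTK matrix
    f0, y = rng.standard_normal(5), rng.standard_normal(5)
    for t in (0.0, 0.1, 1.0, 10.0):
        f_t = kernel_flow_predictions(theta, f0, y, t)
        print(t, 0.5 * np.sum((f_t - y) ** 2))
```

Each eigen-direction of the kernel converges at a rate set by its eigenvalue, which is the sense in which the loss converges exponentially when the kernel is strictly positive definite.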

A sharp dichotomy theorem holds: for multi-layer MLPs, either the limit reduces entirely to a kernel regime, or it admits [formula missing] feature evolution, depending on the scaling exponents (Yang et al., 2020). No intermediate behavior is possible at infinite width in standard architectures.
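The contrast between the two limits can be probed numerically on a small two-layer ReLU network $f(x) = \frac{1}{\alpha}\sum_k a_k\,\mathrm{relu}(w_k^\top x)$. The sketch below is a toy experiment, not a reproduction of any cited result: the data, the target, the unit-scale initialization, and the learning-rate rule `lr = eta * alpha**2 / m` (chosen so that the function-space step size is comparable across normalizations) are all assumptions made for illustration.

```python
import numpy as np

def relative_weight_change(m, alpha, steps=200, eta=0.1, seed=0):
    """Full-batch GD on 0.5 * mean squared error; returns
    ||W_final - W_init|| / ||W_init|| for the input weights W."""
    rng = np.random.default_rng(seed)
    n, d = 20, 5
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)    # unit-norm inputs
    y = np.sin(3.0 * X[:, 0])                        # arbitrary smooth target
    W = rng.standard_normal((m, d))                  # input weights, unit scale
    a = rng.standard_normal(m)                       # output weights, unit scale
    W0 = W.copy()
    lr = eta * alpha**2 / m                          # assumed learning-rate scaling
    for _ in range(steps):
        pre = X @ W.T                                # (n, m) pre-activations
        act = np.maximum(pre, 0.0)
        resid = act @ a / alpha - y                  # (n,) residuals f(x_i) - y_i
        grad_a = act.T @ resid / (alpha * n)
        grad_W = ((resid[:, None] * (pre > 0)) * a[None, :]).T @ X / (alpha * n)
        a -= lr * grad_a
        W -= lr * grad_W
    return np.linalg.norm(W - W0) / np.linalg.norm(W0)

if __name__ == "__main__":
    m = 2000
    print("alpha = sqrt(m):", relative_weight_change(m, alpha=np.sqrt(m)))
    print("alpha = m:      ", relative_weight_change(m, alpha=float(m)))
```

In this setup the $1/\sqrt{m}$ normalization should leave the input weights close to their initial values, while the $1/m$ normalization should move them by a much larger relative amount, mirroring the kernel versus feature-learning distinction.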

4. Extensions: Non-Gaussianities, Deep/Proportional Limits, Structured Models

  • Beyond Gaussianity: For models built with stable laws (heavy-tailed initializations), the infinite-width limit may yield stable processes with layerwise-dependent scaling, not GPs (Bordino et al., 2023).
  • Infinite-depth with Infinite-width: For architectures where both width $n$ and depth $L$ jointly diverge at constant ratio, the output law is not generically Gaussian, exhibiting either log-Gaussian (for ReLU ResNets) (Li et al., 2021) or mixtures of Gaussians (for linear nets) (Bassetti et al., 2024).
  • Attention and Transformers: In the infinite-width, fixed-heads limit, the law of multi-head attention layers is hierarchical and non-Gaussian (a Gaussian mixture conditioned on random similarity scores), implying nontrivial heavy-tailed feature statistics that differ fundamentally from the MLP case (Sakai et al., 1 Jun 2025).
  • Sparsity and Structure: Pruning masks in sparse MLPs of diverging width can converge to a limiting graphon, dictating a "Graphon NTK" operator whose spectrum governs trainability and convergence rates in the limit (Pham et al., 20 Oct 2025).
  • Graph Neural Networks: The infinite-width GCN limit yields a fixed NNGP kernel for node/graph classification. Introducing a regularization "knob" enables interpolation between kernel (fixed representation) and rich feature-learning on heterophilous graphs (Anson et al., 2024).

5. Universality and Quantitative Convergence Rates

The convergence to Gaussian process behavior and/or kernel dynamics in the infinite-width limit is universal under mild moment and regularity assumptions on weight distributions and nonlinearities (Giovagnini et al., 4 May 2026, Hanin, 2021). Recent advances provide explicit non-asymptotic bounds ([formula missing]) for networks of depth $L$, and similar rates hold in total variation and Wasserstein-[index missing] distance (Giovagnini et al., 4 May 2026, Hanin, 2021). For Bayesian neural networks with hierarchical priors, the infinite-width posterior converges to a Student-t process, with convergence rate [formula missing] (Caporali et al., 6 Feb 2025, Pacelli et al., 2022).

These results establish the applicability of infinite-width theory to realistic, moderate-width regimes, with explicit error quantification.
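A simple way to see the approach to Gaussianity at finite width is to measure the excess kurtosis of the output of random networks at a fixed input, which is zero for an exactly Gaussian output. The sketch below is a Monte Carlo illustration under assumed settings (one hidden ReLU layer, $\sigma_w^2 = 2$, $\sigma_b^2 = 0$, an arbitrary fixed input); it is not the metric used in the cited bounds. For this particular setup a short moment calculation suggests an excess kurtosis of $15/n$ at hidden width $n$, which the estimate should roughly track.

```python
import numpy as np

def output_excess_kurtosis(width, n_samples=20000, seed=0):
    """Excess kurtosis of the scalar output of random one-hidden-layer ReLU
    networks at a fixed input, estimated over independent weight draws."""
    rng = np.random.default_rng(seed)
    n0 = 8
    x = np.ones(n0)                                   # arbitrary fixed input
    sigma_w2 = 2.0
    # With first-layer weights N(0, sigma_w2 / n0), each hidden pre-activation
    # is N(0, sigma_w2 * |x|^2 / n0), independently across units and draws.
    pre = rng.standard_normal((n_samples, width)) * np.sqrt(sigma_w2 * (x @ x) / n0)
    hidden = np.maximum(pre, 0.0)
    w_out = rng.standard_normal((n_samples, width)) * np.sqrt(sigma_w2 / width)
    out = np.sum(w_out * hidden, axis=1)
    centered = out - out.mean()
    return np.mean(centered**4) / np.mean(centered**2) ** 2 - 3.0

if __name__ == "__main__":
    for width in (2, 8, 32, 128):
        print(width, output_excess_kurtosis(width))
```

The estimate is noisy at small widths, where the output is visibly heavy-tailed, and becomes statistically indistinguishable from zero as the width grows.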

6. Implications for Architecture Design, Training, and Generalization

Key implications established by these results include:

  • Architecture and Initialization: To access kernel (NTK/GP) behavior, initialize in the "ordered" (non-chaotic) phase and choose depth-to-width ratios [formula missing] (Seleznova et al., 2022). For feature learning at infinite width, maximal-update (μP) or mean-field scaling must be enforced (Yang et al., 2020, Hajjar et al., 2021, Chizat et al., 2022).
  • Generalization and Implicit Bias: The phase diagram governs the transition between regimes with explicit kernel bias (lazy/linear), nonlinear mean-field regularization, or strong feature condensation (Luo et al., 2020).
  • Application to Sparsity and Ensembles: Spectral analysis of limiting NTKs, whether structured (graphon, ensemble) or not, predicts convergence speeds, trainability, and generalization in large sparsified networks (Pham et al., 20 Oct 2025, Velikanov et al., 2022).
  • Beyond Kernels: In regimes where kernel approximation fails (deep, proportional depth, strong feature coupling), the infinite-width limit reveals nontrivial interactions, non-Gaussian statistics, and label-dependent posterior covariances not seen in kernel approximations (Bassetti et al., 2024, Sakai et al., 1 Jun 2025).

7. Limitations, Open Directions, and Practical Guidance

  • Breakdown Regimes: Infinite-width theory becomes inaccurate at extreme depths relative to width, as in recurrent architectures where finite-width corrections accumulate on a [formula missing] timescale (Seleznova, 6 May 2026).
  • Finite-Width Corrections: Corrections of order [formula missing] can be systematically computed using perturbative or statistical mechanics approaches, and in certain cases explain observed feature learning in large but finite networks (Bahri et al., 2023, Pacelli et al., 2022).
  • Adaptive Optimizers: Classical mean-field theory may fail for deep nets under vanilla GD; adaptive optimizers (e.g., RMSProp) restore nontrivial MF limits (Golikov, 2020).
  • Non-Gaussianity and Heavy Tails: Stable initialization, deep joint limits, attention models, and hierarchically structured outputs require limit theories beyond classical Gaussian/NTK/GP paradigms (Bordino et al., 2023, Sakai et al., 1 Jun 2025, Li et al., 2021).
  • Design Recommendations: The correct scaling for initialization and learning rates must be selected according to the desired regime (kernel vs. feature-learning), with explicit procedures now identified for both (Yang et al., 2020, Hajjar et al., 2021).

In sum, the infinite-width limit framework has unified much of neural network theory, revealing dichotomies between lazy and feature-learning regimes, mapping phase diagrams in scaling space, and delivering universal quantitative control under general conditions. Rapidly expanding analytical techniques, including tensor program formalisms and statistical mechanics, now treat structured models, non-Gaussianity, and joint depth-width scaling, providing precise guidance for both theoretical development and large-scale architecture/training design (Luo et al., 2020, Bahri et al., 2023, Yang et al., 2020, Sakai et al., 1 Jun 2025).
