Mean-Field Limits of Neural Networks

Updated 4 July 2026

Mean-field limits of neural networks are continuum descriptions that replace discrete neurons with a probability law over parameters as the width tends to infinity.
They are formulated through PDEs, integro-differential equations, and McKean–Vlasov equations that describe how the network evolves during training.
This framework yields results on optimization, finite-width corrections, and global convergence.

Mean-field limits of neural networks are asymptotic descriptions in which a network with width tending to infinity, together with a compatible scaling of optimization, is replaced by a deterministic evolution of a probability law over parameters, neurons, functions, or paths. In this regime, empirical averages over neurons converge to continuum objects, stochastic gradient dynamics are recast as continuity equations, integro-differential systems, or McKean–Vlasov equations, and the resulting limit is generally nonlinear and feature-learning rather than a fixed-kernel linearization (Sirignano et al., 2019, Golikov, 2020, Araújo et al., 2019).

1. Scaling regimes and state variables

The basic two-layer mean-field model writes

$f_m(x;\theta_1,\dots,\theta_m)=m^{-1}\sum_{i=1}^m a_i\,\phi(w_i^T x), \qquad \theta_i=(a_i,w_i)\in\mathbb R^{1+d_0},$

with empirical measure

$\rho_k^m=\frac1m\sum_{i=1}^m \delta_{\theta_i^{(k)}}.$

The population loss is

$R(\rho)=\mathbb E_{x,y}[\ell(y,f[\rho;x])], \qquad f[\rho;x]=\int a\,\phi(w^T x)\,\rho(da,dw).$

Under mean-field scaling, initialization is $O(1)$, learning rates are width-dependent, and parameter increments are $O(1/m)$, so that finite parameter changes persist in the infinite-width limit (Golikov, 2020).
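As a minimal sketch of the definitions above, the output $f_m$ can be computed as a plain average over neurons. The function names, the choice $\phi=\tanh$, the squared loss, and the toy parameters are illustrative assumptions, not taken from the cited papers. The point of the example is that $f_m$ depends on the parameters only through the empirical measure: replicating every neuron doubles the width but leaves the measure, and hence the output, unchanged.

```python
import math

def f_m(params, x, phi=math.tanh):
    """Two-layer mean-field output: (1/m) * sum_i a_i * phi(w_i . x)."""
    m = len(params)
    total = 0.0
    for a, w in params:
        pre_activation = sum(wj * xj for wj, xj in zip(w, x))
        total += a * phi(pre_activation)
    return total / m

def empirical_loss(params, data, phi=math.tanh):
    """Finite-sample stand-in for R(rho), assuming the squared loss."""
    return sum(0.5 * (y - f_m(params, x, phi)) ** 2 for x, y in data) / len(data)

# Hypothetical toy parameters theta_i = (a_i, w_i) with input dimension 2.
theta = [(1.0, (0.5, -0.2)), (-0.7, (1.0, 0.3)), (0.4, (-0.6, 0.8))]
x = (0.3, -1.2)

out = f_m(theta, x)
out_doubled = f_m(theta + theta, x)  # same empirical measure, width 2m
sample_loss = empirical_loss(theta, [(x, 0.25)])
```

Here `out` and `out_doubled` agree up to floating-point rounding, which is the finite-width counterpart of writing the output as $f[\rho;x]$.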

For deep fully connected networks, the scaling is layer dependent. In the formulation of Sirignano and Spiliopoulos, an $L$-hidden-layer network has forward normalization by $1/N_{\ell-1}$ at each hidden layer and $1/N_L$ at the output. For the $L$-layer case, the learning-rate scaling is

$\alpha_C=\frac{N_L}{N_1},\qquad \alpha_{W,1}=1,\qquad \alpha_{W,\ell}=\frac{N_\ell}{N_1}\quad (\ell=2,\dots,L).$

This scaling “balances contributions of each layer in the infinite-width limit”; with a constant rate, the network “freezes” (Sirignano et al., 2019).
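The scaling rule can be read off mechanically from the layer widths. The helper below is a hypothetical illustration that evaluates the displayed formula literally; how these factors enter the actual update is specified in the cited paper and is not reproduced here.

```python
def learning_rate_scalings(widths):
    """Return (alpha_C, [alpha_W1, ..., alpha_WL]) for hidden widths [N_1, ..., N_L]."""
    if not widths or any(n <= 0 for n in widths):
        raise ValueError("widths must be a non-empty list of positive integers")
    n_first = widths[0]
    alpha_c = widths[-1] / n_first                        # alpha_C = N_L / N_1
    alpha_w = [1.0] + [n / n_first for n in widths[1:]]   # alpha_W1 = 1, alpha_Wl = N_l / N_1
    return alpha_c, alpha_w

# Hypothetical widths N_1 = 100, N_2 = 200, N_3 = 400.
alpha_c, alpha_w = learning_rate_scalings([100, 200, 400])
# alpha_c == 4.0 and alpha_w == [1.0, 2.0, 4.0]
```

When all hidden layers have the same width, every factor equals 1, so the layer dependence only matters for unequal widths.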

Several mathematically distinct state descriptions have been developed for these limits.

Setting	State variable	Limit evolution
Two-layer mean-field training	empirical measure $\rho_k^m$ of neuron parameters	discrete map or parameter-space PDE
Deep sequential limit	empirical layerwise parameter distributions	deterministic integro-differential system / Liouville PDE
Three-layer neuronal embedding	weight functions of abstract neurons on a fixed probability space	deterministic ODE system
Deep path-space limit	law of input-output parameter paths	McKean–Vlasov ODE
Functional-space three-layer limit	measure over output weights and pre-activation functions	kernel gradient flow with time-varying kernel

The deep sequential formulation uses empirical parameter distributions over products of layerwise parameters and output weights, seeded by an initial law with compact support and continuous densities (Sirignano et al., 2019). The three-layer neuronal-embedding framework places finite-width networks and the infinite-width limit on a single probability space, with deterministic anchor functions defined on it, so that initialization for all widths is coupled through sampled abstract neurons (Pham et al., 2021). A different deep construction treats each input-to-output parameter path as a particle and studies the empirical distribution of such paths, yielding a path-space McKean–Vlasov limit for deep networks with fixed random features near the input and output (Araújo et al., 2019). For partially trained three-layer models with a fixed random first layer, the relevant state is a measure on a functional space in which neurons are represented by output weights and pre-activation functions (Chen et al., 2022).

These formulations are not interchangeable. A plausible implication is that “the” mean-field limit of a neural network is not a single universal object, but a family of asymptotic descriptions indexed by architecture, training rule, and parametrization.

2. Limiting equations and derivation methods

In the two-layer setting, discrete-time gradient descent on the empirical measure is expressed by a deterministic Markov operator: $\rho_{k+1}^m=\mathcal T(\rho_k^m)$, where $\mathcal T(\rho)$ pushes each atom of $\rho$ forward by the parameter update $\theta\mapsto\theta-\eta\,\nabla_\theta\Psi(\theta;\rho)$. The corresponding continuous-time limit, obtained by rescaling time by the step size $\eta$, is the parameter-space PDE

$\partial_t\rho_t=\nabla_\theta\cdot\big(\rho_t\,\nabla_\theta\Psi(\theta;\rho_t)\big), \qquad \Psi(\theta;\rho)=\mathbb E_{x,y}\big[\partial_f\ell(y,f[\rho;x])\,a\,\phi(w^T x)\big].$

The induced network evolution is likewise closed at the level of the output function $f[\rho_t;\cdot]$ through a kernel built from $\phi$ and the current measure $\rho_t$ (Golikov, 2020).
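The particle picture behind the discrete-time map can be sketched in a few lines. The example assumes scalar inputs, $\phi=\tanh$, the squared loss, a hypothetical three-point dataset, and a step size of 0.1; none of these choices come from the cited work. Each atom $(a_i,w_i)$ is moved along the negative gradient of the first variation of the loss, which depends on the other atoms only through the current network output.

```python
import math

def output(particles, x):
    """Mean-field output (1/m) * sum_i a_i * tanh(w_i * x)."""
    return sum(a * math.tanh(w * x) for a, w in particles) / len(particles)

def loss(particles, data):
    return sum(0.5 * (output(particles, x) - y) ** 2 for x, y in data) / len(data)

def push_forward_step(particles, data, eta):
    """Move every atom by one gradient step on the first variation of the loss."""
    # Residuals depend on the particles only through the empirical measure.
    residuals = [(x, output(particles, x) - y) for x, y in data]
    n = len(data)
    moved = []
    for a, w in particles:
        grad_a = sum(r * math.tanh(w * x) for x, r in residuals) / n
        grad_w = sum(r * a * (1.0 - math.tanh(w * x) ** 2) * x for x, r in residuals) / n
        moved.append((a - eta * grad_a, w - eta * grad_w))
    return moved

# Hypothetical data and deterministic initial atoms.
data = [(-1.0, -0.5), (0.5, 0.3), (1.0, 0.6)]
particles = [(0.5 * (-1) ** i, 0.2 + 0.1 * i) for i in range(8)]

initial_loss = loss(particles, data)
for _ in range(200):
    particles = push_forward_step(particles, data, eta=0.1)
final_loss = loss(particles, data)
```

Because the per-atom step does not shrink with the number of atoms, individual parameters move by a finite amount, which is the feature-learning behaviour that distinguishes this regime from a fixed-kernel linearization.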

For deep networks, Sirignano and Spiliopoulos obtain a limit output

$R(\rho)=\mathbb E_{x,y}[\ell(y,f[\rho;x])], \qquad f[\rho;x]=\int a\,\phi(w^T x)\,\rho(da,dw).$ 6

where the parameter trajectories satisfy coupled deterministic ODEs indexed by initialization. In the two-hidden-layer case these equations govern $R(\rho)=\mathbb E_{x,y}[\ell(y,f[\rho;x])], \qquad f[\rho;x]=\int a\,\phi(w^T x)\,\rho(da,dw).$ 7, $R(\rho)=\mathbb E_{x,y}[\ell(y,f[\rho;x])], \qquad f[\rho;x]=\int a\,\phi(w^T x)\,\rho(da,dw).$ 8, and $R(\rho)=\mathbb E_{x,y}[\ell(y,f[\rho;x])], \qquad f[\rho;x]=\int a\,\phi(w^T x)\,\rho(da,dw).$ 9, while in measure form each layer obeys a continuity equation

$O(1)$ 0

The velocity fields are explicit nonlocal functionals of $O(1)$ 1, $O(1)$ 2, and the current network output (Sirignano et al., 2019).

The deep path-space theory replaces neuronwise empirical measures by a law over parameter paths. A typical particle satisfies a self-consistent ODE

$O(1)$ 4

with drift obtained by replacing finite-width backpropagation averages by integrals under this law. This yields existence, uniqueness, and propagation of chaos for the simultaneous infinite-width limit of all hidden layers (Araújo et al., 2019).

The proofs of these limits rely on different technical packages. The sequential deep limit uses tightness in Skorokhod space, uniform moment bounds via Gronwall's inequality, martingale-vanishing arguments, identification by test-function calculus, and uniqueness via coupling or fixed-point arguments (Sirignano et al., 2019). The discrete-time two-layer theory proves convergence by showing that the one-step operator is Lipschitz-continuous in its measure argument and then inducting on the iteration index (Golikov, 2020). The deep McKean–Vlasov path-space theory constructs a closed subset of measures on path space on which the McKean map is well defined and eventually contracting under Wasserstein distance (Araújo et al., 2019). The neuronal-embedding framework avoids measure-over-measure closure issues by coupling all widths to abstract neurons on a fixed probability space (Pham et al., 2021).

A recurrent point of methodology is that mean-field derivations are not purely formal law-of-large-numbers arguments. In deep models, the main obstruction is not only width but also the nested dependence created by backpropagation through multiple layers.

3. Optimization, expressivity, and global convergence

Under suitable assumptions, the limiting dynamics can inherit strong optimization properties. In the deep analysis of Sirignano and Spiliopoulos, assuming $O(1)$ 9 and, for global convergence, that $O(1/m)$ 0 is bounded, non-constant, monotone, and hence discriminatory, together with full support assumptions on the data measure and the initialization, the limit system is a gradient flow in probability space with Lyapunov function

$O(1/m)$ 1

They prove $O(1/m)$ 2 and that any stationary measure yields zero loss; any limit point $O(1/m)$ 3 as $O(1/m)$ 4 satisfies $O(1/m)$ 5 for all $O(1/m)$ 6 (Sirignano et al., 2019).

For unregularized three-layer networks, Nguyen and Pham prove a global convergence theorem in the mean-field regime under bounded-Lipschitz assumptions on $O(1/m)$ 7 and $O(1/m)$ 8, full support of the first-layer initialization on $O(1/m)$ 9, convergence of the mean-field parameters $L$ 0, and density of $L$ 1 in $L$ 2. If $L$ 3 is convex, then

$L$ 4

More generally, when $L$ 5 and $L$ 6 is a deterministic function of $L$ 7, then $L$ 8. A central ingredient is a universal-approximation property at any finite training time, obtained through an algebraic topology argument showing that the first-layer support remains all of $L$ 9 (Pham et al., 2021).

The functional-space mean-field theory of partially trained three-layer networks yields a different optimization picture. There the limit output

$1/N_{\ell-1}$ 0

obeys a kernel gradient flow with a time-varying kernel

$1/N_{\ell-1}$ 1

Because $1/N_{\ell-1}$ 2 is symmetric and, under mild assumptions, remains strictly positive-definite for all $1/N_{\ell-1}$ 3, the empirical $1/N_{\ell-1}$ 4-loss satisfies a linear-rate decay estimate of the form

$1/N_{\ell-1}$ 5

whenever $1/N_{\ell-1}$ 6 uniformly in time (Chen et al., 2022).
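The mechanism behind such a linear rate can be illustrated under an assumed normalization. If the residual $u_t=f_t-y$ on the training points evolves as $\dot u_t=-K_t u_t$ and the smallest eigenvalue of $K_t$ stays above some $\lambda>0$, then $\tfrac12\|u_t\|^2$ decays at least like $e^{-2\lambda t}$. The sketch below checks this numerically with explicit Euler steps; the $2\times 2$ kernel, the value $\lambda=0.5$, and the normalization are illustrative assumptions, and the constants in Chen et al. (2022) may differ.

```python
import math

LAMBDA = 0.5  # uniform lower bound on the smallest eigenvalue of kernel(t)

def kernel(t):
    """Hypothetical symmetric kernel with eigenvalues 1 + 0.5*cos(t) and 2 + 0.5*cos(t)."""
    diag = 1.5 + 0.5 * math.cos(t)
    return [[diag, 0.5], [0.5, diag]]

def half_sq_norm(u):
    return 0.5 * sum(ui * ui for ui in u)

def kernel_flow_losses(u0, step, n_steps):
    """Explicit Euler for du/dt = -K_t u; returns the loss after each step."""
    u, t = list(u0), 0.0
    losses = [half_sq_norm(u)]
    for _ in range(n_steps):
        kern = kernel(t)
        u = [u[i] - step * sum(kern[i][j] * u[j] for j in range(2)) for i in range(2)]
        t += step
        losses.append(half_sq_norm(u))
    return losses

step = 0.01
losses = kernel_flow_losses([1.0, -0.3], step, 500)
# Envelope exp(-2 * lambda * t) * L(0) evaluated on the same time grid.
bounds = [losses[0] * math.exp(-2.0 * LAMBDA * step * idx) for idx in range(len(losses))]
```

Every entry of `losses` stays below the matching entry of `bounds`, even though the kernel changes over time; only the uniform eigenvalue bound is used.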

These results clarify a common misconception: nonconvexity of the finite-width parameterization does not preclude strong convergence statements for the mean-field limit. The available theorems are conditional on activation regularity, support conditions, convergence modes, or positivity of time-varying kernels, but they establish that multilayer mean-field dynamics can be globally optimizing rather than merely descriptive.

4. Fluctuations, finite-width corrections, and trajectorial stability

The law-of-large-numbers limit is only the first term in the large-width expansion. For a single hidden layer, the centered fluctuation

$1/N_{\ell-1}$ 7

converges in a dual Sobolev space $1/N_{\ell-1}$ 8 to a Gaussian process solving a linear stochastic partial differential equation

$1/N_{\ell-1}$ 9

or equivalently

$1/N_L$ 0

The proof uses weak convergence, relative compactness, martingale problems, and uniqueness in a suitable Sobolev space (Sirignano et al., 2018).

For multilayer networks, Nguyen and Pham derive a second-order mean-field limit through the neuronal-embedding framework. The rescaled parameter deviations

$1/N_L$ 1

are approximated in $1/N_L$ 2 by a linear ODE system driven by a Gaussian sampling fluctuation process. They further obtain a central-limit theorem for the output fluctuation $1/N_L$ 3, with convergence in finite-dimensional moments at rate $1/N_L$ 4. Under additional assumptions, the width-scaled asymptotic variance

$1/N_L$ 5

is non-increasing and tends to zero as $1/N_L$ 6, so gradient descent in the mean-field regime progressively biases training toward solutions with “minimal fluctuation” in the learned output function (Pham et al., 2021).

Mean-field theory is also used as a quantitative approximation to finite-width training trajectories. Sirignano and Spiliopoulos state that finite networks, “even moderately wide,” follow the mean-field curves closely and refer to a CIFAR10 example (Sirignano et al., 2019). Nguyen’s nonrigorous multilayer formalism reports that as the uniform width $1/N_L$ 7 grows, the full training-loss and test-loss curves “lock in” onto a limiting trajectory; for depths $1/N_L$ 8, curves for $1/N_L$ 9 “essentially coincide” (Nguyen, 2019). In the discrete-time two-layer theory, the mean-field limit is shown to approximate finite-width networks better than the NTK limit when learning rates are not very small, precisely because it retains the $L$ 0 interaction term associated with feature learning (Golikov, 2020).

This finite-width perspective changes the role of the mean-field limit. It is not only an asymptotic simplification; it is also a controlled surrogate for wide-but-finite dynamics, with explicit next-order corrections in shallow networks and second-order fluctuation systems in multilayer settings.

5. Relation to NTK, $\mu$P, and control-theoretic extensions

Mean-field and NTK limits arise from different scalings. Under NTK scaling,

$f_m(x;\theta_1,\dots,\theta_m)=m^{-1/2}\sum_{i=1}^m a_i\,\phi(w_i^T x),$

with $L$ 3 learning rate and $L$ 4 parameter movement, the network admits a first-order Taylor linearization and the kernel converges to a fixed $L$ 5. By contrast, the mean-field scaling preserves parameter movement of order $L$ 6 over $L$ 7 iterations and retains the nonlinear $L$ 8 term. The discrete-time comparison shows that if $L$ 9, then $\alpha_C=\frac{N_L}{N_1},\qquad \alpha_{W,1}=1,\qquad \alpha_{W,\ell}=\frac{N_\ell}{N_1}\quad (\ell=2,\dots,L).$ 0 matters at finite width and the mean-field limit tracks finite-width networks better; if $\alpha_C=\frac{N_L}{N_1},\qquad \alpha_{W,1}=1,\qquad \alpha_{W,\ell}=\frac{N_\ell}{N_1}\quad (\ell=2,\dots,L).$ 1, $\alpha_C=\frac{N_L}{N_1},\qquad \alpha_{W,1}=1,\qquad \alpha_{W,\ell}=\frac{N_\ell}{N_1}\quad (\ell=2,\dots,L).$ 2 becomes negligible and the NTK approximation suffices (Golikov, 2020).
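At the level of the forward pass, the two regimes differ only in the prefactor. The sketch below uses hypothetical helper names, scalar inputs, and $\phi=\tanh$, and evaluates the same parameters under the $m^{-1}$ and $m^{-1/2}$ normalizations. Both the output and the sensitivity to a single output weight differ by exactly $\sqrt m$, which is why the two regimes need different learning-rate scalings to produce a non-trivial limit.

```python
import math

def two_layer(params, x, exponent):
    """m**(-exponent) * sum_i a_i * tanh(w_i * x); exponent 1.0 is mean-field, 0.5 is NTK."""
    m = len(params)
    return sum(a * math.tanh(w * x) for a, w in params) / m ** exponent

def grad_wrt_output_weight(params, x, i, exponent):
    """Derivative of the output with respect to a_i under the chosen normalization."""
    _, w = params[i]
    return math.tanh(w * x) / len(params) ** exponent

m = 1024
# Hypothetical deterministic parameters (a_i, w_i).
params = [(1.0 if i % 3 else -1.0, 0.01 * i) for i in range(m)]
x = 0.7

f_mf = two_layer(params, x, 1.0)
f_ntk = two_layer(params, x, 0.5)
g_mf = grad_wrt_output_weight(params, x, 5, 1.0)
g_ntk = grad_wrt_output_weight(params, x, 5, 0.5)
```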

The same framework also identifies an “intermediate” lazy limit that is neither NTK nor mean-field and shows a depth-dependent optimizer effect: for $\alpha_C=\frac{N_L}{N_1},\qquad \alpha_{W,1}=1,\qquad \alpha_{W,\ell}=\frac{N_\ell}{N_1}\quad (\ell=2,\dots,L).$ 3, plain gradient descent with MF-type hyper-scaling has no non-trivial discrete-time mean-field limit, while RMSProp does. Under RMSProp, normalization removes the layer-wise width dependence, all layers move $\alpha_C=\frac{N_L}{N_1},\qquad \alpha_{W,1}=1,\qquad \alpha_{W,\ell}=\frac{N_\ell}{N_1}\quad (\ell=2,\dots,L).$ 4, and one recovers a non-trivial discrete-time mean-field limit for any depth (Golikov, 2020). This is a precise statement about discrete-time scaling rather than a universal impossibility result for deep mean-field analysis.
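The role of normalization can be seen in a one-parameter example. The sketch below implements the standard RMSProp update, not the specific hyper-scaling analysed by Golikov, with `eps` set to zero so that the scale invariance is exact; practical implementations use a small positive `eps`, and a zero gradient would divide by zero here. Shrinking the gradient by a width factor shrinks the plain gradient step by the same factor but leaves the RMSProp step unchanged.

```python
import math

def rmsprop_step(param, grad, state, lr=0.01, beta=0.9, eps=0.0):
    """One standard RMSProp update; `state` is the running mean of squared gradients."""
    state = beta * state + (1.0 - beta) * grad * grad
    param = param - lr * grad / (math.sqrt(state) + eps)
    return param, state

def sgd_step(param, grad, lr=0.01):
    return param - lr * grad

grad = 0.8
width = 1000  # hypothetical width factor that shrinks the gradient of one layer

rms_full, _ = rmsprop_step(0.0, grad, 0.0)
rms_scaled, _ = rmsprop_step(0.0, grad / width, 0.0)
sgd_full = sgd_step(0.0, grad)
sgd_scaled = sgd_step(0.0, grad / width)
```

After one step, `rms_full` and `rms_scaled` coincide up to rounding, while `sgd_scaled` is smaller than `sgd_full` by the factor `width`.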

The $\mu$P literature extends the scope of mean-field analysis beyond classical $m^{-1}$ parametrizations. For noisy gradient descent with entropic regularization in wide two-layer networks under $\mu$P, Acciaio, Heiss, Pammer, and Yan formulate the dynamics as a Fokker–Planck PDE

$\alpha_C=\frac{N_L}{N_1},\qquad \alpha_{W,1}=1,\qquad \alpha_{W,\ell}=\frac{N_\ell}{N_1}\quad (\ell=2,\dots,L).$ 8

prove global existence and uniqueness in a maximal weighted-moment class $\alpha_C=\frac{N_L}{N_1},\qquad \alpha_{W,1}=1,\qquad \alpha_{W,\ell}=\frac{N_\ell}{N_1}\quad (\ell=2,\dots,L).$ 9, obtain a uniform-in-time squared-Wasserstein propagation rate $\rho_k^m=\frac1m\sum_{i=1}^m \delta_{\theta_i^{(k)}}.$ 00, characterize identifiability modulo finite-rank realization symmetry, and derive a sparse-dictionary decomposition of the long-time limit under a Barron–Hermite target condition (Xodarev, 23 May 2026).

A complementary control-theoretic extension interprets the mean-field gradient flow of a two-layer network as a McKean–Vlasov stochastic-control problem. The measure-valued continuity equation

$\rho_k^m=\frac1m\sum_{i=1}^m \delta_{\theta_i^{(k)}}.$ 01

is linked to a Hamilton–Jacobi–Bellman equation on $\rho_k^m=\frac1m\sum_{i=1}^m \delta_{\theta_i^{(k)}}.$ 02 and a Dynamic Programming Principle. This yields a Finsler-type metric $\rho_k^m=\frac1m\sum_{i=1}^m \delta_{\theta_i^{(k)}}.$ 03 on probability measures and the variational characterization

$\rho_k^m=\frac1m\sum_{i=1}^m \delta_{\theta_i^{(k)}}.$ 04

for early stopping, up to a controllable error. The long-time limit selects, among global minimizers, one of minimal $\rho_k^m=\frac1m\sum_{i=1}^m \delta_{\theta_i^{(k)}}.$ 05 (Acciaio et al., 21 Mar 2026).

Mean-field analysis has also been adapted to particle-based optimizers other than gradient descent. For two-layer networks trained by consensus-based optimization, the network-width limit lifts parameters to $\rho_k^m=\frac1m\sum_{i=1}^m \delta_{\theta_i^{(k)}}.$ 06, the particle-ensemble limit lifts the ensemble to $\rho_k^m=\frac1m\sum_{i=1}^m \delta_{\theta_i^{(k)}}.$ 07, and the resulting dynamics become a gradient flow on the Wasserstein-over-Wasserstein space. In this setting the population variance contracts as

$\rho_k^m=\frac1m\sum_{i=1}^m \delta_{\theta_i^{(k)}}.$ 08

under the stated barycenter and absolute-continuity assumptions (Deyn et al., 26 Nov 2025).
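For orientation, a generic consensus-based optimization step can be written for scalar particles. This is a simplified sketch: the noise term is dropped, the particles are plain real numbers rather than measures, and the objective and constants are hypothetical, so it does not reproduce the Wasserstein-over-Wasserstein setting of the cited work. In this simplified step every particle relaxes toward a Gibbs-weighted consensus point, and the variance shrinks by the factor $(1-\lambda h)^2$ per step, where $h$ is the step size.

```python
import math

def consensus_point(particles, objective, alpha):
    """Gibbs-weighted barycenter with weights exp(-alpha * objective(x_i))."""
    values = [objective(x) for x in particles]
    shift = min(values)  # subtract the minimum for numerical stability
    weights = [math.exp(-alpha * (v - shift)) for v in values]
    return sum(w * x for w, x in zip(weights, particles)) / sum(weights)

def cbo_drift_step(particles, objective, alpha=10.0, lam=1.0, step=0.1):
    """Noise-free CBO step: x_i <- x_i - lam * step * (x_i - consensus)."""
    c = consensus_point(particles, objective, alpha)
    return [x - lam * step * (x - c) for x in particles]

def variance(particles):
    mean = sum(particles) / len(particles)
    return sum((x - mean) ** 2 for x in particles) / len(particles)

def objective(x):
    return (x - 1.0) ** 2  # hypothetical objective with minimizer at 1

particles = [-2.0, -0.5, 0.3, 1.4, 3.0]
var_before = variance(particles)
particles = cbo_drift_step(particles, objective)
var_after = variance(particles)
```

With `lam=1.0` and `step=0.1`, `var_after` equals `0.81 * var_before` up to rounding, independently of where the consensus point lies.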

6. Broader meanings in neural-network research

The phrase “mean-field limit of neural networks” is not confined to supervised training of artificial feedforward models. In statistical mechanics and theoretical neuroscience, it refers to several neighboring but non-identical asymptotic programs.

For Hopfield-like and related rate networks, mean-field limits describe thermodynamic or stochastic population equations. One recent thermodynamic result proves that a measure-concentration assumption on order parameters suffices for existence of the asymptotic free energy of the Hopfield model and recovers the replica-symmetric free-energy formula through a decomposition into hard and soft spin-glass free energies (Agliari et al., 2024). A separate universality result establishes the mean-field equations for large networks of Hopfield-like neurons on $\rho_k^m=\frac1m\sum_{i=1}^m \delta_{\theta_i^{(k)}}.$ 09 without assuming i.i.d. zero-mean Gaussian synaptic weights; the limit is stochastic and characterized by a mean function $\rho_k^m=\frac1m\sum_{i=1}^m \delta_{\theta_i^{(k)}}.$ 10 and a correlation function $\rho_k^m=\frac1m\sum_{i=1}^m \delta_{\theta_i^{(k)}}.$ 11, with effective noise given through a Volterra equation (Faugeras et al., 2024). For correlated Gaussian synaptic weights, Faugeras, Maclaurin, and Tanré prove an annealed large deviation principle and identify a unique Gaussian, non-Markovian minimizer, leading to an infinite countable family of linear non-Markovian SDEs in the limit (1901.10248).

For spiking or integrate-and-fire networks, mean-field limits are often PDE limits for empirical measures rather than parameter distributions. In dense stochastic integrate-and-fire networks with arbitrary synaptic weights satisfying $\rho_k^m=\frac1m\sum_{i=1}^m \delta_{\theta_i^{(k)}}.$ 12 scaling, the empirical measure converges, up to subsequence, to a spatially extended PDE indexed by a graphon variable $\rho_k^m=\frac1m\sum_{i=1}^m \delta_{\theta_i^{(k)}}.$ 13: $\rho_k^m=\frac1m\sum_{i=1}^m \delta_{\theta_i^{(k)}}.$ 14 with $\rho_k^m=\frac1m\sum_{i=1}^m \delta_{\theta_i^{(k)}}.$ 15 determined by a graphon kernel $\rho_k^m=\frac1m\sum_{i=1}^m \delta_{\theta_i^{(k)}}.$ 16 (Jabin et al., 2024). For sparse integrate-and-fire networks, a tree-indexed extension of the BBGKY hierarchy yields convergence of one-particle observables to a non-exchangeable Vlasov equation under generalized mean-field scaling and non-vanishing diffusion (Jabin et al., 2023). Replica-mean-field theory for intensity-based spiking networks takes a different route: instead of letting interactions vanish as $\rho_k^m=\frac1m\sum_{i=1}^m \delta_{\theta_i^{(k)}}.$ 17, it considers infinitely many interacting replicas of a fixed finite network and derives stationary ODEs and self-consistency equations via the Poisson Hypothesis, preserving finite-size effects such as saturation and sparse-induced metastability (Baccelli et al., 2019).

Even in rate-based noisy networks, the term can mean a macroscopic reduction of neuronal activity rather than a parameter-space transport equation. For a random network of noisy rate neurons on an Erdős–Rényi graph, a second-order stochastic mean-field model for the mean rate and the variance distinguishes the effects of external and internal noise: in the thermodynamic limit, external noise reshapes the deterministic mean-field vector field, while internal noise affects only the variance equation (Klinshov et al., 2015).

These adjacent literatures show that “mean-field” is a family resemblance term. In machine learning, it usually denotes infinite-width training dynamics in parameter or function space; in statistical mechanics and neuroscience, it often denotes thermodynamic limits of interacting neuronal states, empirical membrane-potential laws, or free-energy formulas. The shared structure is passage from many-body randomness to a deterministic or self-consistent continuum description, but the state variables, limiting equations, and proof techniques differ substantially.