
Mean-Field Dynamics of Transformers

Updated 3 December 2025
  • Mean-field dynamics of transformers is a framework that models token interactions as a system of particles evolving on the unit sphere via non-linear PDEs.
  • It simplifies the self-attention mechanism using an interacting-particle formulation and gradient flows to reveal synchronization, clustering, and metastable states.
  • This approach provides practical insights into how normalization, kernel choices, and hyperparameters drive phase transitions and influence representation collapse in deep architectures.

Mean-field dynamics of transformers refers to the mathematical framework that interprets the evolution of token representations in (deep) transformer architectures as a system of interacting particles, and analyzes their behavior in the large-token limit (often combined with a continuous-depth limit), typically via non-linear partial differential equations on the sphere. This approach connects the discrete self-attention mechanism with continuum, measure-valued flows, revealing phenomena such as synchronization (clustering), metastability, explicit contraction rates, normalization-induced phase transitions, and the precise mechanisms that drive either representation collapse or persistent multi-modal structure as depth increases (Rigollet, 1 Dec 2025).

1. Interacting-Particle Formulation and Mean-Field Limit

Transformer self-attention, after layer normalization, can be reduced to a time evolution of $n$ token embeddings $x_i(t) \in S^{d-1} \subset \mathbb{R}^d$ (the unit sphere). The scaled dot-product self-attention is interpreted as a particle system with pairwise (possibly non-symmetric) interaction via the kernel $K(x,y) = e^{\beta \langle x, y \rangle}$, parameterized by the inverse temperature $\beta$:

$$(A_\beta)_{ij} = \frac{\exp(\beta \langle x_i, x_j \rangle)}{\sum_{k=1}^n \exp(\beta \langle x_i, x_k \rangle)}$$

$$x_i \mapsto \sum_{j=1}^n (A_\beta)_{ij}\, x_j$$
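In matrix form this is a row-wise softmax followed by a re-projection onto the sphere. A minimal NumPy sketch of one such layer, assuming layer normalization is modeled by renormalizing each token to unit length (function and variable names are illustrative):

```python
import numpy as np

def attention_layer(X, beta):
    """One post-normalization self-attention step on the sphere.

    X    : (n, d) array of unit-norm token embeddings.
    beta : inverse temperature (attention sharpness).
    Returns the updated (n, d) array, re-projected onto S^{d-1}.
    """
    logits = beta * X @ X.T                        # beta * <x_i, x_j>
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    A = np.exp(logits)
    A /= A.sum(axis=1, keepdims=True)              # softmax rows: (A_beta)_ij
    Y = A @ X                                      # x_i -> sum_j (A_beta)_ij x_j
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)  # back to the sphere

# toy usage: n = 8 tokens on S^2
rng = np.random.default_rng(0)
X = rng.standard_normal((8, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
X = attention_layer(X, beta=4.0)
```

Stacking many such layers is the discrete precursor of the continuous-time flow below.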

Transition to continuous depth (ODE limit) yields the self-attention flow:

$$\dot x_i(t) = P_{x_i(t)}\left(\frac{1}{Z_i(t)} \sum_{j=1}^n e^{\beta \langle x_i(t), x_j(t) \rangle}\, x_j(t)\right), \qquad Z_i(t) = \sum_{k=1}^n e^{\beta \langle x_i(t), x_k(t) \rangle}$$

where $P_x(y) = y - \langle x, y \rangle x$ projects onto the tangent space of the sphere at $x$. An unnormalized variant drops the softmax denominator:

$$\dot x_i(t) = P_{x_i(t)}\left(\frac{1}{n} \sum_{j=1}^n e^{\beta \langle x_i(t), x_j(t) \rangle}\, x_j(t)\right)$$
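Both variants can be integrated with an explicit Euler scheme, renormalizing after each step to counteract drift off the sphere. A small self-contained sketch (step size, seed, and the name flow_step are arbitrary choices):

```python
import numpy as np

def flow_step(X, beta, dt, normalized=True):
    """One explicit Euler step of the self-attention flow on the sphere."""
    n = X.shape[0]
    W = np.exp(beta * X @ X.T)                      # e^{beta <x_i, x_j>}
    if normalized:
        V = (W / W.sum(axis=1, keepdims=True)) @ X  # softmax-weighted mean
    else:
        V = (W @ X) / n                             # unnormalized variant
    V -= np.sum(V * X, axis=1, keepdims=True) * X   # P_x(v) = v - <x, v> x
    X = X + dt * V
    return X / np.linalg.norm(X, axis=1, keepdims=True)  # re-project to S^{d-1}

rng = np.random.default_rng(1)
X = rng.standard_normal((64, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
for _ in range(2000):
    X = flow_step(X, beta=2.0, dt=1e-2)
print(np.ptp(X @ X.T))   # spread of inner products; -> 0 at synchronization
```

For moderate $\beta$ the spread of pairwise inner products shrinks toward zero, the discrete signature of the synchronization discussed in Section 3.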

In the limit $n \to \infty$, the empirical measure $\mu^n_t = \frac{1}{n} \sum_{i=1}^n \delta_{x_i(t)}$ converges (propagation of chaos) to a deterministic $\mu_t$ solving a non-linear continuity (McKean–Vlasov) equation on the sphere:

$$\partial_t \mu_t + \nabla \cdot \left(\mu_t\, v[\mu_t]\right) = 0, \qquad v[\mu_t](x) = P_x\left(\int_{S^{d-1}} e^{\beta\langle x, y \rangle}\, y \, d\mu_t(y)\right)$$

(Rigollet, 1 Dec 2025, Geshkovski et al., 2023)
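The mean-field velocity at a point can be estimated directly from particle samples, which is exactly how the finite-$n$ dynamics approximate the continuum field. A sketch of the Monte Carlo estimator (names are illustrative, not from the cited papers):

```python
import numpy as np

def mean_field_velocity(x, samples, beta):
    """Monte Carlo estimate of v[mu](x): average e^{beta <x,y>} y over
    particles y_j ~ mu, then project onto the tangent space at x."""
    w = np.exp(beta * samples @ x)              # e^{beta <x, y_j>}
    v = (w[:, None] * samples).mean(axis=0)     # empirical integral
    return v - (v @ x) * x                      # tangential projection P_x

rng = np.random.default_rng(2)
Y = rng.standard_normal((10_000, 3))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)   # ~ uniform on S^2
x = np.array([0.0, 0.0, 1.0])
print(mean_field_velocity(x, Y, beta=1.0))      # ~0 for uniform mu, by symmetry
```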

This structure is common to diverse mean-field analyses of self-attention flows, including variants with general value matrices, kernel functions, and normalization (Castin et al., 30 Jan 2025, Burger et al., 6 Jan 2025, Chen et al., 20 Apr 2025).

2. Gradient-Flow Structure and Energy Landscape

For standard (unnormalized) self-attention, the mean-field PDE is a Wasserstein-2 ($W_2$) gradient flow of the interaction energy functional:

$$\mathcal{E}_\beta(\mu) = \frac{1}{2} \iint_{S^{d-1} \times S^{d-1}} e^{\beta \langle x, y \rangle}\, d\mu(x)\, d\mu(y)$$

$$\partial_t \mu_t = \nabla_{W_2} \mathcal{E}_\beta(\mu_t)$$

(Rigollet, 1 Dec 2025, Burger et al., 6 Jan 2025)
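Since the flow ascends $\mathcal{E}_\beta$, the empirical energy $\frac{1}{2n^2}\sum_{i,j} e^{\beta\langle x_i, x_j\rangle}$ should be non-decreasing along trajectories, which gives a cheap sanity check on any simulation. A sketch under the same Euler discretization as above (an overly large step can slightly violate monotonicity, hence the tolerance):

```python
import numpy as np

def energy(X, beta):
    """Discrete interaction energy E_beta(mu_n) = (1/2n^2) sum_ij e^{beta <x_i,x_j>}."""
    return np.exp(beta * X @ X.T).sum() / (2 * X.shape[0] ** 2)

def flow_step(X, beta, dt):
    W = np.exp(beta * X @ X.T)
    V = (W @ X) / X.shape[0]                          # unnormalized attention field
    V -= np.sum(V * X, axis=1, keepdims=True) * X     # tangential projection
    X = X + dt * V
    return X / np.linalg.norm(X, axis=1, keepdims=True)

rng = np.random.default_rng(3)
X = rng.standard_normal((32, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
E = [energy(X, 2.0)]
for _ in range(500):
    X = flow_step(X, 2.0, 1e-2)
    E.append(energy(X, 2.0))
assert all(b >= a - 1e-9 for a, b in zip(E, E[1:]))   # ascent: E_beta increases
```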

In the presence of RMS or layer normalization, the dynamics are further restricted to the sphere, and the gradient flow takes place within the corresponding metric structure (Burger et al., 6 Jan 2025). The interaction energy $\mathcal{E}_\beta$ is typically non-convex for generic $Q$, $K$, and $V$; its minima and saddles correspond to stationary states of the token distribution, e.g., uniform spread or clustered/multimodal distributions (Burger et al., 6 Jan 2025, Castin et al., 30 Jan 2025).

The choice of score matrix $D = QK^T$ (possibly symmetric) governs the shape of the energy landscape: isotropic $D$ favors uniform distributions, while $D$ with a dominant negative eigenvalue favors clustering at the corresponding eigenvectors (Burger et al., 6 Jan 2025).

3. Clustering, Synchronization, and Metastability

A key result is almost-sure global synchronization for dimension $d \geq 3$: almost every initial configuration is asymptotically attracted to a single synchronized (clustered) state, with $\|x_i(t) - x_j(t)\| \to 0$ for all $i,j$ (Rigollet, 1 Dec 2025, Chen et al., 20 Apr 2025, Geshkovski et al., 2023). The analysis leverages a Łojasiewicz argument for analytic energies to establish convergence to critical points, together with a linear stability analysis showing that all non-clustered configurations are unstable saddles.

For large $\beta$, the mean-field energy admits a family of "$k$-cluster" saddle states, which can trap the dynamics for exponentially long metastable periods. The typical dynamic is multistage (Bruno et al., 30 Oct 2024, Bruno et al., 29 Sep 2025); a cluster-counting diagnostic is sketched after the list:

  • Initial compressive (alignment) phase: Fast contraction onto a low-dimensional subspace.
  • Metastable, multi-cluster phase: Tokens organize into $k$ well-separated clusters; each subcluster collapses rapidly, followed by a slow drift along a metastable manifold (parametrized by Gegenbauer modes in $d$ dimensions).
  • Final collapse: Clusters merge sequentially in abrupt saddle-to-saddle transitions, until all tokens collapse to a single cluster.
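To observe these stages numerically, one needs a way to count clusters in a token configuration. A simple diagnostic (an assumption for illustration, not taken from the cited papers) is to threshold pairwise distances and count connected components:

```python
import numpy as np

def count_clusters(X, tol=0.1):
    """Number of token clusters: connected components of the graph
    linking tokens whose pairwise distance is below tol."""
    n = X.shape[0]
    adj = np.linalg.norm(X[:, None] - X[None, :], axis=-1) < tol
    seen, k = np.zeros(n, bool), 0
    for i in range(n):
        if not seen[i]:
            k += 1
            stack = [i]
            while stack:                     # flood-fill one component
                j = stack.pop()
                if not seen[j]:
                    seen[j] = True
                    stack.extend(np.flatnonzero(adj[j] & ~seen))
    return k

# three tight clusters on S^2 -> count_clusters returns 3
rng = np.random.default_rng(4)
centers = rng.standard_normal((3, 3))
centers /= np.linalg.norm(centers, axis=1, keepdims=True)
X = np.repeat(centers, 10, axis=0) + 0.01 * rng.standard_normal((30, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
print(count_clusters(X))   # 3
```

Applied along a simulated flow at large $\beta$, this count typically plateaus at some $k > 1$ for a long stretch before dropping, which is the metastability described above.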

Explicit ODEs describe the rate of inner product contraction in symmetric ("equiangular") settings:

$$\dot{\rho} = \frac{2}{n}\, e^{\beta \rho} (1 - \rho) \left((n-1)\rho + 1\right)$$

with $1 - \rho(t) \approx \exp(-2 e^\beta t)$ as $t \to \infty$ in the unnormalized model (Rigollet, 1 Dec 2025).
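The scalar ODE is easy to integrate directly, and the late-time decay rate can be read off numerically; a minimal sketch (parameters are arbitrary):

```python
import numpy as np

# Euler integration of  rho' = (2/n) e^{beta rho} (1 - rho) ((n-1) rho + 1)
# for the equiangular configuration <x_i, x_j> = rho for all i != j.
n, beta, dt = 32, 3.0, 1e-4
rho, t, traj = 0.0, 0.0, []
while rho < 1 - 1e-9 and t < 10.0:
    rho += dt * (2 / n) * np.exp(beta * rho) * (1 - rho) * ((n - 1) * rho + 1)
    t += dt
    traj.append((t, rho))

# late-time check: log(1 - rho) should decay at rate 2 e^beta
(t1, r1), (t2, r2) = traj[-2000], traj[-1]
print((np.log(1 - r1) - np.log(1 - r2)) / (t2 - t1), 2 * np.exp(beta))
```

The two printed numbers agree, matching the $1 - \rho(t) \approx \exp(-2 e^\beta t)$ asymptotic above.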

4. Quantitative Rates, Phase Transitions, and Normalization Effects

Quantitative convergence rates to consensus can be established using local Polyak–Łojasiewicz (PL) inequalities (Chen et al., 20 Apr 2025). For suitably regular initial data, $W_2(\mu_t, \delta_{x_\infty})$ contracts exponentially, with explicit rate constants depending on $\beta$ and the initial position (cap-supported vs. general). There are regimes (small $\beta$, or suitably regular initializations) with uniform explicit exponential rates, but for large $\beta$ and certain "spread-out" densities, synchronization can fail or slow down (there exist non-synchronizing $L^2$-densities).
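Against a Dirac target the $W_2$ distance has a closed form, $W_2(\mu, \delta_x)^2 = \int \|y - x\|^2\, d\mu(y)$, since all mass must be transported to $x$; for an empirical measure this is just a root-mean-square distance. A sketch of tracking it along the normalized flow (the consensus point is estimated from the final state; all parameters are illustrative):

```python
import numpy as np

def flow_step(X, beta, dt):
    """Euler step of the normalized self-attention flow on the sphere."""
    W = np.exp(beta * X @ X.T)
    V = (W / W.sum(axis=1, keepdims=True)) @ X
    V -= np.sum(V * X, axis=1, keepdims=True) * X
    X = X + dt * V
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def w2_to_dirac(X, x):
    """W_2 between an empirical measure and delta_x: an RMS distance."""
    return np.sqrt(np.mean(np.sum((X - x) ** 2, axis=1)))

rng = np.random.default_rng(5)
X = rng.standard_normal((64, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
for _ in range(20_000):
    X = flow_step(X, beta=1.0, dt=1e-2)
x_inf = X.mean(axis=0)
x_inf /= np.linalg.norm(x_inf)          # estimated consensus direction
print(w2_to_dirac(X, x_inf))            # ~0 after synchronization
```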

Normalization schemes, such as layer normalization, RMS normalization, or scaling, affect the geometry and contraction rates (Rigollet, 1 Dec 2025, Burger et al., 6 Jan 2025). In particular, normalization introduces phase transitions in long-sequence attention: for normalization strengths or attention sharpness (large $\beta$) above certain thresholds, the contraction rate and metastable regime structure change, governing whether expressive multi-cluster states persist or rapidly collapse.

A phase transition is also evident in the number of clusters emerging from uniform initialization: the dominant mode $\ell_\beta$ (the index maximizing a function of Gegenbauer/Bessel coefficients) determines the number of clusters that nucleate and persist prior to total synchronization (Bruno et al., 30 Oct 2024):

$$\ell_\beta = \arg\max_{\ell \geq 1} \left\{ \ell(\ell+d-2)\, \hat W_\ell \right\}$$

with $W(q) = \beta^{-1} e^{\beta q}$ and, in $d=2$, $\hat W_\ell = \beta^{-1} I_\ell(\beta)$, where $I_\ell$ denotes the modified Bessel function of the first kind.
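In $d = 2$ this selection rule can be evaluated directly with SciPy's modified Bessel function; a sketch (the truncation ell_max is an arbitrary cutoff):

```python
import numpy as np
from scipy.special import iv   # modified Bessel function I_ell

def dominant_mode(beta, d=2, ell_max=200):
    """argmax over ell >= 1 of ell (ell + d - 2) * W_hat_ell in d = 2,
    with W_hat_ell = iv(ell, beta) / beta  (Bruno et al., 30 Oct 2024)."""
    ells = np.arange(1, ell_max + 1)
    score = ells * (ells + d - 2) * iv(ells, beta) / beta
    return ells[np.argmax(score)]

for beta in (1.0, 10.0, 50.0, 100.0):
    print(beta, dominant_mode(beta))   # predicted number of nucleating clusters
```

As $\beta$ grows, the selected mode (and hence the predicted cluster count) increases, consistent with sharper attention nucleating more clusters.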

5. Multiscale Analysis and Practical Implications

In the moderate interaction regime ($N$ tokens, $\beta = \beta(N) \to \infty$ slowly), transformer mean-field dynamics exhibit three scale-separated stages (Bruno et al., 29 Sep 2025):

  1. Alignment phase: Transport rapidly concentrates the distribution along the top eigenspace of $VK^TQ$.
  2. Heat phase: Diffusive (or smoothing) dynamics on the aligned manifold, modeled as a forward/backward heat equation, can further cluster or spread tokens, depending on the sign of the induced "diffusion."
  3. Pairing phase: Residual clusters merge one-by-one through exponentially slow pairwise interactions.

These phases are confirmed both by theoretical estimates and numerical experiments, with fine control on the time scales (alignment: $O(1)$, heat: $O(1/\beta)$, pairing: $O(e^{c\beta})$).
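The scale separation can be observed qualitatively in simulation by logging a cluster count along the flow: alignment and intra-cluster collapse happen quickly, while the final pairwise merges take far longer as $\beta$ grows. A self-contained sketch (thresholds, step sizes, and run length are illustrative; at large $\beta$ the last merges may not occur within the budgeted steps, which itself illustrates the exponentially slow pairing phase):

```python
import numpy as np

def flow_step(X, beta, dt):
    W = np.exp(beta * X @ X.T)
    V = (W / W.sum(axis=1, keepdims=True)) @ X
    V -= np.sum(V * X, axis=1, keepdims=True) * X    # tangential projection
    X = X + dt * V
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def count_clusters(X, tol=0.2):
    adj = np.linalg.norm(X[:, None] - X[None, :], axis=-1) < tol
    seen, k = np.zeros(len(X), bool), 0
    for i in range(len(X)):
        if not seen[i]:
            k += 1
            stack = [i]
            while stack:
                j = stack.pop()
                if not seen[j]:
                    seen[j] = True
                    stack.extend(np.flatnonzero(adj[j] & ~seen))
    return k

rng = np.random.default_rng(6)
X = rng.standard_normal((128, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
for step in range(30_001):
    if step % 3000 == 0:
        print(step, count_clusters(X))   # plateaus at k > 1 before merging
    X = flow_step(X, beta=9.0, dt=1e-2)
```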

The persistence of multi-cluster metastable states is critical to in-context learning and next-token prediction: each cluster corresponds to a set of coherent hypotheses maintained by the self-attention mechanism, with cluster weights interpreted as attention mass assigned to these hypotheses (Bruno et al., 30 Oct 2024, Rigollet, 1 Dec 2025).

Hyperparameters such as the dimension $d$, inverse temperature $\beta$, normalization strength, and context (sequence) length directly modulate the energy landscape, the number of metastable clusters, and the rate of over-smoothing (final collapse).

6. Connections, Generalizations, and Future Directions

The mean-field formalism connects transformer dynamics to synchronization models (Kuramoto, mean-shift clustering), aggregation equations, and Wasserstein gradient flows (Rigollet, 1 Dec 2025, Geshkovski et al., 2023, Burger et al., 6 Jan 2025). The theory rigorously predicts the phenomenon of representation collapse ("over-smoothing") in deep transformers and identifies design parameters (kernel spectrum, normalization, attention sharpness) that govern the phase behavior.

Extensions cover multi-head attention, generalized kernels ($L^2$, Sinkhorn, entropic OT), and architectures with feed-forward blocks and nonlinearities, each retaining a mean-field gradient-flow or Vlasov-type PDE structure (Castin et al., 30 Jan 2025). Analysis of stationary points, energy minimizers, and spectral design identifies directions to avoid unwanted collapse (e.g., enforcing near-isotropic kernels) or to maintain expressive, multi-modal representations (Burger et al., 6 Jan 2025, Castin et al., 30 Jan 2025).

Explicit approximation of mean-field vector fields by finite transformers is achievable with provable error bounds, linking the continuum limit theory to practical finite-model design (Biswal et al., 6 Oct 2024).

Remaining open problems include uniform-in-$n$ convergence rates for particle approximations, detailed dynamical analyses of multi-head and causal (masked) attention, the non-convex geometry of energy landscapes, invariant-manifold structures near saddle regions, and extensions to time-varying weights.

References:

  • Rigollet, 1 Dec 2025
  • Bruno et al., 30 Oct 2024
  • Chen et al., 20 Apr 2025
  • Geshkovski et al., 2023
  • Bruno et al., 29 Sep 2025
  • Castin et al., 30 Jan 2025
  • Burger et al., 6 Jan 2025
  • Biswal et al., 6 Oct 2024
