Training Infinitely Deep and Wide Transformers

Published 17 May 2026 in math.OC, cs.AI, cs.LG, and stat.ML | (2605.17660v1)

Abstract: Transformers have become the dominant architecture in modern machine learning, yet the theoretical understanding of their training dynamics remains limited. This paper develops a rigorous mathematical framework for analyzing gradient-based training of transformers in the mean-field regime, where both the depth (number of layers) and width (number of attention heads) tend to infinity. While ResNet training can be understood as controlling a neural ODE, transformer training corresponds to controlling a neural PDE, due to the coupling of multiple token distributions through the attention mechanism. Our mean-field model features two types of measure representations: token distributions evolving through layers and attention parameters at each layer. We establish well-posedness of the forward pass through infinitely deep transformers, characterizing token evolution via flow maps that satisfy ODEs in function spaces. Using adjoint sensitivity analysis, we derive an explicit formula for the conditional Wasserstein gradient of the training risk, involving adjoint variables governed by backward ODEs. We prove the existence and uniqueness of gradient flow curves in the conditional Wasserstein metric space, establishing a rigorous foundation for gradient-based transformer training. A key technical contribution is providing necessary and sufficient conditions for injectivity of the Neural Tangent Kernel (NTK) for attention mechanisms: we show that NTK injectivity is equivalent to linear independence of log-sum-exp functions modulo affine functions, a condition satisfied by diverse token distributions, including discrete distributions, uniform distributions, and Gaussian mixtures. Under this NTK injectivity assumption, we prove that gradient flow converges to global minima when the initial loss is sufficiently small, eliminating spurious local minima from the optimization landscape.

Authors (4)

Summary

The paper presents a novel theoretical framework that extends mean-field training analysis to infinitely deep and wide transformers by formulating token evolution as a neural PDE.
It rigorously establishes the existence, uniqueness, and well-posedness of gradient flows using a conditional optimal transport metric with Wasserstein geometry.
Under a positive-definiteness condition on the NTK, the risk satisfies a local PL inequality, which guarantees convergence to a global minimum and rules out spurious local minima in the regime where the initial loss is sufficiently small.

Training Infinitely Deep and Wide Transformers: A Technical Perspective

Introduction and Motivation

The transformer architecture has become central in deep learning across modalities, yet a rigorous theoretical understanding of its training dynamics—especially in regimes featuring extreme depth (number of layers) and width (number of attention heads)—has remained elusive. This paper, "Training Infinitely Deep and Wide Transformers" (2605.17660), establishes a rigorous mathematical framework for analyzing transformer training when both the number of layers and the number of attention heads tend to infinity. The authors extend mean-field training theory, previously developed for ResNets, to transformer architectures, revealing fundamental differences between neural ODE regimes (for ResNets) and neural PDE formulations (for transformers) stemming from the non-locality of attention mechanisms.

Mean-Field Regimes and the Neural PDE Formulation

Infinite Width and Depth Limits

A critical distinction highlighted is the joint mean-field regime, where both width and depth diverge. The mean-field representation of the transformer involves the parameterization of each attention layer via probability measures over the space of attention parameters (i.e., queries, keys, values), and simultaneously treats token evolution through the network as measures over the token space. Unlike the independent evolution in ResNets (modeled as neural ODEs), the transformer's forward pass induces a neural PDE, coupling multiple distributions through the softmax attention operator. This requires handling push-forwards of distributions under flow maps in infinite-dimensional function spaces.
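
To make the coupling concrete, one way to write the model is sketched below. The notation is an assumption for illustration and is not copied from the paper: μ_s denotes the token distribution at depth s, ρ_s the distribution of head parameters θ = (Q, q, V) at depth s (keys fixed to the identity and a query bias q, following the parameterization described later in this summary), and A_ρ the resulting attention velocity field.

```latex
% Single-head attention velocity (assumed form; identity keys, query bias q)
A_\theta[\mu](x) =
  \frac{\int V y \, e^{(Qx + q)\cdot y}\, d\mu(y)}{\int e^{(Qx + q)\cdot y}\, d\mu(y)},
\qquad \theta = (Q, q, V).

% Infinite width: average over the head distribution at depth s
A_{\rho_s}[\mu](x) = \int A_\theta[\mu](x)\, d\rho_s(\theta).

% Infinite depth: tokens are transported by a non-local continuity equation
\partial_s \mu_s + \nabla \cdot \big( \mu_s \, A_{\rho_s}[\mu_s] \big) = 0,
\qquad \mu_0 = \mu.
```

In this sketch the velocity at x depends on the whole distribution μ_s, which is what makes the equation a non-local PDE. In a ResNet-style model the velocity at x would depend only on x and the parameters, so each input would follow its own ODE.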

Conditional Optimal Transport and Wasserstein Geometry

The training dynamics in the mean-field regime are understood as a gradient flow on the metric space of parameter distributions, endowed with a conditional optimal transport (COT) metric—a layerwise Wasserstein-2 distance. This geometry naturally encodes the preservation of layerwise marginals and enables a rigorous variational analysis of training.
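
A minimal finite sketch of a layerwise Wasserstein-2 distance, assuming L layers with the same number H of heads per layer and uniform weights on heads and layers. The function names, the brute-force matching, and the normalization are illustrative assumptions, not the paper's definition.

```python
from itertools import permutations

import numpy as np


def w2_squared(a, b):
    """Squared W2 between two uniform empirical measures with equally many atoms.

    a, b: arrays of shape (H, p). Brute force over all matchings, so keep H small.
    """
    H = a.shape[0]
    cost = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)  # (H, H) squared distances
    best = min(
        sum(cost[i, perm[i]] for i in range(H)) for perm in permutations(range(H))
    )
    return best / H


def cot_distance(rho, rho_prime):
    """Square root of the depth-average of per-layer squared W2 distances.

    rho, rho_prime: arrays of shape (L, H, p). Heads are only matched within
    the same layer, never across layers.
    """
    per_layer = [w2_squared(a, b) for a, b in zip(rho, rho_prime)]
    return float(np.sqrt(np.mean(per_layer)))


rng = np.random.default_rng(0)
L, H, p = 3, 4, 2
rho = rng.normal(size=(L, H, p))
shuffled = np.stack([layer[rng.permutation(H)] for layer in rho])  # relabel heads per layer
shifted = rho + 1.0                                                # translate every head

print(cot_distance(rho, shuffled))   # relabeling heads inside a layer costs nothing
print(cot_distance(rho, shifted))    # a translation by (1, 1) costs about sqrt(2)
print(cot_distance(rho, rho[::-1]))  # swapping layers is not free
```

The last line illustrates the point of the conditional construction: mass may be rearranged among heads of the same layer, but parameters belonging to different layers are never matched with each other.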

Mathematical Results

Well-Posedness of the Forward Process

The authors establish existence and uniqueness of solutions for the forward process (PDE) describing token evolution in infinitely deep transformers. They prove that, for well-behaved parameter distributions, the flow map advecting tokens through the network can be constructed as the solution to a Banach-space-valued ODE. This generalizes neural ODE results for ResNets to the setting where the evolution is genuinely non-linear and non-local due to attention.
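
A finite-particle sketch of such a forward process, assuming n tokens, H heads per layer, explicit Euler steps of size 1/L in depth, identity keys, and a query bias. The function names and the discretization are illustrative assumptions, not the paper's construction.

```python
import numpy as np


def attention_velocity(X, Q, q, V):
    """Average over heads of the softmax-attention velocity of every token.

    X: (n, d) tokens; Q: (H, d, d); q: (H, d); V: (H, d, d). Keys are the identity.
    """
    queries = np.einsum("hij,nj->hni", Q, X) + q[:, None, :]  # (H, n, d)
    scores = np.einsum("hni,mi->hnm", queries, X)             # (H, n, n)
    scores -= scores.max(axis=-1, keepdims=True)              # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    values = np.einsum("hij,mj->hmi", V, X)                   # (H, n, d)
    out = np.einsum("hnm,hmi->hni", weights, values)          # (H, n, d)
    return out.mean(axis=0)                                   # (n, d)


def forward(X0, params):
    """Explicit Euler in depth: one residual step of size 1/L per layer."""
    X = X0.copy()
    L = len(params)
    for Q, q, V in params:
        X = X + attention_velocity(X, Q, q, V) / L
    return X


rng = np.random.default_rng(0)
n, d, H = 5, 2, 8
X0 = rng.normal(size=(n, d))
head = (
    rng.normal(scale=0.5, size=(H, d, d)),  # Q
    rng.normal(scale=0.5, size=(H, d)),     # q
    rng.normal(scale=0.5, size=(H, d, d)),  # V
)


def deep(L):
    """Output of an L-layer stack that reuses the same heads in every layer."""
    return forward(X0, [head] * L)


perm = rng.permutation(n)
# Permuting the input tokens permutes the outputs in the same way.
print(np.allclose(forward(X0[perm], [head] * 10), deep(10)[perm]))
# Refining the depth discretization changes the output less and less.
print(np.linalg.norm(deep(50) - deep(400)), np.linalg.norm(deep(200) - deep(400)))
```

The permutation check reflects why tokens can be treated as a distribution rather than an ordered list, and the depth-refinement check is the discrete counterpart of the limit in which the layer index becomes a continuous variable.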

Gradient Flows and Adjoint Sensitivity

The paper uses adjoint sensitivity analysis to characterize explicitly the conditional Wasserstein gradient of the training risk in the mean-field space. It derives backward ODEs for measure-valued adjoint variables (interpretable as the infinite-depth limit of the gradients computed by backpropagation), which enter the explicit formula for the risk gradient.
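
The adjoint idea can be illustrated on a much simpler model than the one in the paper. The sketch below assumes a toy residual network x_{k+1} = x_k + h tanh(W x_k) with a single shared weight matrix W and a squared-error loss; it is not the attention model, and all names are hypothetical. The backward recursion for p plays the role of the backward ODE, and the result is compared against finite differences.

```python
import numpy as np


def forward(x0, W, L):
    """Euler discretization of dx/ds = tanh(W x) with L layers; returns all states."""
    h = 1.0 / L
    xs = [x0]
    for _ in range(L):
        xs.append(xs[-1] + h * np.tanh(W @ xs[-1]))
    return xs


def loss(x0, W, L, target):
    return 0.5 * np.sum((forward(x0, W, L)[-1] - target) ** 2)


def adjoint_gradient(x0, W, L, target):
    """Gradient of the loss with respect to W via a backward recursion for the adjoint p."""
    h = 1.0 / L
    xs = forward(x0, W, L)
    p = xs[-1] - target                          # terminal condition p_L = dloss/dx_L
    grad = np.zeros_like(W)
    for k in reversed(range(L)):
        g = (1.0 - np.tanh(W @ xs[k]) ** 2) * p  # tanh'(W x_k) times p_{k+1}
        grad += h * np.outer(g, xs[k])           # contribution of layer k to dloss/dW
        p = p + h * (W.T @ g)                    # p_k = p_{k+1} + h J_k^T p_{k+1}
    return grad


def finite_difference_gradient(x0, W, L, target, eps=1e-6):
    """Central finite differences, used only to check the adjoint formula."""
    fd = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            E = np.zeros_like(W)
            E[i, j] = eps
            fd[i, j] = (loss(x0, W + E, L, target) - loss(x0, W - E, L, target)) / (2 * eps)
    return fd


rng = np.random.default_rng(0)
d, L = 3, 20
x0, target = rng.normal(size=d), rng.normal(size=d)
W = rng.normal(scale=0.5, size=(d, d))

grad = adjoint_gradient(x0, W, L, target)
fd = finite_difference_gradient(x0, W, L, target)
print(np.max(np.abs(grad - fd)))  # small discrepancy, limited by finite-difference error
```

One forward sweep and one backward sweep give the full gradient, whereas the finite-difference check needs two forward passes per parameter entry.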

Existence, Uniqueness, and Characterization of Gradient Flow

The authors prove existence and uniqueness of metric gradient flows for transformer training in the COT metric space. They show that solutions of a continuity equation for the parameter distributions are equivalent to curves of maximal slope in the sense of metric gradient flow theory. The resulting evolution-dissipation inequality (EDI) shows that the risk is non-increasing along the flow and quantifies how fast it is dissipated. Together with existence and uniqueness, this makes the gradient flow well-posed.
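
For reference, the inequality that defines a curve of maximal slope in general metric gradient flow theory has the form below. The symbols are assumptions for illustration: R is the training risk, ρ_t the parameter distribution at training time t, |ρ'_t| its metric speed in the COT metric, and |∂R| the metric slope of R. The paper's own statement may differ in notation and detail.

```latex
R(\rho_t)
+ \frac{1}{2}\int_0^t |\rho'_r|^2 \, dr
+ \frac{1}{2}\int_0^t |\partial R|^2(\rho_r)\, dr
\;\le\; R(\rho_0)
\qquad \text{for all } t \ge 0 .
```

Both integrals are non-negative, so the inequality gives R(ρ_t) ≤ R(ρ_0), and the risk can stay constant only while the slope vanishes.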

Optimization Landscape and Convergence Analysis

Polyak-Łojasiewicz (PL) Inequality via Neural Tangent Kernel Analysis

A central technical contribution is the link between the positivity/conditioning of the transformer’s Neural Tangent Kernel (NTK) for attention and the risk landscape. The authors show that, under a strict positive definiteness condition for the NTK (corresponding to the linear independence of log-sum-exp, or cumulant generating, functions modulo affine functions), the risk satisfies a local Polyak-Łojasiewicz inequality. This immediately implies local linear convergence of gradient flow to a global minimizer when the initial loss is sufficiently small and when the NTK remains well-conditioned.
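
The step from a PL inequality to convergence is the standard argument sketched below. The notation is assumed for illustration (R the risk, R_* its infimum, λ > 0 the PL constant, |∂R| the metric slope, |ρ'_t| the metric speed), the normalization of λ is a convention, and each line is valid only while the trajectory stays in the region where the PL inequality holds.

```latex
% Local PL inequality with constant \lambda > 0
|\partial R|^2(\rho) \;\ge\; 2\lambda \,\big( R(\rho) - R_* \big).

% Along the gradient flow the risk dissipates at the rate of the squared slope
\frac{d}{dt} R(\rho_t) = -|\partial R|^2(\rho_t)
  \;\le\; -2\lambda \,\big( R(\rho_t) - R_* \big).

% Gronwall's inequality gives exponential decay of the excess risk
R(\rho_t) - R_* \;\le\; e^{-2\lambda t}\,\big( R(\rho_0) - R_* \big).

% The same inequality bounds the total distance travelled by the flow
\int_0^\infty |\rho'_t| \, dt \;\le\; \sqrt{\tfrac{2}{\lambda}\,\big( R(\rho_0) - R_* \big)} .
```

The last bound suggests why a small initial loss matters: the trajectory then stays within a short distance of its starting point, where the NTK can be expected to remain well-conditioned and the PL inequality to remain valid.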

Key claim: under the positive NTK assumption, gradient flow started from a sufficiently small initial loss converges to a global minimizer, so spurious local minima cannot trap training in this regime. The guarantee is local and does not rule out spurious minima elsewhere in the landscape.

Necessary and Sufficient Conditions for NTK Injectivity

The injectivity of the NTK is characterized via the theory of log-sum-exp function independence. The authors show that NTK injectivity holds generically for broad classes of token distributions, including:

Discrete measures with distinct supports (almost surely satisfied, as proved by a measure-theoretic argument for i.i.d. samples from absolutely continuous measures).
Uniform distributions on cubes with distinct side lengths.
Symmetric multivariate Laplace distributions or two-component Gaussian mixtures with distinct parameters.
Distributions obtained from such examples by convolution with a centered Gaussian; the property is stable under this smoothing.

Negative examples are provided to clarify the tightness of the condition.
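
For discrete token sets, linear independence of log-sum-exp functions modulo affine functions can be probed numerically. The sketch below assumes each token set X defines f_X(z) = log Σ_j exp(z · x_j) and tests whether the functions f_X, together with the affine functions 1, z_1, ..., z_d, are linearly independent on a sample of points z. The helper names and the sampling-based rank test are assumptions for illustration; the paper's precise condition may differ, and a numerical rank test can only suggest independence, not prove it.

```python
import numpy as np


def log_sum_exp(Z, X):
    """f(z) = log sum_j exp(z . x_j) for each row z of Z; X holds the tokens as rows."""
    S = Z @ X.T
    m = S.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(S - m).sum(axis=1, keepdims=True))).ravel()


def independent_modulo_affine(token_sets, Z, tol=1e-8):
    """Numerical rank test on the sample points Z of shape (n_samples, d)."""
    lse_cols = [log_sum_exp(Z, X) for X in token_sets]
    affine_cols = [np.ones(len(Z))] + [Z[:, i] for i in range(Z.shape[1])]
    M = np.column_stack(lse_cols + affine_cols)
    return bool(np.linalg.matrix_rank(M, tol=tol) == M.shape[1])


rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 2))

X1 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
X2 = 2.0 * X1                        # same shape at a different scale
X3 = X1 + np.array([0.5, -1.0])      # a translate of X1

print(independent_modulo_affine([X1, X2], Z))  # rescaled token set
print(independent_modulo_affine([X1, X3], Z))  # translated token set
```

The second case is a negative example that can be checked by hand: translating every token by c changes the log-sum-exp function by the linear term z · c, so a token set and its translate are always dependent modulo affine functions.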

Theoretical and Practical Implications

The theoretical analysis reveals that the attention-induced coupling in transformers leads to non-local PDEs absent in previous mean-field analyses of deep neural networks. The establishment of well-posedness and convergence for gradient-based training in this regime provides a rigorous underpinning for observed empirical phenomena in overparameterized transformers. The PL inequality derived for this setting confirms that, given a sufficiently expressive attention mechanism (injective NTK), and small-enough initialization loss, the optimization will not get trapped in poor local minima.

This result helps close the gap between abstract mean-field analysis (previously limited mainly to shallow networks or ODE-like residual architectures) and practical transformer models with many heads and layers.

Open Problems and Future Directions

The framework in the paper is currently constrained to encoder-style (non-causal) self-attention, as extending to causal attention in decoder transformers would require fundamentally new technical tools due to loss of the permutation symmetry. Additionally, typical practical transformer architectures scale head dimension with the number of heads—a structural property not yet encompassed in the analyzed mean-field limit, which assumes fixed head dimension and infinite number of heads.

Another promising avenue is extending these ideas to multi-modal or vision transformers that process thousands of image patches, further exploiting the dual mean-field limits over tokens and attention heads.

Conclusion

This paper establishes that training very deep and wide transformers, when formulated in the mean-field regime, can be understood as a gradient flow in a conditional Wasserstein space, governed by neural PDEs that arise from attention. It gives necessary and sufficient conditions for injectivity of the attention NTK, and shows that under this injectivity assumption gradient flow converges to a global minimum, without being trapped by spurious local minima, provided the initial loss is sufficiently small. These techniques enable a precise analysis of optimization landscapes and convergence behavior in large-scale transformer architectures.

Explain it Like I'm 14

Explain-like-I’m-14: “Training Infinitely Deep and Wide Transformers”

1) What is this paper about?

This paper builds a clear, math-based way to understand how very large transformers learn. “Very large” here means:

infinitely many layers (infinite depth), and
infinitely many attention heads per layer (infinite width).

In that limit, the model behaves like a smooth system that moves around huge clouds of tokens and parameters. The authors show that:

the forward pass (what the model computes) is well-defined,
the training signal (the gradient) can be written down exactly, and
under reasonable conditions, training moves toward a best possible solution (a global minimum), at least when you start close enough to it.

They also explain exactly when the attention mechanism has a strong, helpful “fingerprint” (called an injective NTK), which is key to good training behavior.

2) What questions are they trying to answer?

To make the goals concrete, here are the main questions the authors study:

If we take transformers with extremely many layers and attention heads, does the forward pass still make sense mathematically?
Can we write an exact formula for the training gradients in this “infinite” view?
Do these gradients define a clean, well-behaved training process that doesn’t break?
Under what conditions does gradient-based training avoid getting stuck in bad local minima and instead reach the best solution?
What specific property of attention (captured by something called the Neural Tangent Kernel, or NTK) guarantees this good behavior?

3) How do they study it? (With simple ideas and analogies)

The paper turns a huge transformer into a smoother, continuous picture so it’s easier to analyze.

Mean-field idea (think “clouds”):
- tokens become a cloud over the token space,
- attention heads become a cloud over the parameter space (matrices like Q, K, V, and a bias q).
Tokens move like a crowd: As the model goes deeper, tokens “flow” and change. Because attention makes each token look at the others, the whole crowd moves in a coupled way. This is like a fluid or traffic flow described by a PDE (partial differential equation): it tells how the whole cloud changes over “depth time.”
Flow maps (instructions for moving points): Instead of tracking the changing crowd directly, they track a “flow map,” which tells where each starting token ends up at any depth. This flow map satisfies an ODE (ordinary differential equation) but in a space of functions. This trick makes the math cleaner.
Computing gradients by running time backward (adjoint method): To know how to change parameters, you need gradients. The authors compute these exactly by a standard trick: run a related equation backward in “depth time.” This gives a precise gradient formula that matches what backpropagation would give in the infinite-depth limit.
Moving sand the cheapest way (Wasserstein and Conditional Optimal Transport): They measure changes in parameter clouds using a “best-way-to-move-sand” distance (Wasserstein distance). Because parameters are organized by layer, they use a layered version (Conditional Optimal Transport), which means “move the parameter sand, but don’t mix layers.”
A small tweak to attention for math and expressivity: They fix K = Identity (a harmless change of variables) and add a bias q in the query part. The bias q breaks certain symmetries and turns out to be important for proving good properties of the attention NTK.
NTK (Neural Tangent Kernel) as a fingerprint of learning: The NTK tells you how small parameter changes affect the outputs. If this “fingerprint” is injective (different parameter nudges produce different output nudges), training has a better shot at success. The authors give a sharp test for injectivity that boils down to uniqueness properties of “softmax-like” log-sum-exp functions.
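
The "harmless change of variables" behind fixing K = Identity is a one-line calculation. It is written here assuming the attention score between a query token x and a key token y has the form (Qx + q) · (Ky); that form is an assumption for illustration.

```latex
% Attention score with a general key matrix K
(Qx + q)\cdot(Ky) = (Qx + q)^\top K y = \big(K^\top Q\, x + K^\top q\big)\cdot y .

% Renaming \tilde{Q} = K^\top Q and \tilde{q} = K^\top q gives a score with identity keys
(\tilde{Q} x + \tilde{q}) \cdot y .
```

Any score that uses a key matrix can therefore be reproduced with identity keys by adjusting the query matrix and the query bias.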

4) What did they find, and why does it matter?

Here are the key results and why they’re important:

Forward pass is well-defined, even with infinite depth and width. They prove there’s a unique, stable way the token cloud moves through the network. This means the theoretical model makes sense and doesn’t “blow up.”
Exact gradient formula via adjoints. They derive a clear, usable expression for the gradient of the training loss in this infinite setting. That’s the core ingredient for analyzing and guiding training.
Training follows a well-posed gradient flow in a layered “move-the-sand” geometry. Using Conditional Optimal Transport, they show the training process behaves like a gradient flow (the smooth analog of gradient descent), with existence and uniqueness. This gives a strong mathematical foundation for training dynamics.
Sharp condition for a “good” attention NTK (injectivity). The condition tells you when tiny parameter changes produce distinct, informative effects on the model's outputs, which is key for learning. It holds for many kinds of token distributions, including:
- discrete token sets (with a mild distinctness condition),
- uniform distributions on cubes of different sizes,
- symmetric Laplace distributions with different covariances,
- two-component Gaussian mixtures with different separations,
- and any of these after blurring by a small Gaussian.
Convergence to global minima near good starts (no bad local traps nearby). If the NTK meets their condition (in particular, the “V-part” is positive) and your starting loss is small enough, then the gradient flow converges to a global minimum. In simple words: near good initializations, there are no spurious local minima to trap you.

5) So what’s the bigger picture?

Practical guidance for design and initialization: The NTK conditions translate into constructive hints on how to initialize tokens and parameters so training is stable and effective.
Confidence in training very deep/wide transformers: By putting training on a solid mathematical footing, the paper explains why, in the right regime, gradient-based training should work well, even for huge models.
Tools for long-context and large-token settings: Since the method treats tokens as clouds, it naturally scales to very many tokens (long sequences or many image patches).
A bridge between deep learning and modern math: The work connects transformers to ideas from optimal transport (moving sand), ODE/PDE theory (crowd flows), and kernel methods (NTK), opening doors to new training strategies and analyses.

A small note on “infinite depth and width”

No one actually trains a truly infinite model. But studying the infinite limit is a classic math trick: it smooths out messy details and reveals the core behavior. The insights often predict and explain what happens in very large, but finite, real models.

Knowledge Gaps

Below is a single, concrete list of unresolved gaps, limitations, and open questions that emerge from the paper. Each point is phrased to guide actionable follow-up research.

Finite-width/finite-depth approximation: No quantitative rates relate the infinite-width/depth mean-field gradient flow and forward PDE to practical transformers with a finite number of heads H and discrete layers L. What H,L scalings ensure an ε-accurate approximation of the forward dynamics and of the training trajectory?
Discretization error analysis: Absent are error bounds for (i) time-discretizing the continuous-depth ODE/PDE (layer-wise residual blocks) and (ii) Monte Carlo approximations of the parameter measure ρ by finitely many heads. How do these errors accumulate over depth and affect training?
From gradient flow to (stochastic) gradient descent: The theory covers continuous-time gradient flows in the Conditional OT metric, but does not analyze discrete-time GD/SGD (with noise and minibatching). Under what step-size and batch-size regimes do discrete iterates track the metric gradient flow?
Stability and robustness of training dynamics: While well-posedness is shown, quantitative stability (e.g., Lipschitz or Hölder continuity of solutions w.r.t. perturbations in data μ or parameters ρ) is not provided. Can one derive explicit stability constants, and how do they scale with depth and dimension?
Assumptions on token distributions: Results rely on compactly supported measures (and finite second moments). How do existence, uniqueness, and stability extend to unbounded or heavy-tailed token distributions, or to distributions with singular components (e.g., mixtures of discrete and continuous parts)?
Positional information and permutation invariance: The framework omits positional encodings. How can positional encodings (absolute or relative) be incorporated into the mean-field token PDE/flow-map model without breaking well-posedness and the adjoint/gradient-flow analysis?
Layer normalization and other architectural staples: Layer norm, pre/post-norm, dropout, and residual scalings are not modeled. Do the well-posedness and gradient-flow results persist when such operations are included, and how do they alter the NTK and PL-inequality arguments?
Feed-forward (MLP) blocks and multi-block stacks: The theory treats attention-driven flows; standard transformers alternate attention and MLP blocks. How to extend the neural PDE and COT gradient-flow framework to interleaved attention–MLP layers and analyze their joint training dynamics?
Causal masking and decoder/cross-attention: The analysis is for non-causal self-attention without masking. How to formulate and prove well-posedness and convergence for causal masks and decoder cross-attention, where the interaction graph is directional and input–output token sets differ?
Fixed K and added bias q: Keys are fixed to K=I and a query bias q is added for technical reasons. Although a change of variables is argued to preserve expressivity for attention scores, this restriction may interact with values V and downstream operations. What changes if K is learned, and do the regularity and NTK results still hold?
Softmax temperature/scaling: The softmax (log-sum-exp) lacks explicit temperature or 1/√d scaling. How do temperature and dimensional scaling affect Lipschitz constants, stiffness of the ODE/PDE, and the NTK injectivity/conditioning results?
NTK injectivity verification in practice: Injectivity is linked to linear independence of log-sum-exp functions modulo affine functions and shown for selected distributions, but practical datasets rarely match these idealized families. Can one devise empirical tests or sufficient data conditions to certify NTK injectivity on real corpora/images?
NTK evolution during training: Convergence is proved under positivity at initialization and small initial loss. There is no control of how λ_min of the NTK evolves along the training path as ρ drifts. Under what conditions does NTK positivity persist, and how large is the neighborhood where the PL inequality holds?
Size of the “small-loss” basin: The local convergence result requires sufficiently small initial loss but does not quantify this basin of attraction. Can one bound the permissible initial loss in terms of λ_min, dimension d, support radii, and parameter norms?
Global convergence beyond the local regime: The analysis does not exclude spurious minima outside the small-loss region. Under what additional assumptions (e.g., overparameterization, data separability) can one obtain global (not only local) convergence guarantees?
Quantitative constants and dimension dependence: Many bounds are qualitative. What are explicit dependencies of Lipschitz/regularity constants (e.g., for Aρ and the adjoint) on dimension d, token support diameter, and parameter norms ∥Q∥, ∥V∥, ∥q∥?
Parameter growth control: The well-posedness and uniqueness of the gradient flow assume boundedness of parameter norms (e.g., ∥V∥ ≤ R at t=0). Are there a priori bounds that ensure parameters remain bounded along training, or do we need explicit regularization (e.g., weight decay) to prevent blow-up?
Numerical adjoint computations: The adjoint lives in the dual of C^0(S, ℝ^d), i.e., the space of finite vector measures. How can the adjoint ODEs be discretized and solved stably in practice, and what are the error bounds when replacing measure-valued adjoints by finite-dimensional approximations?
Multiple-output/sequence-level losses: The risk is defined on a single token per example. How to extend the framework and convergence analysis to sequence-level objectives (e.g., losses on full token sets or functionals of the output measure)?
Generalization guarantees: The paper analyzes training risk only. Can one derive generalization bounds (e.g., via stability of gradient flows in the COT metric or Wasserstein compression arguments) that connect training dynamics to test performance?
Concentration and sampling effects over tokens: With finitely many tokens per example, empirical μ may deviate from the population measure. What are finite-sample effects on the forward PDE, NTK conditioning, and convergence rates?
Expressivity and value nonlinearity: Values are linear (Vy). How do expressivity and NTK properties change if value maps are nonlinear or if an output projection mixes heads (as in standard multi-head attention)?
Metric choice and algorithmic enforcement: The COT metric enforces a uniform marginal over depth s. How can one enforce this constraint in discrete training algorithms, and is the convergence sensitive to the specific choice of metric (e.g., alternatives to COT)?
Singular behavior and clustering: Prior work reports token clustering dynamics under attention PDEs. Does the proposed training flow ever drive μ toward singular or clustered states that affect differentiability or optimization (e.g., ill-conditioning of adjoint/NTK), and can one prevent such degeneracies?
Empirical validation: No experiments validate the theoretical assumptions (e.g., NTK injectivity on real data) or illustrate the predicted local linear convergence. Which diagnostics and datasets would most effectively stress-test the theory?

Practical Applications

Overview

This paper provides a rigorous mean-field training theory for transformers that are both infinitely deep (continuous in depth) and infinitely wide (infinitely many attention heads). It models token sets as probability measures and shows that training corresponds to a gradient flow in a Conditional Optimal Transport (COT) geometry, with explicit adjoint formulas for gradients. A central practical insight is a set of necessary and sufficient conditions for Neural Tangent Kernel (NTK) injectivity in attention; when these hold and the initial loss is small, gradient flow converges to a global minimum without being trapped by spurious local minima. The conditions are satisfied in common situations when (i) a query bias term is included and (ii) token distributions are sufficiently non-degenerate (e.g., discrete tokens with distinct values, uniform distributions on cubes with different side lengths, Gaussian mixtures), optionally stabilized by mild Gaussian smoothing.

Below are actionable applications derived from these results, grouped by deployment horizon. Each item notes targeted sectors, potential tools/workflows, and key assumptions or dependencies that may affect feasibility.

Immediate Applications

Attention design guideline: add a query bias and ensure non-degenerate token distributions
- Sectors: software/AI infrastructure; NLP, vision; enterprise ML
- Tools/workflows/products: update attention blocks to include a learnable query bias; prefer initializations and embeddings that avoid symmetries; add light Gaussian jitter to token embeddings as a stabilizer when needed
- Assumptions/dependencies: NTK injectivity relies on the presence of a bias term and token-distribution diversity; convergence guarantees are local and require a sufficiently small initial loss
NTK diagnostic checks for attention modules
- Sectors: MLOps, model validation; healthcare/finance (regulated ML); foundation model training
- Tools/workflows/products: “NTK-Attn Check” that estimates the smallest eigenvalue of the empirical attention NTK (or a proxy) per layer and flags near-singularity; integrate into CI/CD for models; use as a gating check for large-scale training runs
- Assumptions/dependencies: empirical NTK approximates mean-field NTK better for wider models; diagnostics require representative token batches and efficient kernel estimation
Memory- and depth-efficient training via adjoint sensitivity for continuous-depth transformers
- Sectors: software/AI infrastructure; cloud platforms; long-context language and vision models
- Tools/workflows/products: adapt NODE-style adjoint backprop to transformer blocks unrolled in (quasi-)continuous depth to reduce memory footprint for very deep stacks; integrate into PyTorch/JAX libraries
- Assumptions/dependencies: adjoint method benefits increase with depth; continuous-depth approximation should track discrete stacks (e.g., small layer step sizes); stable ODE solvers and careful error control needed
Layer-wise training regularization inspired by COT geometry
- Sectors: model training; enterprise AI; safety-critical ML
- Tools/workflows/products: per-layer learning-rate schedules and movement penalties that mimic conditional Wasserstein constraints (e.g., regularizers proportional to integrated per-layer parameter movement norms); layer-wise trust-region updates
- Assumptions/dependencies: the COT metric is a mathematical idealization; use practical surrogates (e.g., per-layer L2 movement penalties); hyperparameters require tuning
Data curation and augmentation to meet NTK-positivity conditions
- Sectors: data engineering; NLP/vision pretraining; RAG systems
- Tools/workflows/products: enforce token diversity in training batches (e.g., sampling policies that avoid degenerate token sets); pretraining pipelines that ensure mixtures of token distributions; mild Gaussian smoothing of embeddings where appropriate
- Assumptions/dependencies: must preserve task-relevant structure; augmentation should not harm downstream accuracy; small additive noise works best early in training
Training stability monitors based on flow-map regularity
- Sectors: MLOps; platform reliability; cloud cost control
- Tools/workflows/products: runtime indicators derived from proxy bounds on the flow-map norm or effective Lipschitz constants to detect instability/exploding updates in very deep stacks; early stopping/trust-region adjustments
- Assumptions/dependencies: relies on proxies (e.g., norms of value matrices V across layers) since exact constants are intractable; alerts should correlate with observed loss spikes
Safer hyperparameter defaults for very deep/very wide transformers
- Sectors: model templates; AutoML; foundation model stacks
- Tools/workflows/products: defaults that encourage small initial loss (e.g., conservative init scales for V, Q, q), explicit query bias, and per-layer step sizes; include NTK diagnostics in hyperparameter sweeps
- Assumptions/dependencies: small-initial-loss regime improves convergence guarantees; defaults must balance stability with representation learning
Academic teaching and analysis aids for transformer dynamics
- Sectors: academia; education; research labs
- Tools/workflows/products: course modules and interactive notebooks visualizing token-measure evolution (PDE) and flow-maps (ODE), adjoint-based gradient derivations, and NTK injectivity examples
- Assumptions/dependencies: pedagogical materials require simplified settings (e.g., 2D tokens) and efficient visualization tools
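
As an illustration of the NTK diagnostic item above, the sketch below computes the smallest eigenvalue of a toy kernel. It assumes a single attention layer with H heads, identity keys, a query bias, one output token per example (the first token of each set), and the output (1/H) Σ_h V_h a_h, where a_h is the attention-weighted average of the tokens under head h. With the kernel rescaled by H, the part coming from the value matrices reduces to G ⊗ I with G[k, l] = (1/H) Σ_h a_h(X_k) · a_h(X_l), so only G is needed. The function names, the scaling, and the restriction to the value part are assumptions for illustration, not a tool or formula from the paper.

```python
import numpy as np


def attended_token(X, Q, q):
    """Softmax-weighted average of the tokens X (n, d), queried from the token X[0]."""
    scores = X @ (Q @ X[0] + q)
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ X


def v_part_gram(token_sets, Qs, qs):
    """Gram matrix G[k, l] = (1/H) * sum_h a_h(X_k) . a_h(X_l) over examples k, l."""
    F = np.array([[attended_token(X, Q, q) for Q, q in zip(Qs, qs)] for X in token_sets])
    F = F.reshape(len(token_sets), -1)  # (K, H * d) feature matrix
    return F @ F.T / len(Qs)


def min_eigenvalue(G):
    """Smallest eigenvalue of a symmetric matrix."""
    return float(np.linalg.eigvalsh(G)[0])


rng = np.random.default_rng(0)
K, n, d, H = 3, 5, 2, 32
token_sets = [rng.normal(size=(n, d)) for _ in range(K)]
Qs = rng.normal(size=(H, d, d))
qs = rng.normal(size=(H, d))

lam_distinct = min_eigenvalue(v_part_gram(token_sets, Qs, qs))
lam_duplicate = min_eigenvalue(v_part_gram(token_sets + [token_sets[0]], Qs, qs))
print(lam_distinct)   # positive for these distinct token sets
print(lam_duplicate)  # numerically zero once an example is duplicated
```

A monitoring workflow along these lines would track the smallest eigenvalue over representative batches and flag values close to zero. In a real model the kernel would also include contributions from the query parameters and from every layer.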

Long-Term Applications

Continuous-depth transformer architectures with provable training dynamics
- Sectors: robotics (streaming control), time-series; real-time systems
- Tools/workflows/products: “ODE-Transformers” with continuous depth and adjoint training for streaming inputs; adaptive-depth inference; stability-aware controllers
- Assumptions/dependencies: requires mature ODE solvers integrated with attention; robust discretization-error control; alignment between continuous and discrete deployments
Particle-based mean-field training of attention head distributions
- Sectors: foundation model pretraining; AI optimization
- Tools/workflows/products: “Particle-Head Optimizers” that evolve head parameters as particles via Wasserstein/Stein variational dynamics to approximate mean-field gradient flows
- Assumptions/dependencies: computational overhead of particle methods must be amortized by improved stability/sample efficiency; requires scalable parallelism
Certifiable training pipelines with convergence certificates
- Sectors: healthcare, finance, public-sector AI; compliance and audit
- Tools/workflows/products: training logs that include NTK injectivity checks, minimal eigenvalue trajectories, and evolution–dissipation inequality (EDI) diagnostics; model cards reporting stability metrics
- Assumptions/dependencies: certificates hold under local small-loss regimes and approximations of NTK; auditors need standardized procedures and thresholds
Token-distribution-aware curricula and scheduling
- Sectors: education tech, LLMs, multimodal models
- Tools/workflows/products: curriculum learning that sequences token distributions to maintain NTK positivity early-on, then gradually increases complexity; dynamic batch composers that target well-conditioned regimes
- Assumptions/dependencies: requires online estimation of conditioning; may trade off short-term task metrics for long-term stability
Transformer controllers for dynamical systems leveraging PDE coupling
- Sectors: autonomous vehicles, industrial control, robotics
- Tools/workflows/products: controllers that interpret multi-sensor streams as token measures and evolve them through continuous-time attention with stability-aware training
- Assumptions/dependencies: closed-loop safety needs additional verification; mappings from tokens to actuation must meet real-time constraints
Architectures that exploit emergent token clustering
- Sectors: retrieval-augmented generation (RAG), search, summarization
- Tools/workflows/products: attention layers designed to intentionally cluster token distributions over depth, improving retrieval and compression in long-context applications
- Assumptions/dependencies: relies on theoretical clustering results; must ensure clusters align with semantic structure; interplay with positional encoding and recurrence
Benchmarking and standards for training stability in large models
- Sectors: policy, standardization bodies, industry consortia
- Tools/workflows/products: standardized reports of NTK diagnostics, conditioning measures, and flow-map stability for long-context and deep stacks; certification criteria for training procedures
- Assumptions/dependencies: community consensus on metrics; efficient, reproducible estimation practices; recognition by regulators
Hardware/compiler support for continuous-depth and measure-based training
- Sectors: semiconductors, cloud accelerators, compilers
- Tools/workflows/products: solver-friendly kernels for continuous-depth attention; primitives for adjoint backprop; libraries optimizing distributional (measure-valued) computations
- Assumptions/dependencies: requires demand from practitioners; tight integration with ML frameworks; careful handling of numerical stability
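
The "ODE-Transformers" item above treats depth as a continuous variable. The following is a minimal sketch of how such a token flow could be discretized with forward Euler steps, where each step plays the role of one residual attention layer. It assumes a single head with identity query, key, and value matrices, which is a toy choice made here and not the paper's parameterization. The names `attention_velocity` and `euler_flow` and the temperature `beta` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_velocity(tokens, beta=1.0):
    """Velocity of every token under a toy single-head self-attention.

    Assumption: identity query, key and value matrices, so the velocity of
    token i is the softmax-weighted average of all tokens.
    """
    velocities = []
    for x in tokens:
        scores = [beta * sum(a * b for a, b in zip(x, y)) for y in tokens]
        weights = softmax(scores)
        avg = [sum(w * y[k] for w, y in zip(weights, tokens))
               for k in range(len(x))]
        velocities.append(avg)
    return velocities


def euler_flow(tokens, depth=1.0, n_layers=10, beta=1.0):
    """Integrate dx_i/ds = attention_velocity(x)_i on [0, depth].

    Each forward Euler step of size depth / n_layers corresponds to one
    residual attention layer of the discretized model.
    """
    h = depth / n_layers
    state = [list(x) for x in tokens]
    for _ in range(n_layers):
        vel = attention_velocity(state, beta)
        state = [[xi + h * vi for xi, vi in zip(x, v)]
                 for x, v in zip(state, vel)]
    return state
```

Calling `euler_flow(tokens, n_layers=N)` for increasing `N` gives a rough way to gauge discretization error: the outputs should stabilize as the step size shrinks. An adaptive ODE solver would replace the fixed-step loop in a more serious implementation.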

Cross-cutting assumptions and dependencies

Mean-field approximations: Practical gains presume wide heads and deep stacks where mean-field and continuous-depth limits approximate finite models.
Initialization regime: Convergence to global minima is local (small initial loss); practitioners may need warm-starts or conservative initialization.
Attention parameterization: Theoretical results use a fixed key matrix (reparametrization equivalence) and add a query bias; adopting equivalent practical parameterizations is straightforward.
Token setting: Analysis targets encoder-style (non-causal) settings with large token sets; some ideas extend to autoregressive contexts but require additional care.
Regularity and solver choices: Continuous-depth training relies on stable ODE solvers and adjoint implementations; discretization error must be controlled in production.
Diagnostics vs guarantees: Empirical proxies (e.g., approximate NTKs, flow-map norms) must be validated to serve as reliable early warnings of instability.
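
The "minimal eigenvalue trajectories" mentioned among the diagnostics can be illustrated with a finite-dimensional proxy: a Gram matrix of feature vectors is strictly positive definite exactly when those vectors are linearly independent. The sketch below is not the paper's NTK. It assumes hypothetical per-sample feature vectors (for instance, rows of a Jacobian) and estimates the smallest Gram eigenvalue with power iteration and a spectral shift, using only the standard library. In practice a library eigensolver such as `numpy.linalg.eigvalsh` would be the more robust choice.

```python
import math


def gram_matrix(features):
    """Gram matrix G[i][j] = <f_i, f_j> of a list of feature vectors."""
    return [[sum(a * b for a, b in zip(fi, fj)) for fj in features]
            for fi in features]


def _matvec(mat, vec):
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]


def _dominant_eigenvalue(mat, iters=500):
    """Largest eigenvalue of a symmetric PSD matrix by power iteration."""
    # Deterministic start vector; assumed not orthogonal to the top eigenvector.
    vec = [1.0 / (i + 1) for i in range(len(mat))]
    value = 0.0
    for _ in range(iters):
        nxt = _matvec(mat, vec)
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            return 0.0
        vec = [x / norm for x in nxt]
        # Rayleigh quotient of the normalized iterate.
        value = sum(v * w for v, w in zip(vec, _matvec(mat, vec)))
    return value


def min_eigenvalue(mat, iters=500):
    """Smallest eigenvalue of a symmetric PSD matrix via a spectral shift.

    The matrix top * I - mat has eigenvalues top - lambda_i, so its largest
    eigenvalue equals top - lambda_min.
    """
    top = _dominant_eigenvalue(mat, iters)
    n = len(mat)
    shifted = [[(top if i == j else 0.0) - mat[i][j] for j in range(n)]
               for i in range(n)]
    return top - _dominant_eigenvalue(shifted, iters)
```

Logging `min_eigenvalue(gram_matrix(features))` over training steps would give one such trajectory; a value drifting toward zero signals nearly dependent features. Whether this proxy tracks the theoretical NTK condition is exactly the validation question raised in the last item above.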

These applications translate the paper’s mathematical insights into concrete design choices, diagnostics, training workflows, and longer-term research and product directions that can improve the stability, efficiency, and certifiability of training very deep and wide transformers.


Glossary

Absolutely continuous curve: A path in a metric space whose variation is integrable and admits a metric derivative almost everywhere. Example: "We say a locally absolutely continuous curve $(\rho_t)_{t \in [0, +\infty)}$ is a gradient flow for the risk $\Ll$ starting from $\rho_0$"
Adjoint equations: Backward-in-time linear differential equations that characterize how sensitivities (adjoint variables) evolve, used to compute gradients. Example: "we will not, in fact, directly rely on the above linear ODEs but rather on adjoint equations"
Adjoint method: A computational technique for gradient evaluation via solving adjoint (backward) differential equations. Example: "Numerical resolution of the adjoint equations is at the core of the adjoint method developed for the training of NODE models~\cite{chen2018neural}."
Adjoint sensitivity analysis: A method to derive gradients with respect to parameters by introducing adjoint variables satisfying backward equations. Example: "Using adjoint sensitivity analysis, we derive an explicit formula for the conditional Wasserstein gradient of the training risk"
Adjoint variables: Dual variables governed by backward ODEs that capture the sensitivity of the loss to perturbations in the forward dynamics. Example: "involving adjoint variables governed by backward ODEs."
Banach-space-valued ODE: An ordinary differential equation where the unknown is a function taking values in a Banach space. Example: "we study here deep Transformers seen as parameterized ODEs on the Banach space $\Cc^0(S, \RR^d)$ (\cref{eq:flow_map_ODE}). As a consequence... the (Banach-space-valued) ODE:"
Bochner integral: The integral of a Banach space–valued function, generalizing the Lebesgue integral to vector-valued settings. Example: "In particular, \cref{eq:A_rho} is to be understood as a Bochner integral"
Borel probability measure: A probability measure defined on the Borel σ-algebra of a topological space. Example: "$\Pp(X)$ is the set of Borel probability measures on $X$"
Borel velocity field: A measurable (Borel) vector field driving a continuity equation in a metric-measure space setting. Example: "where $v : I \times [0,1] \times \Theta \to \Theta$ is some Borel velocity field"
Carathéodory condition: A regularity condition for ODE right-hand sides: measurable in time and (locally) Lipschitz in state, ensuring existence of solutions. Example: "satisfies the Carathéodory condition"
Carathéodory solutions: Solutions to ODEs with right-hand sides satisfying Carathéodory conditions, defined via integral formulations. Example: "this equation is to be understood in the sense of Carathéodory solutions"
Compactly supported probability measure: A probability measure whose support is contained within a compact set. Example: "a compactly supported probability measure $\mu$ with $\Supp(\mu) \subset S$"
Conditional Optimal Transport (COT) metric: A layer-wise Wasserstein metric that preserves a fixed marginal (e.g., uniform in depth) during transport. Example: "modeled by metric gradient flows on the space of parameterizations when provided with a Conditional Optimal Transport (COT) metric."
Conditional Wasserstein gradient: The gradient of a functional defined with respect to a Wasserstein geometry that conditions on an external variable (e.g., depth). Example: "the conditional Wasserstein gradient of the training risk"
Curve of maximal slope: A variational characterization of gradient flows in metric spaces, balancing energy decrease and metric speed. Example: "via the theory of curves of maximal slope in the conditional Wasserstein metric"
Dirac measure: A probability measure concentrated at a single point. Example: "For $x \in X$, we denote by $\delta_x \in \Pp(X)$ the Dirac measure at $x$."
Disintegration (of measures): Decomposition of a joint measure into conditional measures with respect to a marginal. Example: "which is obtained by taking the disintegration of $\rho$ w.r.t. the Lebesgue measure on $[0,1]$"
Evolution Dissipation Inequality (EDI): An inequality expressing that along a gradient flow the rate of decrease of the energy is bounded by dissipation terms. Example: "the following Evolution Dissipation Inequality holds:"
Flow-map: The mapping that transports initial data along the trajectories defined by a (time-dependent) velocity field. Example: "we denote by $\Lambda_\rho[\mu(s=0)](s, x) := x(s)$ the mapping obtained by integrating the ODE"
Fréchet differential: The linear operator that best approximates a function between Banach spaces near a point (the Banach-space analogue of the derivative). Example: "its (Fréchet) differential $\delta \Lambda_t \eqdef \frac{d}{d t} \Lambda_t$"
Gaussian mixture: A probability distribution formed as a convex combination of Gaussian components. Example: "two-component Gaussian mixtures"
Lazy training: A regime where parameters stay close to their initialization and training dynamics are effectively linearized. Example: "In this lazy training regime"
Lebesgue measure: The standard notion of volume measure on Euclidean space. Example: "w.r.t. the Lebesgue measure on $[0,1]$"
Log-sum-exp functions: Smooth convex functions of the form log of a sum of exponentials, closely related to cumulant generating functions. Example: "log-sum-exp (cumulant generating) functions"
Mean field over tokens: Modeling a large (possibly infinite) set of tokens as a probability measure over the token space, leading to PDE descriptions. Example: "We refer to this approach as ``mean field over tokens''."
Mean-field regime: A scaling limit in which network parameters are represented as probability measures and trained via measure-valued dynamics. Example: "the mean-field regime, where both the depth (number of layers) and width (number of attention heads) tend to infinity."
Metric derivative: The speed of a curve in a metric space, defined as the limit of distance increments over time. Example: "with $\left| \frac{d}{d t} \rho_t \right|$ the metric derivative of the curve $(\rho_t)_{t \geq 0}$"
Neural ODE: The continuous-depth limit of residual networks modeled as an ordinary differential equation on features. Example: "Chen et al.~\cite{chen2018neural} introduced Neural ODEs"
Neural PDE: A continuous model where data distributions evolve via a partial differential equation driven by learnable operators. Example: "transformer training corresponds to controlling a neural partial differential equation (PDE)"
Neural SDE: A stochastic differential equation limit of deep residual networks under a different scaling. Example: "while $1/\sqrt{\text{depth}}$ scaling leads to a neural SDE"
Neural Tangent Kernel (NTK): A kernel capturing the linearized training dynamics of overparameterized networks around initialization. Example: "The Neural Tangent Kernel (NTK) regime~\cite{jacot_neural_2021} arises when weights scale as $1/\sqrt{\text{width}}$."
NTK injectivity: The property that the NTK map distinguishes different functions/parameters, often linked to positive definiteness. Example: "NTK injectivity is equivalent to linear independence of log-sum-exp functions modulo affine functions"
Non-local PDE: A partial differential equation where the evolution depends on integrals over the state (e.g., coupling across tokens via attention). Example: "This results in the study of non-local PDEs on the space of token distributions"
Polyak-Łojasiewicz inequality: A condition linking loss suboptimality to the squared gradient norm, implying linear convergence of gradient methods. Example: "proved local convergence (when the initial loss is sufficiently small) under a Polyak-Łojasiewicz condition."
Pushforward (of measures): The image measure obtained by transporting a measure through a measurable map. Example: "for a measurable map $f : X \to Y$ ..., $f_\# \rho$ denotes the pushforward of $\rho$ by $f$."
Reproducing kernel Hilbert space (RKHS): A Hilbert space of functions associated with a positive-definite kernel, enabling kernel regression. Example: "kernel regression in a reproducing kernel Hilbert space (RKHS)."
Total variation norm: A norm on (vector-)measures equal to the measure’s total mass, measuring its size. Example: "the total variation norm is denoted by $\|.\|_{TV}$."
Wasserstein distance: An optimal transport metric between probability measures based on the cost of moving mass. Example: "We denote by $\Ww_p$ the Wasserstein distance on $\Pp_p(X)$"
Wasserstein gradient flow: A gradient flow on the space of probability measures endowed with the Wasserstein metric. Example: "Wasserstein gradient flows~\cite{jordan1998variational,ambrosio2008gradient} describe the training dynamics"
Wasserstein manifold: The geometric space of probability measures under the Wasserstein metric, treated as a manifold-like structure. Example: "training dynamics evolve on the Wasserstein manifold of probability measures"
Well-posedness: The property that a problem admits a unique solution that depends continuously on data. Example: "We establish well-posedness of the forward pass through infinitely deep transformers"
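
Two of the glossary entries above, log-sum-exp functions and the Wasserstein distance, have short closed-form computations in simple settings. The sketch below is an illustration under stated assumptions and is not taken from the paper: `log_sum_exp` uses the standard max-shift trick to avoid overflow, and `wasserstein_1d` uses the fact that, for two equal-size, equally weighted samples on the real line, the optimal transport plan matches sorted points. Both function names are hypothetical.

```python
import math


def log_sum_exp(values):
    """Stable log(sum(exp(v))) using the max-shift identity."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def wasserstein_1d(xs, ys, p=2):
    """W_p between two empirical measures on the real line.

    Assumption: both samples have the same size and uniform weights, so the
    optimal coupling pairs the i-th smallest x with the i-th smallest y.
    """
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    n = len(xs)
    cost = sum(abs(a - b) ** p for a, b in zip(sorted(xs), sorted(ys))) / n
    return cost ** (1.0 / p)
```

For example, `wasserstein_1d(xs, [x + c for x in xs])` recovers `abs(c)` for any shift `c`, since translating a measure moves every unit of mass by the same amount. In higher dimensions the sorting shortcut no longer applies and a general optimal transport solver is needed.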


