Tensor Programs V (TP5): Infinite-Width Attention
- The paper introduces an exact infinite-width characterization for a single attention layer using standard 1/√n scaling, unveiling a fundamentally non-Gaussian and hierarchical output distribution.
- Tensor Programs V (TP5) extends prior frameworks by incorporating dot-product similarity scores, enabling conditional Gaussian limits and a unified treatment of realistic attention mechanisms.
- Numerical validations demonstrate that empirical outputs converge to the predicted hierarchical distributions, offering practical insights into Transformer theory and architecture design.
Tensor Programs V (TP5) refers to the rigorous asymptotic analysis of neural network attention layers in the infinite-width limit, extending the Tensor Programs framework. Previous infinite-width analyses mostly relied on Gaussian process approximations and could only treat attention either in narrow regimes (infinitely many heads, special scalings) or with significant simplifying assumptions. TP5 provides an exact infinite-width characterization for a single attention layer with realistic architectural dimensionality and scaling, revealing a fundamentally non-Gaussian, hierarchical structure for the output distribution.
1. Asymptotic Analysis of Attention Layers
TP5 establishes the limiting law for all variables inside a single attention layer as the channel dimensionality tends to infinity, with a fixed number of heads and the standard scaling. This involves tracking not only the standard vector-valued network outputs but also the scalar similarity scores arising from dot-products of queries and keys.
For any pseudo-Lipschitz observable $\phi$, the empirical mean
$$\frac{1}{n}\sum_{\alpha=1}^{n}\phi\big(u^{1}_{\alpha},\dots,u^{\ell}_{\alpha}\big)$$
converges in probability to
$$\mathbb{E}\big[\phi\big(Z^{u^{1}},\dots,Z^{u^{\ell}}\big)\,\big|\,S\big],$$
where $Z^{u^{1}},\dots,Z^{u^{\ell}}$ are hierarchical random variables defined via the Tensor Programs induction rules and conditioned on the random similarity scores $S$. Crucially, TP5 avoids the infinite-head and tailored-scaling regimes previously required for mathematical tractability.
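As an informal numerical illustration of this dichotomy, the following NumPy sketch (not taken from the paper; the observable `phi`, the helper `attention_preactivations`, and all sampling choices are assumptions made here) contrasts the concentration of coordinatewise empirical averages with the persistent $O(1)$ randomness of the $1/\sqrt{n}$-scaled similarity score across random initializations.

```python
# Minimal sketch, assuming a single query/key pair at width n with iid N(0, 1/n)
# weights. Coordinatewise averages concentrate, while the 1/sqrt(n)-scaled
# dot-product score remains a nondegenerate random scalar.
import numpy as np

def attention_preactivations(n, rng):
    """One token: q = W_Q x, k = W_K x with iid N(0, 1/n) entries in W_Q, W_K."""
    x = rng.standard_normal(n)                 # input with iid N(0,1) coordinates
    WQ = rng.standard_normal((n, n)) / np.sqrt(n)
    WK = rng.standard_normal((n, n)) / np.sqrt(n)
    q, k = WQ @ x, WK @ x
    s = q @ k / np.sqrt(n)                     # similarity score, 1/sqrt(n) scaling
    return q, k, s

phi = lambda a, b: np.tanh(a) * b              # an illustrative pseudo-Lipschitz observable

rng = np.random.default_rng(0)
for n in [256, 1024, 2048]:
    emp, scores = [], []
    for _ in range(20):                        # independent random initializations
        q, k, s = attention_preactivations(n, rng)
        emp.append(np.mean(phi(q, k)))         # empirical coordinatewise average
        scores.append(s)
    # For this observable the limiting conditional expectation is 0 (the N(0,1)
    # limits of q and k coordinates are independent), so the empirical average
    # shrinks with width, while the score s stays O(1) and random.
    print(f"n={n:5d}  mean |empirical average| = {np.mean(np.abs(emp)):.4f}  "
          f"std of score s = {np.std(scores):.3f}")
```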
2. Tensor Programs Framework Extension
Tensor Programs are inductively constructed by composing vector-valued "MatMul" rules and coordinatewise "Nonlin" rules. TP5 extends this framework to fully incorporate dot-product variables:
- For vectors $u$ produced by the Tensor Program, the limiting random variables $Z^{u}$ are defined by the classical Master Theorem.
- For dot-product similarity scores generated as
$$S^{h}_{ij} = \frac{1}{\sqrt{n}}\,\big(q^{h}_{i}\big)^{\top} k^{h}_{j},$$
their limiting distribution (denoted collectively by $S$) is independent Gaussian and treated as a primitive random variable.
The main theorem generalizes the empirical kernel convergences to hierarchical limits involving both $Z$ and $S$, i.e.,
$$\frac{1}{n}\sum_{\alpha=1}^{n}\phi\big(u^{1}_{\alpha},\dots,u^{\ell}_{\alpha}\big)\;\xrightarrow{\ \mathbb{P}\ }\;\mathbb{E}\big[\phi\big(Z^{u^{1}},\dots,Z^{u^{\ell}}\big)\,\big|\,S\big].$$
This result is valid for all bounded, pseudo-Lipschitz observables.
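The toy sketch below is a loose, finite-width illustration of these rule types, not the paper's formal program syntax; the helper names `matmul`, `nonlin`, and `dot_score`, and the particular nonlinearity, are choices made here. It shows vector variables produced by MatMul, a $1/\sqrt{n}$-scaled dot-product scalar treated as an additional program variable, and a coordinatewise Nonlin step that is allowed to depend on that scalar.

```python
# Schematic finite-width "program" with the three rule types discussed above.
# All names are illustrative assumptions, not the paper's notation.
import numpy as np

rng = np.random.default_rng(1)
n = 2048  # width

def matmul(W, x):
    """MatMul rule: produce a new vector from an iid N(0, 1/n) matrix."""
    return W @ x

def nonlin(f, *args):
    """Nonlin rule: apply a coordinatewise map to previously produced vectors
    (and, in the TP5 extension, to scalar score variables)."""
    return f(*args)

def dot_score(u, v):
    """TP5 extension: a scalar dot-product similarity variable, 1/sqrt(n)-scaled."""
    return u @ v / np.sqrt(n)

x = rng.standard_normal(n)
WQ, WK, WV = (rng.standard_normal((n, n)) / np.sqrt(n) for _ in range(3))

q = matmul(WQ, x)            # vector variables from the MatMul rule
k = matmul(WK, x)
v = matmul(WV, x)
s = dot_score(q, k)          # scalar score: a Gaussian primitive in the limit
h = nonlin(lambda vec, score: np.exp(score) * np.tanh(vec), v, s)  # Nonlin may use s
print(f"score s = {s:+.3f}, mean of h over coordinates = {np.mean(h):+.4f}")
```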
3. Hierarchical, Non-Gaussian Limit Laws
TP5 shows that the infinite-width limit of a realistic attention layer is fundamentally non-Gaussian. The limiting output variables are Gaussian conditional on the similarity scores, but the marginal distribution is a mixture of Gaussians—explicitly hierarchical:
- Each vector-valued limiting variable $Z^{u}$ (e.g. the value-projection outputs $Z^{v^{h}_{j}}$) is independent of the similarity scores $S$.
- The output of multi-head attention, for spatial location $i$, is expressed as
$$Z^{\mathrm{out}}_{i} \;=\; \sum_{h=1}^{H}\sum_{j}\operatorname{softmax}_{j}\!\big(S^{h}_{i\,\cdot}\big)\,Z^{v^{h}_{j}},$$
where the softmax weights $\operatorname{softmax}_{j}\big(S^{h}_{i\,\cdot}\big)$ induce nonlinear mixing over the random scores.
In summary, the limiting law is Gaussian given $S$ and globally non-Gaussian as $S$ varies, as the Monte Carlo sketch below illustrates. This is a sharp departure from feedforward layers, which have purely Gaussian infinite-width limits.
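The sketch uses a deliberately simplified setting that does not come from the paper (a single head, one query attending to two key positions, iid standard normal limits for the scores and the value variables); it checks that the marginal output law has strictly positive excess kurtosis, i.e. it is non-Gaussian even though it is exactly Gaussian conditionally on the scores.

```python
# Monte Carlo sketch of the hierarchical limit law: Gaussian given the scores S,
# a scale mixture of Gaussians marginally. Setup is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(2)
num_samples = 200_000

S = rng.standard_normal((num_samples, 2))      # limiting similarity scores (iid N(0,1) here)
expS = np.exp(S)
w = expS / expS.sum(axis=1, keepdims=True)     # softmax attention weights
Zv = rng.standard_normal((num_samples, 2))     # limiting value variables, iid N(0,1)
Z_out = (w * Zv).sum(axis=1)                   # Gaussian conditional on S

# Conditional on S, Var(Z_out | S) = ||w||^2 varies with S, so the marginal law
# is a scale mixture of Gaussians and its excess kurtosis is strictly positive.
z = Z_out / Z_out.std()
print(f"sample excess kurtosis = {np.mean(z**4) - 3.0:.3f}  (0 for a Gaussian)")
```

A Gaussian sample of this size would give an estimate fluctuating around zero; the mixture gives a clearly positive value, which is the non-Gaussian signature described above.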
4. Numerical Validation and Scaling Effects
TP5 validates the theoretical predictions using numerical simulations for a range of widths $n$, scaling rules, and numbers of heads $H$:
- As the width $n$ increases, the empirical distribution of outputs converges to the theoretical hierarchical non-Gaussian law, and the KL divergence between empirical and predicted densities decreases with increasing $n$.
- Comparing the standard $1/\sqrt{n}$ scaling with the non-standard $1/n$ scaling shows that the dot-product scores collapse to degenerate distributions under the latter (as in previous works), confirming the necessity of the standard scaling (illustrated in the sketch below).
- Even at moderate widths, the finite-head ($H$ fixed) regime accurately matches the theoretical limits.
These findings confirm both the qualitative non-Gaussian character and the quantitative predictions of TP5.
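As a lightweight illustration of the scaling comparison (not the paper's experimental protocol; the widths, sample counts, and the `score` helper are assumptions made here), the sketch below tracks the spread of the dot-product score across random initializations. Because $q = W_Q x$ and $k = W_K x$ are exactly Gaussian given $x$, they are sampled directly from that conditional law rather than by forming $n \times n$ matrices.

```python
# Sketch, assuming a single query/key pair: the score's spread stays O(1) under
# 1/sqrt(n) scaling but collapses toward a deterministic value under 1/n scaling.
import numpy as np

rng = np.random.default_rng(3)

def score(n, power, rng):
    """One random initialization: q = W_Q x, k = W_K x with iid N(0, 1/n) weights.
    Given x, q and k are exactly iid N(0, ||x||^2 / n) vectors, so we sample them
    directly; the returned score is q.k / n**power."""
    x = rng.standard_normal(n)
    sigma = np.linalg.norm(x) / np.sqrt(n)
    q = sigma * rng.standard_normal(n)
    k = sigma * rng.standard_normal(n)
    return q @ k / n**power

for n in [128, 512, 2048]:
    s_std = [score(n, 0.5, rng) for _ in range(500)]   # standard 1/sqrt(n) scaling
    s_lin = [score(n, 1.0, rng) for _ in range(500)]   # non-standard 1/n scaling
    print(f"n={n:5d}  std(score, 1/sqrt(n)) = {np.std(s_std):.3f}   "
          f"std(score, 1/n) = {np.std(s_lin):.4f}")
```

Under the standard $1/\sqrt{n}$ scaling the printed spread stays near one for every width, whereas under $1/n$ scaling it shrinks roughly like $1/\sqrt{n}$, i.e. the score distribution degenerates.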
5. Implications for Transformer Theory and Architecture
By providing a closed-form infinite-width theory for the attention mechanism under realistic settings, TP5 enables:
- Unified theoretical study of deep Transformer architectures using the Tensor Programs formalism. Attention can now be treated without artificial assumptions (infinite head count, nonphysical scaling).
- Accurate prediction of layer outputs, inductive bias, and feature evolution for standard model initializations.
- Investigation into training dynamics, mean-field kernels, and neural tangent kernel behavior for Transformers at scale—potentially informing architecture design and initialization strategies.
A plausible implication is that the hierarchical, mixture-of-Gaussians structure identified in TP5 may underlie key aspects of Transformer generalization and robustness.
6. Mathematical Formulations and Examples
Some central formulas derived and utilized in TP5 include:
Formula | Role | Description |
---|---|---|
$S^{h}_{ij} = \frac{1}{\sqrt{n}}\,\big(q^{h}_{i}\big)^{\top} k^{h}_{j}$ | similarity score | Standard attention dot-product, scaled for the infinite-width limit |
$Z^{\mathrm{out}}_{i} = \sum_{h=1}^{H}\sum_{j}\operatorname{softmax}_{j}\big(S^{h}_{i\,\cdot}\big)\,Z^{v^{h}_{j}}$ | attention output | Hierarchical mixture form for multi-head attention |
$\frac{1}{n}\sum_{\alpha=1}^{n}\phi\big(u^{1}_{\alpha},\dots,u^{\ell}_{\alpha}\big) \xrightarrow{\ \mathbb{P}\ } \mathbb{E}\big[\phi\big(Z^{u^{1}},\dots,Z^{u^{\ell}}\big)\,\big|\,S\big]$ | master theorem | Empirical averages converge to conditional hierarchical limits |
This mathematical machinery enables rigorous analysis of attention mechanisms beyond previous GP-style approaches.
7. Context and Significance
TP5 unifies and extends previous Tensor Programs work by closing the analytic gap for attention-based architectures. By characterizing the non-Gaussian infinite-width limit of a single attention layer under standard initialization and scaling, it provides a mathematically sharp foundation for ongoing analysis of deep Transformers, including the nontrivial statistical properties crucial for inference and learning. Numerical results confirm the quantitative validity of the hierarchical theory even at practical network widths.
This framework thus positions Tensor Programs as a central tool for modern deep learning theory, supporting both foundational research and practical investigation into attention-centric deep architectures (Sakai et al., 1 Jun 2025).