Tensor Programs V (TP5): Infinite-Width Attention

Updated 18 August 2025
  • The paper introduces an exact infinite-width characterization for a single attention layer using standard 1/√n scaling, unveiling a fundamentally non-Gaussian and hierarchical output distribution.
  • Tensor Programs V (TP5) extends prior frameworks by incorporating dot-product similarity scores, enabling conditional Gaussian limits and a unified treatment of realistic attention mechanisms.
  • Numerical validations demonstrate that empirical outputs converge to the predicted hierarchical distributions, offering practical insights into Transformer theory and architecture design.

Tensor Programs V (TP5) refers to the rigorous asymptotic analysis of neural network attention layers in the infinite-width limit, extending the Tensor Programs framework. Previous infinite-width analyses mostly relied on Gaussian process approximations and could only treat attention either in narrow regimes (infinitely many heads, special scalings) or with significant simplifying assumptions. TP5 provides an exact infinite-width characterization for a single attention layer with realistic architectural dimensionality and $1/\sqrt{n}$ scaling, revealing a fundamentally non-Gaussian, hierarchical structure for the output distribution.

1. Asymptotic Analysis of Attention Layers

TP5 establishes the limiting law for all variables inside a single attention layer as the channel dimensionality $n$ tends to infinity, with a fixed number of heads and the standard $1/\sqrt{n}$ scaling. This involves tracking not only the standard vector-valued network outputs but also the scalar similarity scores arising from dot-products of queries and keys.

For any pseudo-Lipschitz observable $\psi$, the empirical mean

$$\frac{1}{n}\sum_{\alpha=1}^n \psi(h^1_\alpha, \ldots, h^k_\alpha)$$

converges in probability to

$$\mathbb{E}\left[\psi(Z^{h^1}, \ldots, Z^{h^k})\right],$$

where $Z^{h^i}$ are hierarchical random variables defined via the Tensor Programs induction rules and conditioned on the random similarity scores. Crucially, TP5 avoids the infinite-head or tailored scaling regimes previously required for mathematical tractability.
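
As a concrete illustration of this statement, the following minimal Monte Carlo sketch (not from the paper; it assumes i.i.d. standard Gaussian inputs and weights and uses the simplest MatMul program $h = Wx/\sqrt{n}$) checks that the coordinatewise empirical average of a pseudo-Lipschitz observable approaches $\mathbb{E}[\psi(Z^h)]$, here with $\psi(h) = h^2$ and limiting value 1.

```python
# Minimal sketch (assumptions: i.i.d. N(0,1) inputs and weight entries) of the
# master-theorem statement for the simplest MatMul program h = W x / sqrt(n):
# the coordinatewise empirical average of psi approaches E[psi(Z^h)] with Z^h ~ N(0,1).
import numpy as np

rng = np.random.default_rng(0)

def empirical_average(n: int, psi) -> float:
    x = rng.standard_normal(n)          # input vector, i.i.d. N(0,1) coordinates
    W = rng.standard_normal((n, n))     # weight matrix, i.i.d. N(0,1) entries
    h = W @ x / np.sqrt(n)              # MatMul rule with 1/sqrt(n) scaling
    return psi(h).mean()                # (1/n) * sum_alpha psi(h_alpha)

psi = lambda h: h ** 2                  # a simple pseudo-Lipschitz observable
for n in (64, 512, 4096):
    print(n, empirical_average(n, psi)) # should approach E[(Z^h)^2] = 1
```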

2. Tensor Programs Framework Extension

Tensor Programs are inductively constructed by composing vector-valued "MatMul" rules and coordinatewise "Nonlin" rules. TP5 extends this framework to fully incorporate dot-product variables:

  • For vectors produced by the Tensor Program, limiting random vectors $Z^h$ are defined by the classical master theorem.
  • For dot-product similarity scores $p_{i,j}^{(a)}$ generated as

$$p_{i,j}^{(a)} = \frac{\left(W^{Q,a} x^i\right)^\top \left(W^{K,a} x^j\right)}{\sqrt{n}},$$

their limiting distribution (denoted $\mathring{p}$) is independent Gaussian and treated as a primitive random variable (illustrated in the sketch below).

The main theorem generalizes the empirical kernel convergences to hierarchical limits involving both $Z^h$ and $\mathring{p}$, i.e.,

$$\frac{1}{n}\sum_{\alpha=1}^n \psi(h_\alpha^1, \ldots, h_\alpha^k) \xrightarrow{d} \mathbb{E}\bigl[\psi(Z^{h^1}, \ldots, Z^{h^k})\,\big|\,\mathring{p}_1, \ldots, \mathring{p}_r\bigr].$$

This result is valid for all bounded, pseudo-Lipschitz observables.
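
To make the role of $\mathring{p}$ concrete, the sketch below (an illustrative toy under assumed i.i.d. Gaussian inputs and weight entries of variance $1/n$, not the paper's experimental setup) draws many independent query/key matrices at a fixed width and checks that the $1/\sqrt{n}$-scaled score has mean near zero, $O(1)$ spread, and approximately Gaussian shape, which is what justifies treating its limit as a primitive Gaussian variable.

```python
# Sketch (assumptions: i.i.d. N(0,1) inputs, weight entries N(0, 1/n)) of the
# 1/sqrt(n)-scaled similarity score p = (W^Q x^i)^T (W^K x^j) / sqrt(n) across
# independent weight draws: it stays O(1) and looks approximately Gaussian.
import numpy as np

rng = np.random.default_rng(1)
n, trials = 512, 400

xi, xj = rng.standard_normal(n), rng.standard_normal(n)   # two fixed input vectors

scores = np.empty(trials)
for t in range(trials):
    WQ = rng.standard_normal((n, n)) / np.sqrt(n)          # entries ~ N(0, 1/n)
    WK = rng.standard_normal((n, n)) / np.sqrt(n)
    scores[t] = (WQ @ xi) @ (WK @ xj) / np.sqrt(n)         # scaled dot-product score

z = (scores - scores.mean()) / scores.std()
print("mean:", scores.mean(), "std:", scores.std())        # mean ~ 0, std = O(1)
print("excess kurtosis (0 for Gaussian):", (z ** 4).mean() - 3.0)
```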

3. Hierarchical, Non-Gaussian Limit Laws

TP5 shows that the infinite-width limit of a realistic attention layer is fundamentally non-Gaussian. The limiting output variables are Gaussian conditional on the similarity scores, but the marginal distribution is a mixture of Gaussians—explicitly hierarchical:

  • Each vector-valued variable $Z^{\mathcal{V}}$ (e.g. value-projection outputs) is independent of the similarity scores $\mathring{p}$.
  • The output of multi-head attention, for spatial location $i$, is expressed as

$$Z^{y^i} = \sum_{a=1}^H \sum_{j=1}^s \mathrm{SoftMax}_j\bigl(\mathring{p}_{i,1}^{(a)}, \ldots, \mathring{p}_{i,s}^{(a)}\bigr) \cdot Z^{\tilde{\mathcal{V}}^{(a, j)}},$$

where the $\mathrm{SoftMax}$ weights induce nonlinear mixing over the random $\mathring{p}$ scores.

In summary, the limiting law is Gaussian given $\mathring{p}$ and globally non-Gaussian as $\mathring{p}$ varies. This is a sharp departure from feedforward layers, which have purely Gaussian infinite-width limits.
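
The hierarchical law above can be sampled directly. The following sketch is an illustration under simplifying assumptions (standard Gaussian $\mathring{p}$ scores; unit-variance $Z^{\tilde{\mathcal{V}}}$ variables drawn independently across heads and positions): it builds $Z^{y^i}$ by mixing Gaussian value variables with softmax weights of Gaussian scores, and the resulting marginal shows positive excess kurtosis, i.e., heavier tails than any single Gaussian.

```python
# Sampling sketch of the hierarchical limit law (assumptions: standard Gaussian
# p_ring scores; unit-variance Z^V variables, independent across heads/positions).
# Conditionally on the scores the output is Gaussian; marginally it is a scale
# mixture of Gaussians, detectable via positive excess kurtosis.
import numpy as np

rng = np.random.default_rng(2)
H, s, samples = 4, 8, 200_000                       # heads, sequence length, MC samples

p = rng.standard_normal((samples, H, s))            # p_ring scores (primitive Gaussians)
w = np.exp(p)
w /= w.sum(axis=-1, keepdims=True)                  # SoftMax over the key index j
ZV = rng.standard_normal((samples, H, s))           # Gaussian value variables Z^V
Zy = (w * ZV).sum(axis=(-1, -2))                    # hierarchical output Z^{y}

z = (Zy - Zy.mean()) / Zy.std()
print("excess kurtosis of Z^y (0 for a Gaussian):", (z ** 4).mean() - 3.0)
```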

4. Numerical Validation and Scaling Effects

TP5 validates the theoretical predictions using numerical simulations for a range of widths $n$, scaling rules, and head numbers $H$:

  • As $n$ increases, the empirical distribution of outputs converges to the theoretical hierarchical non-Gaussian law; KL divergence between empirical and predicted densities decreases as $n \to \infty$.
  • Comparison of the classical $1/\sqrt{n}$ scaling versus the non-standard $1/n$ scaling reveals collapse of dot-product scores to degenerate distributions under the latter (as in previous works), confirming the necessity of the standard scaling.
  • Even at moderate widths, the finite-head, $1/\sqrt{n}$ regime accurately matches the theoretical limits.

These findings confirm both the qualitative non-Gaussian character and the quantitative predictions of TP5.
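
A finite-width version of this check is straightforward to sketch. The code below is illustrative only (it assumes i.i.d. Gaussian inputs, weight entries of variance $1/n$, and sums head outputs without an output projection): it simulates a width-$n$ multi-head attention layer over many weight draws and pools output coordinates; the pooled distribution should display the positive excess kurtosis predicted by the mixture-of-Gaussians limit, in contrast to the zero excess kurtosis of a purely Gaussian feedforward limit.

```python
# Finite-width sanity check (assumptions: i.i.d. N(0,1) inputs, weight entries
# N(0, 1/n), head outputs summed with no output projection): output coordinates
# pooled over independent weight draws show heavier-than-Gaussian tails,
# consistent with the hierarchical (mixture-of-Gaussians) limit rather than a plain GP.
import numpy as np

rng = np.random.default_rng(3)
n, s, H, draws = 256, 8, 2, 400                 # width, sequence length, heads, weight draws

X = rng.standard_normal((s, n))                 # fixed inputs x^1, ..., x^s
outputs = []
for _ in range(draws):
    y0 = np.zeros(n)                            # attention output at position i = 0
    for _ in range(H):
        WQ = rng.standard_normal((n, n)) / np.sqrt(n)
        WK = rng.standard_normal((n, n)) / np.sqrt(n)
        WV = rng.standard_normal((n, n)) / np.sqrt(n)
        q0 = WQ @ X[0]                          # query for position 0
        K = X @ WK.T                            # keys for all positions, shape (s, n)
        V = X @ WV.T                            # values for all positions, shape (s, n)
        p = K @ q0 / np.sqrt(n)                 # 1/sqrt(n)-scaled similarity scores
        a = np.exp(p - p.max()); a /= a.sum()   # SoftMax attention weights
        y0 += a @ V                             # head contribution at position 0
    outputs.append(y0)

pooled = np.concatenate(outputs)
z = (pooled - pooled.mean()) / pooled.std()
print("excess kurtosis (approx. 0 for a Gaussian limit):", (z ** 4).mean() - 3.0)
```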

5. Implications for Transformer Theory and Architecture

By providing a closed-form infinite-width theory for the attention mechanism under realistic settings, TP5 enables:

  • Unified theoretical study of deep Transformer architectures using the Tensor Programs formalism. Attention can now be treated without artificial assumptions (infinite head count, nonphysical scaling).
  • Accurate prediction of layer outputs, inductive bias, and feature evolution for standard model initializations.
  • Investigation into training dynamics, mean-field kernels, and neural tangent kernel behavior for Transformers at scale—potentially informing architecture design and initialization strategies.

A plausible implication is that the hierarchical, mixture-of-Gaussians structure identified in TP5 may underlie key aspects of Transformer generalization and robustness.

6. Mathematical Formulations and Examples

Some central formulas derived and utilized in TP5 include:

| Formula | Role | Description |
| --- | --- | --- |
| $p_{i,j}^{(a)} = \frac{(W^{Q,a} x^i)^\top (W^{K,a} x^j)}{\sqrt{n}}$ | Similarity score | Standard attention dot-product, scaled for the infinite-width limit |
| $Z^{y^i} = \sum_{a=1}^H \sum_{j=1}^s \mathrm{SoftMax}_j(\mathring{p}_{i,1}^{(a)}, \ldots, \mathring{p}_{i,s}^{(a)}) \cdot Z^{\tilde{\mathcal{V}}^{(a, j)}}$ | Attention output | Hierarchical mixture form for multi-head attention |
| $\frac{1}{n}\sum_{\alpha=1}^n \psi(h_\alpha^1, \ldots, h_\alpha^k) \xrightarrow{d} \mathbb{E}\bigl[\psi(Z^{h^1}, \ldots, Z^{h^k}) \mid \mathring{p}_1, \ldots, \mathring{p}_r\bigr]$ | Master theorem | Empirical averages converge to conditional hierarchical limits |

This mathematical machinery enables rigorous analysis of attention mechanisms beyond previous Gaussian-process-style approaches.

7. Context and Significance

TP5 unifies and extends previous Tensor Programs work by closing the analytic gap for attention-based architectures. By characterizing the non-Gaussian infinite-width limit of a single attention layer under standard initialization and scaling, it provides a mathematically sharp foundation for ongoing analysis of deep Transformers, including the nontrivial statistical properties crucial for inference and learning. Numerical results confirm the quantitative validity of the hierarchical theory even at practical network widths.

This framework thus positions Tensor Programs as a central tool for modern deep learning theory, supporting both foundational research and practical investigation into attention-centric deep architectures (Sakai et al., 1 Jun 2025).
