Heavy-tailed targets and power-law spectral tails

Investigate whether, for empirical risk minimization of single-head tied attention in the high-dimensional regime studied in the paper cited below, a heavy-tailed distribution for the target weight matrix S0 produces power-law tails in the singular-value distribution of the learned weights, and thereby reproduces the heavy-tailed spectral phenomenology observed empirically in large transformers.

Background

The authors derive an exact asymptotic spectral law for the learned weights and show it matches several empirical features, but power-law tails do not arise under the Marchenko–Pastur target used in their experiments.

They conjecture that heavy-tailed target spectra could generate the missing power-law behavior and explicitly defer the investigation of such models.
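To make the contrast between the two kinds of target concrete, the sketch below builds a Marchenko–Pastur-type target and a heavy-tailed target, then compares the tail of their spectra with a Hill estimator. It inspects only the target S0, not the learned weights, and it does not run the empirical risk minimization analyzed in the paper. The function names, the Wishart normalization `W @ W.T / r`, the Pareto eigenvalue model, and the sizes `d`, `r`, `k` are all illustrative assumptions rather than the authors' setup.

```python
import numpy as np


def hill_tail_index(values, k):
    """Hill estimate of the tail index alpha from the k largest values."""
    x = np.sort(np.asarray(values, dtype=float))[::-1]  # descending order
    if not 0 < k < x.size:
        raise ValueError("k must satisfy 0 < k < len(values)")
    if x[k] <= 0:
        raise ValueError("the (k+1)-th largest value must be positive")
    # alpha_hat = k / sum_{i<k} log(x_(i) / x_(k+1))
    return k / np.sum(np.log(x[:k] / x[k]))


def mp_target_spectrum(d, r, rng):
    """Eigenvalues of a Wishart-type target W W^T / r (assumed normalization)."""
    w = rng.standard_normal((d, r))
    s0 = w @ w.T / r
    return np.linalg.eigvalsh(s0)


def heavy_tailed_target_spectrum(d, alpha, rng):
    """Eigenvalues of a symmetric target with Pareto(alpha) spectrum (assumed model)."""
    # Generator.pareto samples a Lomax law; adding 1 gives a classical Pareto with x_m = 1.
    lam = rng.pareto(alpha, size=d) + 1.0
    # Rotate into a random orthonormal basis so S0 is a dense symmetric matrix.
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    s0 = (q * lam) @ q.T
    return np.linalg.eigvalsh(s0)


rng = np.random.default_rng(0)
d, r, k = 400, 800, 40  # hypothetical sizes chosen only for illustration

alpha_mp = hill_tail_index(mp_target_spectrum(d, r, rng), k)
alpha_heavy = hill_tail_index(heavy_tailed_target_spectrum(d, 1.5, rng), k)

print(f"Hill estimate, MP-type target:      {alpha_mp:.2f}")
print(f"Hill estimate, heavy-tailed target: {alpha_heavy:.2f}")
```

Because the Marchenko–Pastur spectrum has bounded support, its Hill estimate is large and grows as `k` shrinks, whereas the Pareto target gives an estimate near the chosen exponent. Applying the same diagnostic to the singular values of learned weights would be one way to probe the conjecture numerically, though the choice of `k` strongly affects the estimate.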

References

The other main feature, i.e. power-law tails, are not observed in the MP target. We conjecture that a model with heavy-tailed target distribution would feature such phenomenology, but leave such exploration for future work.

— Inductive Bias and Spectral Properties of Single-Head Attention in High Dimensions (2509.24914 - Boncoraglio et al., 29 Sep 2025) in Section 4, Exact spectral law of the learned weights
