
Determine the exact spectrum and Coulomb gas potential for transformer attention weight matrices

Determine the exact eigenvalue distribution (the stationary spectrum) and the corresponding Coulomb gas potential V_i(x) governing the Dyson Brownian motion of X = K^T K for the Key matrix in the nano-GPT transformer studied, thereby extending the fully analytic characterization available for the Gaussian restricted Boltzmann machine to this transformer setting.


Background

The paper models learning dynamics of weight matrices via Dyson Brownian motion, leading to a Coulomb gas description whose stationary distribution depends on a model-specific potential V_i(x). For the Gaussian restricted Boltzmann machine (RBM), the drift and potential are known analytically, enabling full control of the eigenvalue dynamics and final spectral distribution.
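For orientation, the generic structure of such a description can be written as follows. This is a sketch in one common normalization, not necessarily the paper's: the noise strength g and the placement of constant factors are assumptions made here for illustration, and only the potential V_i(x) is taken from the text above.

```latex
% Dyson Brownian motion for eigenvalues x_i (assumed normalization, noise strength g)
\frac{dx_i}{dt} = -\frac{\partial V_i(x_i)}{\partial x_i}
                  + g^2 \sum_{j \neq i} \frac{1}{x_i - x_j}
                  + \sqrt{2}\, g\, \eta_i(t),
\qquad \langle \eta_i(t)\, \eta_j(t') \rangle = \delta_{ij}\, \delta(t - t')

% Corresponding stationary (Coulomb gas) distribution
P_{\mathrm{s}}(\{x_i\}) \propto \prod_{i<j} |x_i - x_j| \;
    \exp\!\left( -\frac{1}{g^2} \sum_i V_i(x_i) \right)
```

In this form the pairwise repulsion term is universal, while all model-specific information sits in V_i(x). This is why knowing the potential amounts to knowing the final spectral distribution.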

For transformers (specifically a nano-GPT with AdamW optimization), the authors analyze empirically the evolution of the eigenvalue spectrum of X = K^T K for the Key matrix. While the unfolded level spacings match Random Matrix Theory predictions (Wigner surmise), the spectral density develops heavy tails and deviates from the Marchenko–Pastur form during training.
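The kind of spectral diagnostics described above can be sketched numerically. The following is a minimal illustration, not the authors' code: it uses a random Gaussian matrix as a stand-in for a Key matrix at initialization, assumes the normalization X = K^T K / M (the 1/M factor is an assumption added here so that the standard unit-variance Marchenko–Pastur law applies), and unfolds the spectrum with the Marchenko–Pastur cumulative distribution. All function names are hypothetical.

```python
import numpy as np

def gram_eigenvalues(K):
    """Sorted eigenvalues of X = K^T K / M for an M x N matrix K (assumed normalization)."""
    M = K.shape[0]
    return np.linalg.eigvalsh(K.T @ K / M)

def mp_edges(q):
    """Support edges of the unit-variance Marchenko-Pastur law, q = N / M <= 1."""
    return (1 - np.sqrt(q)) ** 2, (1 + np.sqrt(q)) ** 2

def marchenko_pastur_pdf(x, q):
    """Marchenko-Pastur density for unit-variance entries and aspect ratio q = N / M."""
    lo, hi = mp_edges(q)
    x = np.asarray(x, dtype=float)
    pdf = np.zeros_like(x)
    inside = (x > lo) & (x < hi)
    xi = x[inside]
    pdf[inside] = np.sqrt((hi - xi) * (xi - lo)) / (2 * np.pi * q * xi)
    return pdf

def wigner_surmise_goe(s):
    """Wigner surmise for the GOE class: P(s) = (pi/2) s exp(-pi s^2 / 4)."""
    s = np.asarray(s, dtype=float)
    return (np.pi / 2) * s * np.exp(-np.pi * s ** 2 / 4)

def unfolded_spacings(eigs, q, grid_size=20001):
    """Unfold with the Marchenko-Pastur CDF so that the mean spacing is close to 1."""
    lo, hi = mp_edges(q)
    grid = np.linspace(lo, hi, grid_size)
    cdf = np.cumsum(marchenko_pastur_pdf(grid, q)) * (grid[1] - grid[0])
    cdf /= cdf[-1]  # remove the small discretization error in the normalization
    unfolded = len(eigs) * np.interp(eigs, grid, cdf)
    return np.diff(unfolded)

# Random stand-in for a Key matrix; a trained matrix would be loaded here instead.
rng = np.random.default_rng(0)
M, N = 2000, 500
K = rng.standard_normal((M, N))

q = N / M
eigs = gram_eigenvalues(K)
spacings = unfolded_spacings(eigs, q)

# Histogram of unfolded spacings, to be compared with the Wigner surmise.
hist, bin_edges = np.histogram(spacings, bins=30, range=(0.0, 3.0), density=True)
bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
surmise = wigner_surmise_goe(bin_centers)
```

For a trained matrix whose density has left the Marchenko–Pastur form, the unfolding step would need a different reference density, for example a smoothed empirical one. The choice of Marchenko–Pastur here is only appropriate for the random stand-in.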

In this transformer setting, unlike the RBM, the exact final eigenvalue spectrum and the underlying Coulomb gas potential that would produce it are explicitly stated to be unknown. Determining these would clarify the non-universal aspects of the spectrum and provide a principled, model-specific characterization within the Coulomb gas framework.

References

This evolution should be compared to the evolution in the RBM in Fig.~\ref{fig:RBM_eig_flow}, with the notable difference that the ``exact'' spectrum or Coulomb gas potential are not known.

Dyson Brownian motion and random matrix dynamics of weight matrices during learning (2411.13512 - Aarts et al., 20 Nov 2024) in Section 3.2 (Transformer)