Analysis of the Scaling Limits in Transformer Models
This paper examines the scaling limits inherent to transformer models, with a particular emphasis on their training dynamics in the feature learning regime. The authors meticulously analyze the infinite width and depth limits to discern how different parameterizations influence model behavior during training, using tools from dynamical mean field theory (DMFT) to elucidate these dynamics.
The paper is motivated by the persistent challenge of understanding the behavior of increasingly large transformer models, a class of architectures that has markedly improved performance in fields such as computer vision and language modeling. By identifying parameterizations that maintain stable training dynamics across scales—such as the mean field parameterization (μP)—the authors aim to provide theoretical groundwork that could guide empirical practices.
Contributions
The paper makes several notable contributions:
- DMFT Derivation for Transformer Models: The authors derive the DMFT for randomly initialized transformers, particularly focusing on the key/query dimension $N$, the head count, and the depth.
- Necessary Scaling for the Infinite Key/Query Limit: It is analytically demonstrated that the $N \to \infty$ limit necessitates μP scaling of the key/query inner product by $1/N$. This finding holds even when keys and queries are reparameterized to decrease the size of gradient descent updates. A minimal sketch of this scaling appears after this list.
- Head Dynamics in the $N \to \infty$ Limit: The paper shows that in the infinite key/query dimension limit, multi-head self-attention effectively collapses to single-head attention, as all heads follow identical dynamics.
- Addressing Head Collapse: To mitigate the collapse of multi-head attention, the authors analyze the infinite-head limit at finite key/query dimension $N$. Here, they find that attention dynamics remain distributed across heads, leading to deterministically evolving training dynamics.
- Large Depth Limits: The paper further explores the large depth limits of transformers with residual branch scaling, identifying the tension between maintaining initial kernel structure and enabling feature learning within multi-head self-attention (MHSA) and multi-layer perceptron (MLP) blocks.
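To make the key/query scaling in the second contribution concrete, here is a minimal numpy sketch of a multi-head self-attention forward pass with the logit normalization exposed as a parameter: an exponent of 1.0 corresponds to the μP-style $1/N$ scaling discussed above, while 0.5 gives the standard $1/\sqrt{N}$ choice. The function and variable names are illustrative assumptions, not the paper's code.

```python
import numpy as np

def attention_forward(X, W_q, W_k, W_v, logit_exponent=1.0):
    """Multi-head self-attention (no masking, no output projection).

    X: (T, d_model) token representations
    W_q, W_k, W_v: (H, d_model, N) per-head projections, N = key/query dimension
    logit_exponent: logits are divided by N**logit_exponent;
        1.0 -> muP-style 1/N scaling (stable feature-learning limit)
        0.5 -> standard 1/sqrt(N) scaling
    """
    H, _, N = W_q.shape
    head_outputs = []
    for h in range(H):
        q = X @ W_q[h]                              # (T, N)
        k = X @ W_k[h]                              # (T, N)
        v = X @ W_v[h]                              # (T, N)
        logits = (q @ k.T) / N ** logit_exponent    # (T, T)
        a = np.exp(logits - logits.max(axis=-1, keepdims=True))
        a /= a.sum(axis=-1, keepdims=True)          # softmax over keys
        head_outputs.append(a @ v)                  # (T, N)
    return np.concatenate(head_outputs, axis=-1)    # (T, H*N)

# Example usage with arbitrary sizes
T, d_model, H, N = 16, 64, 4, 32
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d_model))
W_q, W_k, W_v = (rng.normal(size=(H, d_model, N)) / np.sqrt(d_model) for _ in range(3))
out = attention_forward(X, W_q, W_k, W_v)           # muP-style 1/N logits by default
```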
Implications and Theoretical Insights
Infinite Width and Head Limits
The paper's exploration of the infinite key/query dimension limit ($N \to \infty$) highlights a critical aspect of transformer training: μP scaling is essential to maintain stable updates. Without this scaling, the key/query inner products, and hence the attention logits and the associated dynamical variables, diverge once training correlates queries and keys, fundamentally altering the training process.
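A toy numerical check of this reasoning (my own illustration, not the paper's experiment): at initialization, independent queries and keys give inner products of order $\sqrt{N}$, but once training aligns them the inner product grows like $N$, so only a $1/N$ normalization keeps the logits bounded as $N \to \infty$.

```python
import numpy as np

rng = np.random.default_rng(0)
for N in [64, 256, 1024, 4096]:
    q = rng.normal(size=N)
    k_indep = rng.normal(size=N)                # independent key (as at initialization)
    k_aligned = q + 0.5 * rng.normal(size=N)    # key correlated with the query (after feature learning)
    print(f"N={N:5d}  |q.k_indep|/sqrt(N) = {abs(q @ k_indep) / np.sqrt(N):6.2f}   "
          f"(q.k_aligned)/N = {(q @ k_aligned) / N:6.2f}")

# The independent inner product fluctuates at the sqrt(N) scale, while the
# aligned inner product grows linearly in N; dividing the logits by N keeps
# both contributions O(1) in the N -> infinity limit.
```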
One particularly intriguing result is the effective collapse of multi-head attention to single-head dynamics as the key/query dimension $N$ grows. This could imply diminishing returns when increasing $N$ unless counterbalanced by scaling the number of heads independently. The authors thus pivot to investigating the infinite-head limit at finite $N$, arguing that distributed attention dynamics maintain diversity across heads and lead to deterministic training behavior.
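The intuition behind this limit can be sketched numerically. With a finite key/query dimension, each randomly initialized head produces a different attention pattern, and the head-averaged output concentrates as the number of heads grows. The demo below (my own static analogue of the paper's dynamical statement, with illustrative names) shows the fluctuations of the head average shrinking roughly like $1/\sqrt{H}$.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d_model, N = 8, 32, 4                      # few tokens, small key/query dimension
X = rng.normal(size=(T, d_model))

def head_output(rng):
    """One randomly initialized attention head with key/query dimension N."""
    W_q, W_k, W_v = (rng.normal(size=(d_model, N)) / np.sqrt(d_model) for _ in range(3))
    logits = (X @ W_q) @ (X @ W_k).T / N
    a = np.exp(logits - logits.max(axis=-1, keepdims=True))
    a /= a.sum(axis=-1, keepdims=True)
    return a @ (X @ W_v)

for H in [4, 64, 1024]:
    # average of H independently initialized heads; repeat to estimate its fluctuations
    trials = [np.mean([head_output(rng) for _ in range(H)], axis=0) for _ in range(20)]
    print(f"H={H:5d}  std of head-averaged output = {np.std(np.stack(trials), axis=0).mean():.4f}")

# The spread decays roughly like 1/sqrt(H): with many heads at finite N, the
# head-averaged attention output becomes deterministic while individual heads
# remain distinct, consistent with the distributed-yet-deterministic picture.
```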
Large Depth Limits
In the large-depth limit, the paper identifies two regimes depending on how the residual branches are scaled with depth. In one regime, updates to the MHSA and MLP blocks persist throughout training, although the initial representations lose their structure. In the other, the initial kernel structure is preserved, but the weights within the residual blocks remain frozen, yielding behavior reminiscent of shallow feature learning in the context of deep networks.
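A minimal sketch of how such depth-dependent branch scaling enters a pre-norm residual block is shown below; the exponent `alpha` interpolating between regimes and the helper names are assumptions for illustration, not the paper's notation.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token representation to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def residual_block(x, mhsa_fn, mlp_fn, L, alpha=0.5):
    """Pre-norm transformer block with residual branches scaled by L**(-alpha).

    x: (T, d_model) residual stream
    mhsa_fn, mlp_fn: callables implementing the attention and MLP branches
    L: total network depth; alpha: depth-scaling exponent of the branches
    """
    branch_scale = L ** (-alpha)                 # e.g. 1/sqrt(L) when alpha = 0.5
    x = x + branch_scale * mhsa_fn(layer_norm(x))
    x = x + branch_scale * mlp_fn(layer_norm(x))
    return x
```

Stacking $L$ such blocks, the branch multiplier controls how much each block can move the residual stream as depth grows, which is the knob behind the trade-off between preserving the initial kernel and allowing feature learning inside the blocks.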
Empirical Evidence and Practical Considerations
The authors validate their theoretical models through empirical investigations on CIFAR-5M and the C4 dataset, demonstrating that scaling different model parameters produces varying degrees of performance improvement. Notably, increasing the key/query dimension $N$ under μP scaling improved stability across model scales, while scaling the number of heads or the layer depth provided different benefits depending on the specific parameterization.
Conclusion and Future Work
This paper provides a comprehensive analysis of the scaling limits in transformer models using DMFT, presenting important insights into the stable convergence of training dynamics in large-scale settings. Future work could extend these findings to different optimizer settings, particularly Adam, and explore the interplay between model size and training duration to optimize compute resources effectively.
The research underscores the nuanced trade-offs between various scaling parameters, painting a detailed picture that can guide the development of more efficient, scalable transformer architectures in the future.