Stabilizing Transformer Training by Preventing Attention Entropy Collapse (2303.06296v2)

Published 11 Mar 2023 in cs.LG, cs.AI, cs.CL, cs.CV, and stat.ML

Abstract: Training stability is of great importance to Transformers. In this work, we investigate the training dynamics of Transformers by examining the evolution of the attention layers. In particular, we track the attention entropy for each attention head during the course of training, which is a proxy for model sharpness. We identify a common pattern across different architectures and tasks, where low attention entropy is accompanied by high training instability, which can take the form of oscillating loss or divergence. We denote the pathologically low attention entropy, corresponding to highly concentrated attention scores, as $\textit{entropy collapse}$. As a remedy, we propose $\sigma$Reparam, a simple and efficient solution where we reparametrize all linear layers with spectral normalization and an additional learned scalar. We demonstrate that $\sigma$Reparam successfully prevents entropy collapse in the attention layers, promoting more stable training. Additionally, we prove a tight lower bound of the attention entropy, which decreases exponentially fast with the spectral norm of the attention logits, providing additional motivation for our approach. We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks. We show that $\sigma$Reparam provides stability and robustness with respect to the choice of hyperparameters, going so far as enabling training (a) a Vision Transformer to competitive performance without warmup, weight decay, layer normalization or adaptive optimizers; (b) deep architectures in machine translation and (c) speech recognition to competitive performance without warmup and adaptive optimizers. Code is available at \url{https://github.com/apple/ml-sigma-reparam}.

Citations (47)

Summary

  • The paper identifies that attention entropy collapse causes unstable Transformer training with oscillating or diverging loss values.
  • The proposed σReparam method reparameterizes linear layers using spectral normalization and a learnable scalar to prevent entropy collapse.
  • Empirical results across vision, translation, and language tasks demonstrate that σReparam stabilizes training and simplifies model architectures.

Stabilizing Transformer Training by Preventing Attention Entropy Collapse

The paper "Stabilizing Transformer Training by Preventing Attention Entropy Collapse" provides an in-depth analysis of a prevalent issue in Transformer networks known as attention entropy collapse. The authors identify that during the training of Transformer models, particularly under unstable conditions, the entropy of attention scores collapses, leading to high training instability. This instability manifests as oscillating or diverging loss values, which can significantly impair model performance.

Summary of Key Findings

  1. Attention Entropy Collapse: The researchers track attention entropy over the course of Transformer training and establish that low entropy consistently accompanies training instability. This relationship holds across a range of architectures and tasks. They define this pathological state, in which attention scores become highly concentrated, as attention entropy collapse.
  2. Proposed Solution - σReparam: The authors propose σReparam, a simple reparameterization in which every linear layer's weight is rescaled by its spectral norm and an additional learnable scalar (a code sketch follows this list). σReparam is shown to effectively prevent entropy collapse, enhance training stability, and remain robust to changes in hyperparameters.
  3. Analytical Support: A theoretical analysis establishes a tight lower bound on attention entropy that decreases exponentially with the spectral norm of the attention logits. This result motivates the σReparam approach and highlights the need to control the spectral norms of the query/key projection matrices.
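The reparameterization in item 2 can be written as W_hat = (γ / σ(W)) · W, where σ(W) is the spectral norm of W and γ is a learned scalar. Below is a minimal, unofficial sketch of such a layer; the class name, the initialization of γ, and the single power-iteration step per forward pass are assumptions of this sketch rather than details taken from the paper, and the reference implementation lives at https://github.com/apple/ml-sigma-reparam.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SigmaReparamLinear(nn.Module):
    """Sketch of a sigma-reparametrized linear layer: W_hat = (gamma / sigma(W)) * W,
    with sigma(W) estimated by one power-iteration step per forward pass."""

    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None
        self.gamma = nn.Parameter(torch.ones(()))  # learnable scalar; initializing to 1 is an assumption
        # Power-iteration vector used to estimate the largest singular value of W.
        self.register_buffer("u", F.normalize(torch.randn(out_features), dim=0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One power-iteration step: v approximates the top right-singular vector,
        # u the top left-singular vector of the current weight matrix.
        with torch.no_grad():
            v = F.normalize(self.weight.t() @ self.u, dim=0)
            self.u.copy_(F.normalize(self.weight @ v, dim=0))
        # Spectral-norm estimate sigma = u^T W v; gradients flow through W only.
        sigma = torch.einsum("i,ij,j->", self.u, self.weight, v)
        w_hat = (self.gamma / sigma) * self.weight
        return F.linear(x, w_hat, self.bias)
```

Applied to every linear layer, including the query/key/value projections, this places the effective spectral norm of each weight matrix under the control of a single learned scalar, which in turn limits the spectral norm of the attention logits and, via the bound in item 3, keeps attention entropy from collapsing.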

Empirical Evaluation

The researchers conducted extensive experiments across multiple tasks: image classification, self-supervised learning, machine translation, speech recognition, and language modeling. The empirical results are noteworthy:

  • In vision tasks, σReparam allowed Vision Transformers (ViTs) to be trained without several traditional stabilizers, such as learning rate warmup, weight decay, layer normalization, and adaptive optimizers, while still achieving competitive performance (see the drop-in replacement sketch after this list).
  • For self-supervised learning, the method demonstrated substantial improvements in stability and robustness in SimCLR training.
  • In machine translation, σReparam enabled stable training of much deeper Transformer architectures than traditionally feasible, alleviating vanishing gradients and stability problems.
  • In speech recognition tasks, the method allowed adaptive optimizers to be dropped without compromising performance, a significant simplification of the training procedure.
  • For language modeling, models trained with σReparam achieved competitive performance without Layer Normalization, indicating potential simplifications in training procedures.
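To illustrate the drop-in character of the method in these experiments, the following hypothetical helper (not from the paper's codebase) reuses the SigmaReparamLinear sketch above and recursively swaps out the nn.Linear modules of an existing model:

```python
import torch
import torch.nn as nn

def swap_linears(module: nn.Module) -> nn.Module:
    """Recursively replace every nn.Linear with the SigmaReparamLinear sketch above,
    copying the existing weights and biases. Illustrative only."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            new = SigmaReparamLinear(child.in_features, child.out_features,
                                     bias=child.bias is not None)
            with torch.no_grad():
                new.weight.copy_(child.weight)
                if child.bias is not None:
                    new.bias.copy_(child.bias)
            setattr(module, name, new)
        else:
            swap_linears(child)
    return module
```

After such a substitution, the results above suggest that much of the usual stabilization machinery (learning rate warmup, adaptive optimizers, and in some settings layer normalization) can be removed while retaining competitive accuracy.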

Implications and Future Directions

The insights from this paper have broad implications for the optimization of Transformer models. By preventing entropy collapse, σReparam not only stabilizes training but also simplifies existing training recipes, making advanced Transformer architectures more accessible and easier to train. The findings suggest pathways for deeper theoretical exploration, particularly regarding the causal relationship between entropy collapse and training instability. Further research into combining σReparam with other state-of-the-art techniques could yield additional gains in Transformer-based applications.

σReparam's generality and effectiveness across various domains suggest its potential as a go-to tool for enhancing the stability of Transformer models. However, its interactions with other methods, such as specific initialization schemes or normalization techniques, warrant detailed exploration. As the AI community continues to push the boundaries of deep learning, contributions such as these pave the way for robust and efficient training of ever more complex models.
