
Understanding the Difficulty of Training Transformers (2004.08249v3)

Published 17 Apr 2020 in cs.LG, cs.CL, and stat.ML

Abstract: Transformers have proved effective in many NLP tasks. However, their training requires non-trivial efforts regarding designing cutting-edge optimizers and learning rate schedulers carefully (e.g., conventional SGD fails to train Transformers effectively). Our objective here is to understand $\textit{what complicates Transformer training}$ from both empirical and theoretical perspectives. Our analysis reveals that unbalanced gradients are not the root cause of the instability of training. Instead, we identify an amplification effect that influences training substantially -- for each layer in a multi-layer Transformer model, heavy dependency on its residual branch makes training unstable, since it amplifies small parameter perturbations (e.g., parameter updates) and results in significant disturbances in the model output. Yet we observe that a light dependency limits the model potential and leads to inferior trained models. Inspired by our analysis, we propose Admin ($\textbf{Ad}$aptive $\textbf{m}$odel $\textbf{in}$itialization) to stabilize the early stage's training and unleash its full potential in the late stage. Extensive experiments show that Admin is more stable, converges faster, and leads to better performance. Implementations are released at: https://github.com/LiyuanLucasLiu/Transforemr-Clinic.

Authors (5)
  1. Liyuan Liu (49 papers)
  2. Xiaodong Liu (162 papers)
  3. Jianfeng Gao (344 papers)
  4. Weizhu Chen (128 papers)
  5. Jiawei Han (263 papers)
Citations (218)

Summary

Understanding the Difficulty of Training Transformers

The paper, "Understanding the Difficulty of Training Transformers," investigates the intricacies involved in training Transformer models, a cornerstone in NLP. Despite their success in numerous applications, training these models presents significant challenges, particularly in the optimizer and learning rate scheduler design. This research provides a comprehensive analysis of the factors complicating Transformer training from both empirical and theoretical perspectives.

Key Insights

  1. Training Instability and Gradients: Contrary to common assumptions, the paper shows that unbalanced gradients (e.g., gradient vanishing) are not the root cause of instability in Transformer training. The more fundamental issue lies in how heavily each layer depends on its residual branch.
  2. Amplification Effect: A critical insight is the amplification effect where heavy dependency on residual branches amplifies minor parameter perturbations, causing significant disturbances in model output. This dependency can lead to unstable training environments, particularly when using the original Post-Layer Normalization (Post-LN) architecture of Transformers.
  3. Comparison of Architectures: Comparing Pre-LN and Post-LN Transformer architectures, the research shows that while Pre-LN trains more stably, its lighter dependency on residual branches limits the model's potential; Post-LN, when training does not diverge, typically achieves better final performance.
  4. Admin Initialization: The authors propose Admin (Adaptive model initialization) as a stabilization technique. By controlling the layer dependency early in the training phase, Admin facilitates more stable training, faster convergence, and enhanced performance in later stages.

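To make the architectural contrast concrete, here is a minimal toy sketch (not the paper's implementation; the sublayer function `f` and vector-of-floats representation are illustrative) of how Post-LN and Pre-LN wire layer normalization around a residual branch:

```python
# Toy sketch of Post-LN vs. Pre-LN residual wiring for one sublayer.
# Inputs are plain lists of floats; f is the sublayer (e.g., attention or FFN).

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]

def post_ln_sublayer(x, f):
    # Original Transformer: add the residual branch, THEN normalize.
    # The skip connection passes through layer norm, so the output
    # depends heavily on the branch -- the source of the amplification effect.
    return layer_norm([xi + bi for xi, bi in zip(x, f(x))])

def pre_ln_sublayer(x, f):
    # Pre-LN variant: normalize the branch INPUT; the skip connection
    # bypasses normalization, which stabilizes training but dampens
    # the branch's contribution.
    return [xi + bi for xi, bi in zip(x, f(layer_norm(x)))]
```

The key difference is whether the identity path flows through the normalization (Post-LN) or around it (Pre-LN), which governs how strongly each layer's output depends on its residual branch.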
Experimental Results

The experiments conducted on the IWSLT'14 De-En, WMT'14 En-De, and WMT'14 En-Fr translation benchmarks demonstrate Admin's efficacy. Admin stabilizes training of a 72-layer model on WMT'14 En-Fr, achieving a BLEU score of 43.80, and matches or exceeds existing methods in both stability and final performance.
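The Admin idea can be sketched as follows. This is a hedged, simplified illustration (a scalar `omega` stands in for the per-dimension rescaling weight the method actually learns, and `layer_norm` is a toy re-implementation): the residual input is rescaled by `omega` before the Post-LN addition, so a suitable initialization of `omega` controls how much the output depends on the branch early in training.

```python
# Illustrative sketch of an Admin-style sublayer (scalar omega for clarity;
# the actual method uses a learnable per-dimension weight initialized from
# the variance of earlier layers' outputs).

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / (var + eps) ** 0.5 for v in x]

def admin_sublayer(x, f, omega):
    # Output: LayerNorm(omega * x + f(x)). A larger omega at initialization
    # reduces the layer's dependency on its residual branch f(x), stabilizing
    # early training; omega is then updated like any other parameter.
    branch = f(x)
    return layer_norm([omega * xi + bi for xi, bi in zip(x, branch)])
```

With `omega = 1` this reduces to the standard Post-LN sublayer, which is why Admin adds no extra hyper-parameters and can recover the full Post-LN capacity later in training.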

Implications and Future Directions

This research has significant implications for the training of deep neural networks, particularly Transformers. By effectively addressing the instability issues without introducing additional hyper-parameters, Admin opens avenues for training deeper and more robust models. Future work may explore extending Admin's principles to other architectures and developing automated strategies for model adaptation across varied training configurations.

The paper's contributions expand the understanding of Transformer training dynamics and provide practical solutions to improve training stability. Such advancements are pivotal as models continue to scale in depth and complexity, aligning with the ever-growing demand for more capable neural architectures in AI applications.
