- The paper identifies that Transformer instability mainly stems from residual branch amplification rather than gradient vanishing.
- The paper compares Pre-LN and Post-LN architectures, showing that while Pre-LN offers stability, it may compromise ultimate model performance.
- The paper proposes Admin initialization as a practical method to stabilize training and achieve superior performance on benchmark datasets.
Understanding the Difficulty of Training Transformers
The paper, "Understanding the Difficulty of Training Transformers," investigates why Transformer models, a cornerstone architecture in NLP, are hard to train. Despite their success in numerous applications, training these models presents significant challenges, particularly a high sensitivity to the choice of optimizer and learning-rate schedule. This research provides a comprehensive analysis of the factors complicating Transformer training from both empirical and theoretical perspectives.
Key Insights
- Training Instability and Gradients: The paper reveals that unbalanced gradients often complicate Transformer training. Contrary to common assumptions, the authors find that gradient vanishing is not the root cause of training instability; the deeper issue lies in how heavily each layer depends on its residual branch.
- Amplification Effect: A critical insight is the amplification effect, in which heavy dependency on residual branches amplifies minor parameter perturbations into significant disturbances in the model output. This dependency can destabilize training, particularly under the original Post-Layer Normalization (Post-LN) Transformer architecture.
- Comparison of Architectures: Comparing Pre-LN and Post-LN Transformer architectures, the research shows that while Pre-LN is more stable, it limits the model's potential: when Post-LN training does not diverge, it typically outperforms Pre-LN.
- Admin Initialization: The authors propose Admin (Adaptive model initialization) as a stabilization technique. By controlling each layer's dependency on its residual branch early in training, Admin enables more stable optimization, faster convergence, and better final performance.
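The amplification effect and Admin's remedy can be illustrated with a toy numpy sketch. This is an illustrative approximation, not the paper's exact profiling procedure: the per-layer shortcut weights `omega` below are hypothetical stand-ins for Admin's variance-based rescaling, chosen only to show the qualitative mechanism of reducing each layer's dependency on its residual branch.

```python
import numpy as np

rng = np.random.default_rng(0)

def layer_norm(x):
    # Normalize a vector to zero mean and unit variance.
    return (x - x.mean()) / (x.std() + 1e-6)

def post_ln_stack(x, weights, omega=None):
    # Post-LN residual stack: x <- LN(omega_i * x + W_i @ x).
    # A single linear map W_i stands in for each residual branch
    # (attention or feed-forward). omega=None corresponds to the
    # original Post-LN Transformer (shortcut weight fixed at 1);
    # strengthening the shortcut makes each layer depend less on
    # its residual branch.
    for i, W in enumerate(weights):
        scale = 1.0 if omega is None else omega[i]
        x = layer_norm(scale * x + W @ x)
    return x

d, n_layers, eps = 64, 12, 1e-2
weights = [rng.normal(0.0, 1.0 / np.sqrt(d), (d, d)) for _ in range(n_layers)]
x0 = rng.normal(size=d)

# Amplification effect: apply the same small perturbation to every
# weight matrix and measure how far the stack's output moves.
perturbed = [W + eps * rng.normal(size=W.shape) for W in weights]
shift_plain = np.linalg.norm(
    post_ln_stack(x0, weights) - post_ln_stack(x0, perturbed))

# Hypothetical Admin-like shortcut weights, growing with depth so the
# residual branch's relative contribution shrinks layer by layer.
omega = [np.sqrt(i + 1.0) for i in range(n_layers)]
shift_admin = np.linalg.norm(
    post_ln_stack(x0, weights, omega) - post_ln_stack(x0, perturbed, omega))

print(f"output shift, plain Post-LN: {shift_plain:.4f}")
print(f"output shift, rescaled shortcut: {shift_admin:.4f}")
```

With the shortcut strengthened, the same weight perturbation moves the output far less, which is the qualitative behavior that Admin's initialization targets at the start of training.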
Experimental Results
Experiments on IWSLT'14 De-En, WMT'14 En-De, and WMT'14 En-Fr demonstrate Admin's efficacy. Admin stabilizes the training of a 72-layer model on WMT'14 En-Fr, reaching a BLEU score of 43.80, and outperforms existing methods in both stability and final performance.
Implications and Future Directions
This research has significant implications for the training of deep neural networks, particularly Transformers. By addressing instability without introducing additional hyperparameters, Admin opens avenues for training deeper and more robust models. Future work may extend Admin's principles to other architectures and develop automated strategies for model adaptation across varied training configurations.
The paper's contributions expand the understanding of Transformer training dynamics and provide practical solutions to improve training stability. Such advancements are pivotal as models continue to scale in depth and complexity, aligning with the ever-growing demand for more capable neural architectures in AI applications.