On Layer Normalization in the Transformer Architecture (2002.04745v2)

Published 12 Feb 2020 in cs.LG, cs.CL, and stat.ML

Abstract: The Transformer is widely used in natural language processing tasks. To train a Transformer, however, one usually needs a carefully designed learning rate warm-up stage, which is shown to be crucial to the final performance but slows down optimization and brings more hyper-parameter tuning. In this paper, we first study theoretically why the learning rate warm-up stage is essential and show that the location of layer normalization matters. Specifically, we prove with mean field theory that at initialization, for the originally designed Post-LN Transformer, which places the layer normalization between the residual blocks, the expected gradients of the parameters near the output layer are large. Therefore, using a large learning rate on those gradients makes the training unstable. The warm-up stage is practically helpful for avoiding this problem. On the other hand, our theory also shows that if the layer normalization is put inside the residual blocks (recently proposed as the Pre-LN Transformer), the gradients are well-behaved at initialization. This motivates us to remove the warm-up stage for the training of Pre-LN Transformers. We show in our experiments that Pre-LN Transformers without the warm-up stage can reach comparable results with baselines while requiring significantly less training time and hyper-parameter tuning on a wide range of applications.

Authors (10)
  1. Ruibin Xiong (5 papers)
  2. Yunchang Yang (6 papers)
  3. Di He (108 papers)
  4. Kai Zheng (134 papers)
  5. Shuxin Zheng (32 papers)
  6. Chen Xing (31 papers)
  7. Huishuai Zhang (64 papers)
  8. Yanyan Lan (87 papers)
  9. Liwei Wang (239 papers)
  10. Tie-Yan Liu (242 papers)
Citations (822)

Summary

An Examination of Layer Normalization in the Transformer Architecture

The paper "On Layer Normalization in the Transformer Architecture" provides an in-depth analysis of how the placement of layer normalization (LN) impacts the training dynamics of Transformer models. The Transformer, a widely adopted architecture in NLP tasks, typically requires a carefully managed learning rate warm-up stage to stabilize training. This research explores two variants of the Transformer: the commonly used Post-LN Transformer, where layer normalization is applied outside the residual blocks, and the Pre-LN Transformer, where it is applied inside the residual blocks.

Theoretical Insights

The authors utilize mean field theory to demonstrate that in the Post-LN configuration, the gradients of parameters near the output layer are large at initialization. This gradient scaling necessitates a gradual learning rate warm-up to avoid instability during training. Conversely, in the Pre-LN Transformer, the gradients are well-behaved even at initialization, suggesting that the learning rate warm-up stage could potentially be eliminated.

According to the authors' analysis, the gradient norm for the last layer in the Post-LN Transformer is of the order $O(d \ln d)$, where $d$ is the dimensionality of the hidden representations. For the Pre-LN Transformer, the gradient norm decreases with the depth $L$, following an $O\left(d (\ln d) / \sqrt{L}\right)$ relationship. This implies that deeper Post-LN Transformers face more severe gradient issues compared to their Pre-LN counterparts.
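A quick way to see this effect numerically is to stack a few blocks, backpropagate a dummy loss at initialization, and compare the gradient norms of the parameters in the final block. The snippet below is illustrative only and assumes the hypothetical PostLNBlock and PreLNBlock classes from the earlier sketch; it does not reproduce the paper's exact setup (e.g., the cross-entropy loss and final projection layer).

```python
import torch

def last_block_grad_norm(block_cls, d_model=512, n_heads=8, depth=12, seed=0):
    """Total gradient norm of the final block's parameters after one backward pass."""
    torch.manual_seed(seed)
    blocks = torch.nn.ModuleList([block_cls(d_model, n_heads) for _ in range(depth)])
    x = torch.randn(8, 32, d_model)      # (batch, sequence length, hidden size)
    for blk in blocks:
        x = blk(x)
    x.pow(2).mean().backward()           # dummy objective, enough to populate gradients
    return sum(p.grad.norm() for p in blocks[-1].parameters()).item()

# The theory predicts the Post-LN value should come out noticeably larger at initialization.
print("Post-LN:", last_block_grad_norm(PostLNBlock))
print("Pre-LN :", last_block_grad_norm(PreLNBlock))
```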

Empirical Validation

Extensive experiments are conducted across multiple tasks, including IWSLT14 German-English translation, WMT14 English-German translation, and BERT pre-training tasks. For translation tasks, the Pre-LN Transformer without a warm-up stage achieves results comparable to the Post-LN Transformer with warm-up, significantly reducing training time and hyper-parameter tuning complexity.

On the IWSLT14 De-En task, the Pre-LN Transformer reaches a validation BLEU score of around 34 without any warm-up, whereas the Post-LN Transformer requires a warm-up stage to achieve similar performance. In the more computationally intensive WMT14 En-De task, the Pre-LN Transformer also matches the Post-LN Transformer's performance without necessitating extensive warm-up stages. Additionally, experiments with BERT show that the Pre-LN BERT converges faster than the Post-LN BERT on both pre-training and downstream tasks, corroborating the theoretical insights regarding gradient behavior.
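The practical difference between the two training recipes comes down to the learning-rate schedule. Below is an illustrative sketch (hyper-parameter values are placeholders, not the paper's exact settings): the Post-LN baseline uses the usual linear warm-up followed by inverse-square-root decay, while the Pre-LN setup starts at the full learning rate from the first step.

```python
import torch

def post_ln_schedule(step: int, warmup: int = 4000) -> float:
    """Linear warm-up over `warmup` steps, then inverse-square-root decay."""
    step = max(step, 1)
    return min(step / warmup, (warmup / step) ** 0.5)

def pre_ln_schedule(step: int, total_steps: int = 100_000) -> float:
    """No warm-up: full learning rate at step 0, followed by a simple linear decay."""
    return max(0.0, 1.0 - step / total_steps)

model = torch.nn.Linear(512, 512)        # stand-in for a full Transformer
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.98))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, pre_ln_schedule)
# In a training loop: optimizer.step() then scheduler.step() once per update.
```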

Implications and Future Directions

These findings have significant implications for the training of large-scale NLP models. Removing the learning rate warm-up stage simplifies the training process, reduces computational overhead, and enables faster convergence. This is particularly valuable for large models like BERT, potentially reducing the cost and time required for state-of-the-art model training.

In a broader context, this research adds to the understanding of optimization in deep learning models and the role of gradient scaling in model stability. Future work could explore alternative normalization strategies or the development of new initialization techniques to further enhance the training efficiency of Transformer models. Additionally, examining the interplay between gradient norms and other recently proposed architectures, such as sparse Transformers, could provide deeper insights into optimizing deep learning models.

Conclusion

This paper underscores the critical role of layer normalization placement in Transformer architectures and its effect on optimization dynamics. By demonstrating that the Pre-LN Transformer can be efficiently trained without a learning rate warm-up, this research offers a practical pathway to more efficient and cost-effective training of large-scale NLP models. Theoretical and empirical evidence presented supports the potential shift towards Pre-LN configurations, fostering a better understanding of gradient behavior in deep learning. Future investigations might further refine these approaches, continuing to enhance model training efficiency and stability.
