Understanding and Improving Layer Normalization (1911.07013v1)

Published 16 Nov 2019 in cs.LG, cs.CL, and stat.ML

Abstract: Layer normalization (LayerNorm) is a technique to normalize the distributions of intermediate layers. It enables smoother gradients, faster training, and better generalization accuracy. However, it is still unclear where its effectiveness stems from. In this paper, our main contribution is to take a step further in understanding LayerNorm. Many previous studies believe that the success of LayerNorm comes from forward normalization. Unlike them, we find that the derivatives of the mean and variance are more important than forward normalization by re-centering and re-scaling backward gradients. Furthermore, we find that the parameters of LayerNorm, including the bias and gain, increase the risk of over-fitting and do not work in most cases. Experiments show that a simple version of LayerNorm (LayerNorm-simple) without the bias and gain outperforms LayerNorm on four datasets. It obtains state-of-the-art performance on En-Vi machine translation. To address the over-fitting problem, we propose a new normalization method, Adaptive Normalization (AdaNorm), by replacing the bias and gain with a new transformation function. Experiments show that AdaNorm demonstrates better results than LayerNorm on seven out of eight datasets.

Authors (5)
  1. Jingjing Xu (80 papers)
  2. Xu Sun (194 papers)
  3. Zhiyuan Zhang (129 papers)
  4. Guangxiang Zhao (17 papers)
  5. Junyang Lin (99 papers)
Citations (282)

Summary

Insights on "Understanding and Improving Layer Normalization"

This paper provides a thorough analysis of Layer Normalization (LayerNorm), a widely used technique for training neural networks, with the aim of understanding its underlying mechanics and proposing improvements. The paper argues that, while forward normalization has traditionally been credited with LayerNorm's success, the derivatives of the mean and variance play the more critical role by re-centering and re-scaling the backward gradients.
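
For reference, LayerNorm normalizes each input vector by its own mean and variance and then applies a learned gain and bias. Below is a minimal PyTorch sketch of this standard formulation (class and variable names are illustrative, not taken from the paper's code):

```python
import torch
import torch.nn as nn

class VanillaLayerNorm(nn.Module):
    """Standard LayerNorm: y = gain * (x - mean) / sqrt(var + eps) + bias, per example."""
    def __init__(self, hidden_size, eps=1e-5):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(hidden_size))   # the "gain" parameter
        self.bias = nn.Parameter(torch.zeros(hidden_size))  # the "bias" parameter
        self.eps = eps

    def forward(self, x):
        mean = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        normalized = (x - mean) / torch.sqrt(var + self.eps)
        return self.gain * normalized + self.bias
```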

Key Findings

  • Backward Gradient Normalization: The paper challenges the common belief that forward normalization is primarily responsible for the effectiveness of LayerNorm. Instead, it argues that the true advantage arises from the derivatives of the mean and variance, which re-center and re-scale the backward gradients. This insight is substantiated with "DetachNorm", a variant that detaches these derivatives from the backward pass and consequently performs worse than standard LayerNorm.
  • Bias and Gain Parameters: The research highlights that the bias and gain parameters, typically used in LayerNorm to increase model expressiveness, can contribute to overfitting and do not consistently enhance performance. Removing these parameters yields "LayerNorm-simple", which the authors show improves results on several datasets, including state-of-the-art performance on English-Vietnamese machine translation.
  • Adaptive Normalization (AdaNorm): To address the overfitting associated with bias and gain, the authors introduce Adaptive Normalization (AdaNorm). AdaNorm replaces bias and gain with a new transformation that computes scaling weights from the input itself, leading to superior performance relative to LayerNorm on seven of the eight datasets examined. (All three variants are sketched in code after this list.)
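
The sketch below makes the three variants concrete, again in PyTorch. DetachNorm keeps forward normalization but stops gradients from flowing through the mean and standard deviation; LayerNorm-simple drops the bias and gain entirely; the AdaNorm branch assumes the C(1 - k*y) scaling described in the original paper, with the scaling factor excluded from the backward pass. Class names, default hyperparameter values, and the omission of bias/gain in DetachNorm are simplifications for illustration:

```python
import torch
import torch.nn as nn

def normalize(x, eps=1e-5, detach_stats=False):
    """Center and scale each vector by its own statistics; optionally detach them."""
    mean = x.mean(dim=-1, keepdim=True)
    std = torch.sqrt(x.var(dim=-1, keepdim=True, unbiased=False) + eps)
    if detach_stats:
        # DetachNorm: the statistics still shift and scale the forward pass,
        # but contribute no derivatives during backpropagation.
        mean, std = mean.detach(), std.detach()
    return (x - mean) / std

class LayerNormSimple(nn.Module):
    """LayerNorm without the learnable bias and gain."""
    def forward(self, x):
        return normalize(x)

class DetachNorm(nn.Module):
    """Diagnostic variant: forward normalization only, no mean/variance gradients."""
    def forward(self, x):
        return normalize(x, detach_stats=True)

class AdaNorm(nn.Module):
    """Assumed form: the normalized input y is rescaled by C * (1 - k * y),
    treated as a constant in the backward pass; C and k are hyperparameters
    (the defaults here are illustrative, not the paper's tuned values)."""
    def __init__(self, C=1.0, k=0.1):
        super().__init__()
        self.C, self.k = C, k

    def forward(self, x):
        y = normalize(x)
        scale = (self.C * (1.0 - self.k * y)).detach()  # input-dependent weights, no gradient through them
        return scale * y
```

A quick check such as AdaNorm()(torch.randn(2, 5, 16)) returns a tensor with the same shape as its input; swapping one class for another changes only which parameters exist and which gradients flow.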

Experimental Approach and Results

The paper rigorously tests its hypotheses across a variety of tasks, from machine translation to image classification, using standard Transformer and Transformer-XL models as well as other neural network configurations. Notably, LayerNorm-simple matched or outperformed standard LayerNorm on several tasks, and AdaNorm was particularly effective, indicating that adapting the scaling to each input improves model robustness and generalization.

Implications and Future Directions

The paper's findings prompt a reevaluation of LayerNorm's design, emphasizing the importance of backward gradient management over forward normalization. By challenging conventional wisdom around model parameters like bias and gain, it opens avenues for exploring alternative normalization methods that preserve gradient stability.

AdaNorm's success suggests a fertile ground for further research into dynamic adjustment mechanisms in model training, potentially extending beyond normalization layers. Exploring diverse forms of adaptive transformations could encourage more generalized and efficient learning processes.

In summary, this paper contributes significantly to the understanding and improvement of normalization techniques in deep learning. By shifting focus from simple normalization of forward layer inputs to the nuanced effects of gradient normalization, it provides a clearer path toward optimizing neural network training paradigms.