PowerNorm: Rethinking Batch Normalization in Transformers (2003.07845v2)

Published 17 Mar 2020 in cs.CL and cs.LG

Abstract: The standard normalization method for neural network (NN) models used in NLP is layer normalization (LN). This is different than batch normalization (BN), which is widely-adopted in Computer Vision. The preferred use of LN in NLP is principally due to the empirical observation that a (naive/vanilla) use of BN leads to significant performance degradation for NLP tasks; however, a thorough understanding of the underlying reasons for this is not always evident. In this paper, we perform a systematic study of NLP transformer models to understand why BN has a poor performance, as compared to LN. We find that the statistics of NLP data across the batch dimension exhibit large fluctuations throughout training. This results in instability, if BN is naively implemented. To address this, we propose Power Normalization (PN), a novel normalization scheme that resolves this issue by (i) relaxing zero-mean normalization in BN, (ii) incorporating a running quadratic mean instead of per batch statistics to stabilize fluctuations, and (iii) using an approximate backpropagation for incorporating the running statistics in the forward pass. We show theoretically, under mild assumptions, that PN leads to a smaller Lipschitz constant for the loss, compared with BN. Furthermore, we prove that the approximate backpropagation scheme leads to bounded gradients. We extensively test PN for transformers on a range of NLP tasks, and we show that it significantly outperforms both LN and BN. In particular, PN outperforms LN by 0.4/0.6 BLEU on IWSLT14/WMT14 and 5.6/3.0 PPL on PTB/WikiText-103. We make our code publicly available at \url{https://github.com/sIncerass/powernorm}.

Citations (16)

Summary

  • The paper introduces Power Normalization (PN) as an innovative alternative to conventional batch normalization in transformer models.
  • It replaces per-batch variance with a quadratic mean and relaxes the zero-mean requirement, stabilizing training for NLP tasks.
  • Experimental results show PN improves BLEU scores and reduces perplexity, demonstrating its practical benefits in machine translation and language modeling.

PowerNorm: Rethinking Batch Normalization in Transformers

The paper "PowerNorm: Rethinking Batch Normalization in Transformers" by Sheng Shen et al. addresses a key limitation in the application of batch normalization (BN) for transformers in NLP. In contrast to computer vision (CV), where BN is widely used, NLP has traditionally adopted layer normalization (LN) due to the performance degradation observed when using BN. This paper provides both an in-depth analysis of this issue and a proposed innovation called Power Normalization (PN), which aims to enhance transformer models without the shortcomings of BN.

Overview of Findings

The authors identify large fluctuations in the batch statistics of NLP data during training as a primary cause of BN's ineffectiveness in this domain. These fluctuations lead to instability when BN is applied naively. To address this, the authors introduce Power Normalization (PN) with three key modifications (a code sketch follows the list):

  1. Relaxation of Zero-Mean Requirement: PN allows for a relaxation of the zero-mean normalization requirement found in traditional BN.
  2. Quadratic Mean Replacement: It proposes the use of a quadratic mean instead of per-batch variance statistics to stabilize these fluctuations.
  3. Approximate Backpropagation: PN incorporates an approximate backpropagation technique to handle running statistics in the forward pass effectively.
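
As a rough illustration of these three modifications, the minimal PyTorch-style sketch below normalizes activations by a running quadratic mean with no mean subtraction. It is not the authors' reference implementation (that is available at the repository linked in the abstract); in particular, the exponential moving average with a momentum parameter and the detached treatment of the running statistic are simplifying assumptions standing in for the paper's exact update rule and approximate-backpropagation scheme.

```python
import torch
import torch.nn as nn


class PowerNormSketch(nn.Module):
    """Hedged sketch of the Power Normalization idea, not the reference code.

    Illustrates the three modifications discussed above:
      (i)   no mean subtraction (zero-mean normalization is relaxed),
      (ii)  a quadratic mean E[x^2] replaces the per-batch variance,
      (iii) a running statistic is used in the forward pass.
    """

    def __init__(self, d_model: int, momentum: float = 0.99, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))
        self.beta = nn.Parameter(torch.zeros(d_model))
        # running estimate of the per-feature quadratic mean E[x^2]
        self.register_buffer("running_quad_mean", torch.ones(d_model))
        self.momentum = momentum  # assumed EMA momentum, not the paper's exact rule
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); statistics are shared across the batch
        # and sequence dimensions, as when BN is applied to transformers.
        if self.training:
            quad_mean = x.pow(2).mean(dim=(0, 1))  # current-batch E[x^2] per feature
            with torch.no_grad():
                # (ii) update the running quadratic mean instead of a per-batch variance
                self.running_quad_mean.lerp_(quad_mean, 1.0 - self.momentum)
        # (iii) the running statistic is used in the forward pass; treating it as a
        # constant here is a crude stand-in for the paper's approximate backprop.
        denom = torch.sqrt(self.running_quad_mean + self.eps)
        # (i) scale only, no mean subtraction
        return self.gamma * x / denom + self.beta
```

In a transformer, such a module would occupy the slots where LayerNorm normally sits (after the attention and feed-forward sublayers); the paper's actual layer additionally handles warmup and a corrected backward pass, so the sketch above should be read as a conceptual outline only.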

The paper makes strong theoretical contributions by showing that, under certain assumptions, PN results in a smaller Lipschitz constant for the loss than BN. Additionally, it proves that the approximate backpropagation scheme leads to bounded gradients, which is critical for the convergence of learning algorithms.
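
For readers less familiar with the terminology, the Lipschitz constant invoked here is the standard one; the inequality below is the generic definition rather than the paper's specific bound or assumptions.

```latex
% Generic definition: L is a Lipschitz constant of the loss \mathcal{L}
% if the inequality holds for all admissible inputs x, y.
% The paper's precise statement and its "mild assumptions" are in the original text.
\lvert \mathcal{L}(x) - \mathcal{L}(y) \rvert \;\le\; L \,\lVert x - y \rVert
```

A smaller L means the loss can change only gradually as the inputs change, which is the sense in which PN is argued to induce a smoother optimization landscape than BN.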

Experimental Results

Extensive testing across a range of NLP tasks highlights the efficacy of PN:

  • Machine Translation: PN outperforms LN by 0.4 BLEU on IWSLT14 and 0.6 BLEU on WMT14 benchmarks.
  • Language Modeling: On PTB and WikiText-103, PN achieves perplexity lower than LN by 5.6 and 3.0 points, respectively.

These improvements are achieved without altering hyperparameters, indicating that PN provides robust performance gains in transformer architectures.

Implications and Future Directions

Practically, adopting PN could lead to more efficient and accurate NLP models by resolving the instability caused by statistical variability across batches. The method may also stabilize training and improve generalization, particularly in tasks that are sensitive to batch statistics, such as language modeling.

Theoretically, this work opens up new directions in understanding normalization techniques, suggesting that other domains might benefit from similar approaches. Future research could explore extensions of Power Normalization to address other neural network architectures or investigate the integration of PN with pre-trained models, potentially leading to improvements in transfer learning scenarios.

In summary, this paper presents a nuanced understanding of the limitations of BN in NLP transformers and offers a promising alternative through Power Normalization. By blending theoretical insights with empirical validation, it pushes the boundaries of neural network normalization techniques in a manner that could considerably enhance NLP model robustness and performance.
