Transformers without Tears: Improving the Normalization of Self-Attention (1910.05895v2)

Published 14 Oct 2019 in cs.CL, cs.LG, and stat.ML

Abstract: We evaluate three simple, normalization-centric changes to improve Transformer training. First, we show that pre-norm residual connections (PreNorm) and smaller initializations enable warmup-free, validation-based training with large learning rates. Second, we propose $\ell_2$ normalization with a single scale parameter (ScaleNorm) for faster training and better performance. Finally, we reaffirm the effectiveness of normalizing word embeddings to a fixed length (FixNorm). On five low-resource translation pairs from TED Talks-based corpora, these changes always converge, giving an average +1.1 BLEU over state-of-the-art bilingual baselines and a new 32.8 BLEU on IWSLT'15 English-Vietnamese. We observe sharper performance curves, more consistent gradient norms, and a linear relationship between activation scaling and decoder depth. Surprisingly, in the high-resource setting (WMT'14 English-German), ScaleNorm and FixNorm remain competitive but PreNorm degrades performance.

Citations (206)

Summary

  • The paper demonstrates that PreNorm enables warmup-free training and large learning rates, significantly improving performance in low-resource settings.
  • The paper shows that replacing LayerNorm with ScaleNorm reduces computational overhead by normalizing activations to a single learned length without performance loss.
  • The paper confirms that FixNorm for word embeddings stabilizes training and leads to faster convergence, resulting in improved BLEU scores.

Insights on "Transformers without Tears: Improving the Normalization of Self-Attention"

The paper "Transformers without Tears: Improving the Normalization of Self-Attention" presents a focused examination and enhancement of Transformer models within the domain of neural machine translation (NMT) through targeted improvements in the normalization procedures used during their training. This analysis evaluates three primary changes—each pivoting around normalization techniques—and demonstrates their efficacy in both low-resource translation settings and, to some extent, high-resource settings.

Key Contributions

The paper articulates three main alterations to standard Transformer model training:

  1. Pre-Norm Residual Connections (PreNorm): This change applies layer normalization before each sublayer inside the residual connection, i.e. x + sublayer(norm(x)), rather than after the residual addition as in the conventional PostNorm, norm(x + sublayer(x)). PreNorm is shown to enable warmup-free, validation-based training with large learning rates and without requiring a large batch size, mitigating the instability previously observed in PostNorm models, particularly in low-resource settings.
  2. $\ell_2$ Normalization with ScaleNorm: The authors propose ScaleNorm as a simpler, faster replacement for LayerNorm that normalizes each activation vector to a single learned length: $\text{ScaleNorm}(x) = g \cdot x / \lVert x \rVert$. This reduces the computational and memory overhead of LayerNorm by requiring only one learned scalar rather than a gain and bias per dimension, improving training speed without performance loss in either low-resource or high-resource settings.
  3. FixNorm for Word Embeddings: Normalizing word embeddings to a fixed length is reaffirmed as an effective strategy that aids training stability and convergence. (A code sketch of all three changes follows this list.)
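
The following PyTorch sketch illustrates how these three changes could be wired together. The names ScaleNorm, PreNormSublayer, and fix_norm are illustrative rather than taken from the authors' released code, and the sqrt(d) initialization of the scale follows the paper's description of ScaleNorm.

```python
# Minimal sketch of PreNorm, ScaleNorm, and FixNorm; names are illustrative,
# not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ScaleNorm(nn.Module):
    """ScaleNorm(x) = g * x / ||x||_2, with a single learned scalar g
    (versus LayerNorm's per-dimension gain and bias)."""

    def __init__(self, init_scale: float, eps: float = 1e-5):
        super().__init__()
        self.g = nn.Parameter(torch.tensor(init_scale))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.g * x / (x.norm(dim=-1, keepdim=True) + self.eps)


class PreNormSublayer(nn.Module):
    """PreNorm residual connection: x + sublayer(norm(x)), in contrast to
    PostNorm's norm(x + sublayer(x))."""

    def __init__(self, d_model: int, sublayer: nn.Module, dropout: float = 0.1):
        super().__init__()
        # Scale initialized to sqrt(d_model); nn.LayerNorm(d_model) could be
        # substituted here for a PreNorm + LayerNorm variant.
        self.norm = ScaleNorm(d_model ** 0.5)
        self.sublayer = sublayer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.dropout(self.sublayer(self.norm(x)))


def fix_norm(embedding: nn.Embedding, length: float = 1.0) -> torch.Tensor:
    """FixNorm: constrain every word embedding to a fixed l2 length."""
    return length * F.normalize(embedding.weight, dim=-1)
```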

Performance Improvements

In empirical evaluations, these modifications together improve the average BLEU score by 1.1 points over state-of-the-art bilingual baselines on five low-resource translation pairs. On IWSLT'15 English-Vietnamese, the approach reaches a new state of the art of 32.8 BLEU. The authors also observe sharper performance curves and more consistent gradient norms.

However, while PreNorm excels in low-resource contexts, it degrades performance in the high-resource setting of WMT'14 English-German relative to PostNorm. ScaleNorm and FixNorm, on the other hand, remain competitive in both regimes.

Theoretical and Practical Implications

The introduction of ScaleNorm points toward more efficient normalization methods that could become preferable to traditional techniques like LayerNorm in standard architectures, given its reduced complexity and equivalent effectiveness. The PreNorm results highlight the nuances of Transformer design, suggesting that normalization strategies be chosen with resource availability and model depth in mind. This could influence how future Transformer architectures are designed, especially as deeper and more resource-efficient models become desirable across applications.

Future Directions

Going forward, these insights could drive further investigation into hybrid normalizations and optimizers such as RAdam, which may enable Transformer training without intricate learning-rate schedules. Exploring techniques like Fixup initialization in conjunction with these normalization strategies could also make it possible to train deep networks without normalization at all, offering a pathway to simpler yet robust architectures.

Overall, the paper contributes a careful evaluation of normalization methods in both technical and practical terms, helping refine the training of models across resource settings and prompting further exploration of normalization as a fundamental aspect of neural architecture design.