
NormFormer: Improved Transformer Pretraining with Extra Normalization (2110.09456v2)

Published 18 Oct 2021 in cs.CL and cs.AI

Abstract: During pretraining, the Pre-LayerNorm transformer suffers from a gradient magnitude mismatch: gradients at early layers are much larger than at later layers. These issues can be alleviated by our proposed NormFormer architecture, which adds three normalization operations to each layer: a Layer Norm after self attention, head-wise scaling of self-attention outputs, and a Layer Norm after the first fully connected layer. The extra operations incur negligible compute cost (+0.4% parameter increase), but improve pretraining perplexity and downstream task performance for both causal and masked language models ranging from 125 Million to 2.7 Billion parameters. For example, adding NormFormer on top of our strongest 1.3B parameter baseline can reach equal perplexity 24% faster, or converge 0.27 perplexity better in the same compute budget. This model reaches GPT3-Large (1.3B) zero shot performance 60% faster. For masked language modeling, NormFormer improves fine-tuned GLUE performance by 1.9% on average. Code to train NormFormer models is available in fairseq: https://github.com/pytorch/fairseq/tree/main/examples/normformer

Authors (3)
  1. Sam Shleifer (15 papers)
  2. Jason Weston (130 papers)
  3. Myle Ott (33 papers)
Citations (66)

Summary

Insights into NormFormer: Enhanced Transformer Pretraining via Additional Normalization

The paper "NormFormer: Improved Transformer Pretraining with Extra Normalization" investigates a critical issue in the Pre-LayerNorm (Pre-LN) transformer architecture: the gradient magnitude mismatch across layers during pretraining. It introduces an improved architecture, NormFormer, which integrates three additional normalization operations into each transformer layer to address this challenge.

Gradient Magnitude Mismatch in Transformers

Standard transformer architectures use either a Post-LN or a Pre-LN configuration. Post-LN produces larger gradients at later layers, while Pre-LN models exhibit the opposite issue, with early layers receiving disproportionately larger gradients. This gradient imbalance can destabilize training, particularly under mixed precision, and leads to suboptimal performance.
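To make the distinction concrete, the sketch below contrasts the two sublayer patterns in PyTorch. The class names are our own and the code is illustrative rather than taken from the paper; it only shows where the LayerNorm sits relative to the residual connection.

```python
import torch.nn as nn

class PostLNSublayer(nn.Module):
    """Post-LN (original Transformer): LayerNorm sits on the residual path."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)
        return self.ln(x + attn_out)  # every layer's gradient passes through this LN

class PreLNSublayer(nn.Module):
    """Pre-LN: LayerNorm is applied to the sublayer input; the residual path is identity."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln = nn.LayerNorm(d_model)

    def forward(self, x):
        h = self.ln(x)
        attn_out, _ = self.attn(h, h, h)
        return x + attn_out  # gradients flow unimpeded through the residual stream
```

With the identity residual path of Pre-LN, early layers accumulate the gradient contributions of every later layer, which is the imbalance NormFormer targets.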

Proposed NormFormer Modifications

The NormFormer architecture incorporates three strategic modifications:

  1. Layer Normalization Post Self-Attention (Post Attn LN): This aims to balance gradient magnitudes across different layers by normalizing outputs after the self-attention module.
  2. Head-Scale Attention: Introducing learned scalar coefficients, this modification adjusts the output magnitude of individual attention heads, refining their contribution to the final output.
  3. Layer Normalization Post First Fully Connected Layer (FFN LN): Additional normalization applied after the first feedforward layer helps to temper gradient magnitudes before they propagate through the network.

These changes introduce a negligible parameter increase (+0.4%), making them efficient in terms of computational cost.
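The PyTorch sketch below shows roughly where the three additions sit inside an otherwise standard Pre-LN layer. Class and attribute names (NormFormerLayer, post_attn_ln, ffn_ln_mid, head_scale) are illustrative, and the exact placement of each operation reflects our reading of the paper; the reference implementation is the fairseq example linked in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NormFormerLayer(nn.Module):
    """Illustrative Pre-LN decoder layer with the three NormFormer additions."""
    def __init__(self, d_model: int, n_heads: int, d_ffn: int):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads

        # standard Pre-LN components
        self.attn_ln = nn.LayerNorm(d_model)       # Pre-LN before attention
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.ffn_ln_pre = nn.LayerNorm(d_model)    # Pre-LN before FFN
        self.fc1 = nn.Linear(d_model, d_ffn)
        self.fc2 = nn.Linear(d_ffn, d_model)

        # NormFormer additions
        self.head_scale = nn.Parameter(torch.ones(n_heads))  # (2) head-wise scaling
        self.post_attn_ln = nn.LayerNorm(d_model)            # (1) LN after self-attention
        self.ffn_ln_mid = nn.LayerNorm(d_ffn)                # (3) LN after first FC

    def forward(self, x):
        B, T, C = x.shape

        # self-attention with per-head scaling
        h = self.attn_ln(x)
        q, k, v = self.qkv(h).chunk(3, dim=-1)
        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn * self.head_scale.view(1, self.n_heads, 1, 1)   # (2) Head-Scale
        attn = attn.transpose(1, 2).reshape(B, T, C)
        x = x + self.post_attn_ln(self.out_proj(attn))              # (1) Post Attn LN

        # feed-forward with the extra mid-FFN LayerNorm
        h = self.ffn_ln_pre(x)
        h = self.ffn_ln_mid(F.gelu(self.fc1(h)))                    # (3) FFN LN
        x = x + self.fc2(h)
        return x
```

The head-scale parameters are initialized to one, so at the start of training the layer behaves like a plain Pre-LN layer and the learned scalars only gradually reweight individual heads.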

Significant Results and Implications

The NormFormer model demonstrates substantial improvements in both pretraining perplexity and downstream performance. Key outcomes include:

  • A 24% reduction in time required to reach equivalent perplexity levels compared to the strongest 1.3B parameter baseline.
  • Zero-shot performance matching GPT-3 Large (1.3B) 60% faster.
  • An average 1.9% improvement in fine-tuned GLUE performance for masked language models.

These results underscore the efficacy of the additional normalization layers in mitigating the gradient mismatch while enhancing model efficiency and stability.

Analysis of Gradient Norms

The paper provides a thorough analysis of gradient norms, showing a marked reduction in gradient discrepancies between layers with the NormFormer architecture. The introduction of normalization and scaling operations effectively mitigates the instability caused by these mismatches, enabling the use of larger learning rates and promoting faster convergence.
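As an illustration of how such an analysis might be run in practice, the helper below sums per-layer gradient L2 norms after a backward pass. The parameter-name pattern (layers.<idx>) is an assumption borrowed from fairseq-style naming, and this is not the paper's analysis code.

```python
import torch

def per_layer_grad_norms(model: torch.nn.Module) -> dict:
    """Return the L2 norm of accumulated gradients for each transformer layer.

    Assumes parameters are named like 'layers.<idx>.<...>'; adjust the parsing
    for other naming schemes.
    """
    squared = {}
    for name, param in model.named_parameters():
        if param.grad is None or "layers." not in name:
            continue
        layer_idx = int(name.split("layers.")[1].split(".")[0])
        squared[layer_idx] = squared.get(layer_idx, 0.0) + param.grad.detach().norm(2).item() ** 2
    return {idx: total ** 0.5 for idx, total in sorted(squared.items())}

# Usage: call after loss.backward(). In a plain Pre-LN model the early layers
# typically show much larger norms than the late ones; NormFormer's extra
# normalization is intended to flatten that profile.
```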

Future Directions

This research opens several avenues for future exploration. Potential developments could include:

  • Further optimization of the balancing scale between normalization operations and their impact at different layers.
  • Adaptation of NormFormer strategies to other transformer-based architectures or tasks beyond language modeling.
  • Exploration of the integration of NormFormer with alternative initialization methods or training strategies to further leverage its stabilization benefits.

The NormFormer architecture presents a compelling advancement for transformer model pretraining, with its thoughtful integration of additional normalization processes offering meaningful improvements in training efficiency and performance.
