Pre-LayerNorm Residual Blocks in Transformers

Updated 20 May 2026
  • Pre-LayerNorm residual blocks are transformer components that apply normalization before sublayer operations to enhance training stability.
  • They improve gradient flow by mitigating vanishing/exploding gradients, ensuring robust learning across deep architectures.
  • Variants like Pre-RMSNorm and Pre-CRMSNorm provide efficiency gains while maintaining the functional and optimization properties of standard Pre-LN blocks.

Pre-LayerNorm (Pre-LN) residual blocks are a foundational architectural choice in transformer models, designed to stabilize training and enhance optimization by positioning normalization upstream of sublayer computations. This design contrasts with Post-LayerNorm placement and has become the prevailing standard in contemporary LLMs and vision transformers due to its effects on gradient flow, optimization stability, and empirical efficiency. Several rigorous lines of research have analyzed the mathematical structure, functional properties, equivalence with normalization variants, and practical implications of the Pre-LN scheme (Jiang et al., 2023, Singhal et al., 13 Nov 2025).

1. Mathematical Structure of Pre-LayerNorm Residual Blocks

In a transformer employing the Pre-LayerNorm paradigm, each block processes an input $x_\ell\in\mathbb{R}^d$ as follows:

$$\tilde x_\ell = \mathrm{LayerNorm}(x_\ell)$$

$$y_\ell = \mathcal{S}_\ell(\tilde x_\ell), \quad (\mathcal{S}_\ell \text{ often } \mathrm{MLP}_\ell \text{ or } \mathrm{Attention}_\ell)$$

$$x_{\ell+1} = x_\ell + y_\ell$$

Here, the LayerNorm operation is defined as:

$$\mathrm{LayerNorm}(x) = \frac{x - \mu(x)\mathbf{1}}{\sqrt{\frac{1}{d}\|x\|_2^2 - \mu(x)^2 + \epsilon}}, \quad \mu(x) = \frac{1}{d}\mathbf{1}^T x, \quad \epsilon>0$$

After passage through $L$ such blocks, a final LayerNorm is typically applied:

$$\hat x = \mathrm{LayerNorm}(x_L)$$

In standard implementations, Pre-LN transformers decompose each block further, applying LayerNorm before both the Multi-Head Self-Attention (MHSA) and Feed-Forward Network (FFN) sublayers:

$$\begin{aligned} u_\ell^{(1)} &= \mathrm{LN}_1(x_\ell) \\ m_\ell &= \mathrm{MHSA}(u_\ell^{(1)}) \\ y_\ell &= x_\ell + m_\ell \\ u_\ell^{(2)} &= \mathrm{LN}_2(y_\ell) \\ f_\ell &= \mathrm{FFN}(u_\ell^{(2)}) \\ x_{\ell+1} &= y_\ell + f_\ell \end{aligned}$$

This dual placement enhances optimization stability and modulates both learning and memorization (Singhal et al., 13 Nov 2025).
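The two-sublayer block above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: `toy_mhsa` and `toy_ffn` are hypothetical stand-ins for real attention and feed-forward sublayers, LayerNorm is used without learnable scale and shift, and the value of `eps` is an arbitrary choice.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learnable scale/shift, following the formula above."""
    d = len(x)
    mu = sum(x) / d
    var = sum(v * v for v in x) / d - mu * mu   # (1/d)||x||^2 - mu^2
    denom = math.sqrt(var + eps)
    return [(v - mu) / denom for v in x]

def toy_mhsa(u):
    # Hypothetical stand-in for attention: any map R^d -> R^d fits here.
    return [0.5 * v for v in u]

def toy_ffn(u):
    # Hypothetical stand-in for the feed-forward network (ReLU, then scale).
    return [0.1 * max(v, 0.0) for v in u]

def pre_ln_block(x):
    """One Pre-LN block: normalize, apply the sublayer, add to the residual."""
    u1 = layer_norm(x)
    y = [a + b for a, b in zip(x, toy_mhsa(u1))]     # y = x + MHSA(LN1(x))
    u2 = layer_norm(y)
    return [a + b for a, b in zip(y, toy_ffn(u2))]   # x_next = y + FFN(LN2(y))

x0 = [1.0, -2.0, 3.0, 0.5]
x1 = pre_ln_block(x0)
x_hat = layer_norm(x1)   # final LayerNorm applied after the last block
```

Note that the residual stream `x` itself is never normalized inside the block; only the inputs to the sublayers are, which is why a final LayerNorm is applied at the end.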

2. Functional Properties and Gradient Flow

Pre-LN residual blocks address the vanishing/exploding gradient issues associated with Post-LN alternatives. For a Pre-LN transformer with $N$ layers, one can upper-bound the $L_2$-norm of the gradient of the loss with respect to the input of the $\ell$-th LN layer by a product of per-layer factors, each expressed through spectral norms $\|\cdot\|_2$ of the layer Jacobians, multiplied by a term that gathers the downstream head Jacobians. Each factor is ensured to be at least $1$, meaning gradients do not degrade or explode geometrically as in Post-LN. Notably, the upper bound is largest for early layers and decays monotonically with depth. Furthermore, the norm of the gradient driving genuine learning dominates the norm of the gradient driving memorization of noise. This supports the empirical robustness of Pre-LN transformers across deep and wide architectures (Singhal et al., 13 Nov 2025).
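To make the role of the identity path concrete, the following is a generic chain-rule sketch, not the specific bound derived by Singhal et al. It assumes the single-sublayer block form $x_{k+1} = x_k + \mathcal{S}_k(\mathrm{LayerNorm}(x_k))$ from Section 1 with $L$ blocks; the symbols $A_k$, $J_{\mathcal{S}_k}$, and $J_{\mathrm{LN},k}$ (the Jacobians of the sublayer and of LayerNorm at block $k$) are notation introduced here for illustration only.

```latex
\begin{align*}
\frac{\partial x_{k+1}}{\partial x_k}
  &= I + A_k, \qquad A_k = J_{\mathcal{S}_k}\, J_{\mathrm{LN},k}, \\
\frac{\partial x_L}{\partial x_\ell}
  &= (I + A_{L-1})\,(I + A_{L-2}) \cdots (I + A_\ell), \\
\frac{\partial \mathcal{L}}{\partial x_\ell}
  &= \Bigl(\frac{\partial x_L}{\partial x_\ell}\Bigr)^{\!\top}
     \frac{\partial \mathcal{L}}{\partial x_L}, \\
\Bigl\| \frac{\partial \mathcal{L}}{\partial x_\ell} \Bigr\|_2
  &\le \prod_{k=\ell}^{L-1}
       \bigl(1 + \|J_{\mathcal{S}_k}\|_2\, \|J_{\mathrm{LN},k}\|_2\bigr)\,
       \Bigl\| \frac{\partial \mathcal{L}}{\partial x_L} \Bigr\|_2 .
\end{align*}
```

The last line uses $\|I + A\|_2 \le 1 + \|A\|_2$ and submultiplicativity of the spectral norm. Every factor in the product is at least 1, and the product has fewer factors as $\ell$ grows, so this generic bound is non-increasing in depth, in line with the qualitative behavior described above.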

3. Pre-LayerNorm versus RMSNorm and CRMSNorm: Computational Unification

While LayerNorm recenters and rescales vectors, RMSNorm performs only RMS-based rescaling:

$$\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\|x\|_2^2 + \epsilon}}$$

If $x$ is zero mean, $\mathrm{LayerNorm}(x) = \mathrm{RMSNorm}(x)$. Pre-LN transformers allow all main-branch activations to be zero mean by re-centering on the fly:

$$x \mapsto x - \mu(x)\mathbf{1}$$

This enables LayerNorm to be algebraically replaced by RMSNorm, with all redundancy in the mean eliminated. Further, any zero-mean vector $x\in\mathbb{R}^d$ can be losslessly compressed to its first $d-1$ components, since the last component is determined by the others; this leads to the "Compressed RMSNorm" (CRMSNorm) variant:

$$x_d = -\sum_{i=1}^{d-1} x_i$$

$$\mathrm{CRMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\left(\sum_{i=1}^{d-1} x_i^2 + \left(\sum_{i=1}^{d-1} x_i\right)^2\right) + \epsilon}}, \quad x\in\mathbb{R}^{d-1}$$

Replacing Pre-LN by Pre-RMSNorm or Pre-CRMSNorm produces variants with no change in function and strictly reduced floating-point operations (Jiang et al., 2023).
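The relationship between the three normalizations can be checked numerically. The sketch below is an illustration in plain Python, assuming no learnable scale or shift and an arbitrary `eps`; the function names are chosen here for readability and are not taken from any library. It verifies that LayerNorm on a vector matches RMSNorm on its centered version, and that the compressed variant reproduces the same values from only the first $d-1$ entries.

```python
import math

def layer_norm(x, eps=1e-5):
    d = len(x)
    mu = sum(x) / d
    var = sum(v * v for v in x) / d - mu * mu
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    d = len(x)
    ms = sum(v * v for v in x) / d            # mean square, no centering
    return [v / math.sqrt(ms + eps) for v in x]

def crms_norm(xc, eps=1e-5):
    """xc holds the first d-1 entries of a zero-mean vector of length d."""
    d = len(xc) + 1
    s = sum(xc)                               # the dropped entry equals -s
    ms = (sum(v * v for v in xc) + s * s) / d
    return [v / math.sqrt(ms + eps) for v in xc]

x = [1.0, -2.0, 3.0, 0.5]
mu = sum(x) / len(x)
xz = [v - mu for v in x]                      # zero-mean version of x

ln_out = layer_norm(x)                        # recenters internally
rms_out = rms_norm(xz)                        # relies on xz being zero mean
crms_out = crms_norm(xz[:-1])                 # sees only d-1 entries
restored = crms_out + [-sum(crms_out)]        # recover the dropped entry
```

Here `ln_out`, `rms_out`, and `restored` agree up to floating-point rounding, which is the sense in which the mean computation in LayerNorm is redundant once the main branch is kept zero mean.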

4. Equivalence Theorems and Reparameterization

Pre-LN, Pre-RMSNorm, and Pre-CRMSNorm transformers are proven to be arithmetically equivalent at both training and inference:

$$\text{Pre-LN} \;\equiv\; \text{Pre-RMSNorm} \;\equiv\; \text{Pre-CRMSNorm}$$

This equivalence is established through three key properties:

  • LayerNorm on zero-mean inputs is identical to RMSNorm.
  • Mean-centering can be algebraically absorbed into linear weights/biases (Lemma 1).
  • The $d$-to-$(d-1)$ vector compression on zero-mean activations is lossless; surrounding linear layers can be rewritten correspondingly.

Training equivalence is maintained by conceptual "master copy" weights from Pre-LN, with forward and backward passes executed via transformed parameters without affecting gradient trajectories. This unification demonstrates that any Pre-LN transformer can be exchanged for more efficient variants without fine-tuning or loss of function (Jiang et al., 2023).
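The second property, absorbing mean-centering into a linear layer, follows from the fact that centering is itself a linear map. The sketch below illustrates the idea in plain Python for a single layer $y = Wx + b$; it is an assumption-laden illustration rather than the exact construction of Lemma 1, and the helper names `center` and `absorb_centering` are hypothetical. Subtracting each column's mean from $W$ and the mean of $b$ from $b$ yields a layer whose output is already zero mean.

```python
def linear(W, b, x):
    """y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def center(v):
    """Subtract the mean of v from every entry."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def absorb_centering(W, b):
    """Return (W_c, b_c) such that W_c x + b_c == center(W x + b) for all x."""
    rows, cols = len(W), len(W[0])
    col_means = [sum(W[i][j] for i in range(rows)) / rows for j in range(cols)]
    W_c = [[W[i][j] - col_means[j] for j in range(cols)] for i in range(rows)]
    return W_c, center(b)

W = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
b = [0.1, -0.2, 0.4]
x = [2.0, -3.0]

explicit = center(linear(W, b, x))        # center the output at run time
W_c, b_c = absorb_centering(W, b)         # fold centering into the weights
absorbed = linear(W_c, b_c, x)            # no run-time centering needed
```

Because the transformation is applied once to the parameters, the centered model costs no more per forward pass than the original linear layer.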

5. Empirical Findings: Efficiency and Learning Dynamics

LayerNorm accounts for approximately 10–15% of runtime in a Pre-LN block. Replacing Pre-LN with Pre-RMSNorm yields consistent efficiency gains: speedups of 1–10% in inference and 1–3% in end-to-end training are observed on Vision Transformer and GPT-3-like benchmarks using A100 GPUs, CPUs, and JAX. These improvements arise from RMSNorm being 20–60% cheaper than LayerNorm. Pre-CRMSNorm offers up to a further 10% inference speedup when hardware efficiently accommodates the $d-1$ compression, though on current GPUs, dimensions are often restored to $d$, making Pre-CRMSNorm and Pre-RMSNorm nearly identical in speed (Jiang et al., 2023).

Empirically, the role of LayerNorm parameters is pivotal. In Pre-LN models, removing LN parameters (i.e., setting $\gamma = \mathbf{1}$, $\beta = \mathbf{0}$) results in catastrophic failure to learn: test accuracy collapses irrecoverably, and memorization of noisy samples persists, with a sharp increase in the overfitting gap. This underscores the necessity of normalization for both gradient stability and genuine learning in Pre-LN blocks (Singhal et al., 13 Nov 2025).

6. Influence of Early, Middle, and Late Layer Normalization

LayerNorm's impact in Pre-LN blocks is stratified by depth. Removing normalization in early layers leads to the most severe destabilization of learning and highest memorization rates. This is quantitatively supported by the decay in gradient-norm upper bounds from early to late layers. Conversely, in Post-LN models, removing early LN parameters suppresses memorization and restores genuine label recovery, demonstrating an architectural dichotomy in the function of layer normalization (Singhal et al., 13 Nov 2025). Practical recommendations include preserving LN parameters in early Pre-LN layers to ensure optimization stability and generalization, and preferring Pre-LN design over Post-LN in new architectures where stable training is critical.

7. Summary Table: Pre-LN, Pre-RMSNorm, and Pre-CRMSNorm

| Variant | Normalization operation | Inference speedup |
|---|---|---|
| Pre-LayerNorm | $\mathrm{LayerNorm}(x) = \frac{x - \mu(x)\mathbf{1}}{\sqrt{\frac{1}{d}\lVert x\rVert_2^2 - \mu(x)^2 + \epsilon}}$ | Baseline |
| Pre-RMSNorm | $\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\lVert x\rVert_2^2 + \epsilon}}$ | 1–10% speedup |
| Pre-CRMSNorm | $\mathrm{CRMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{d}\left(\sum_{i=1}^{d-1} x_i^2 + \left(\sum_{i=1}^{d-1} x_i\right)^2\right) + \epsilon}}$, where $x\in\mathbb{R}^{d-1}$ is the compressed vector | Up to 10% further* |

*When hardware efficiently utilizes the $d-1$ dimension; in practice often similar to Pre-RMSNorm.

The equivalence of these variants enables transformer designers to directly substitute more efficient Pre-RMSNorm or Pre-CRMSNorm blocks in existing Pre-LN architectures, preserving all functional, optimization, and learning-theoretic properties (Jiang et al., 2023).

References (2)
