LayerNorm After First Feed-Forward Linear Layer

Updated 8 May 2026

Placing LayerNorm after the first feed-forward linear layer balances gradient magnitudes across layers, improving convergence and lowering validation perplexity (Shleifer et al., 2021).
Empirical results show that post-activation LayerNorm achieves neutral initialization and minimizes class bias across transformer layers.
The method enforces geometric constraints that reduce effective Bayesian parameters, simplifying model complexity and accelerating training.

LayerNorm After First Feed-Forward Linear Layer refers to the architectural practice of inserting a Layer Normalization (LayerNorm) operation after the initial linear transformation in a feed-forward block (typically $W_1x + b_1$), either directly before or directly after the nonlinear activation function. This design choice has concrete mathematical, optimization, geometric, and initialization implications in deep neural networks, especially in transformer and MLP-based architectures.

1. Mathematical Formulation and Placement

The canonical setting for "LayerNorm after first feed-forward linear layer" appears in both standard MLPs and Transformer FFN blocks. Consider an input $x \in \mathbb{R}^d$ , and a two-layer FFN:

$\begin{aligned} a &= W_1x + b_1 \\ h &= \phi(a) \\ y &= W_2h + b_2 \end{aligned}$

Layer Normalization can be interposed:

After the first linear layer: $h = \phi(\text{LayerNorm}(W_1x + b_1))$
After activation: $h = \text{LayerNorm}(\phi(W_1x + b_1))$

In "NormFormer: Improved Transformer Pretraining with Extra Normalization," the extra LayerNorm is applied after the activation within the FFN: $x_\ell \rightarrow \text{LN}(x_\ell) \rightarrow W_1(\cdot) + b_1 \rightarrow \text{GELU}(\cdot) \rightarrow \mathbf{LN}(\cdot) \rightarrow W_2(\cdot) + b_2 \rightarrow x_{\ell+1} = x_\ell + \cdots$ (the bold $\mathbf{LN}$ is the added normalization), with

$\text{LayerNorm}(h) = \gamma \circ \frac{h - \mathbb{E}[h]}{\sqrt{\mathrm{Var}[h] + \epsilon}} + \beta$

where the learnable scale $\gamma$ is initialized to $1$, the shift $\beta$ to $0$, and $\epsilon$ is a small constant added for numerical stability (Shleifer et al., 2021).
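The two placements and the LayerNorm formula above can be sketched in a few lines of dependency-free Python. This is an illustrative sketch, not the NormFormer implementation: the helper names (`layer_norm`, `linear`, `ffn`), the `placement` strings, the toy weights, the default `eps=1e-5`, and the use of ReLU for $\phi$ are all assumptions made for the example.

```python
import math

def layer_norm(h, gamma=None, beta=None, eps=1e-5):
    """Per-example LayerNorm over a feature vector h (list of floats)."""
    d = len(h)
    gamma = gamma if gamma is not None else [1.0] * d  # scale initialized to 1
    beta = beta if beta is not None else [0.0] * d     # shift initialized to 0
    mean = sum(h) / d
    var = sum((v - mean) ** 2 for v in h) / d          # biased variance, as in the formula
    inv_std = 1.0 / math.sqrt(var + eps)
    return [g * (v - mean) * inv_std + b for v, g, b in zip(h, gamma, beta)]

def linear(W, b, x):
    """Compute W x + b for a row-major weight matrix W."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def relu(a):
    return [max(0.0, v) for v in a]

def ffn(x, W1, b1, W2, b2, placement="after_activation", phi=relu):
    """Two-layer FFN with an optional LayerNorm around the activation."""
    a = linear(W1, b1, x)
    if placement == "after_linear":        # h = phi(LayerNorm(W1 x + b1))
        h = phi(layer_norm(a))
    elif placement == "after_activation":  # h = LayerNorm(phi(W1 x + b1))
        h = layer_norm(phi(a))
    elif placement == "none":              # plain FFN, no inner normalization
        h = phi(a)
    else:
        raise ValueError(f"unknown placement: {placement}")
    return linear(W2, b2, h)

# Toy weights chosen by hand for illustration only.
W1 = [[1.0, -2.0], [0.5, 0.0], [-1.0, 1.0], [2.0, 1.0]]
b1 = [0.0, 0.1, -0.1, 0.0]
W2 = [[1.0, 0.0, -1.0, 0.5], [0.0, 1.0, 1.0, -0.5]]
b2 = [0.0, 0.0]
x = [0.3, -0.7]

y_after_linear = ffn(x, W1, b1, W2, b2, placement="after_linear")
y_after_activation = ffn(x, W1, b1, W2, b2, placement="after_activation")
```

Passing a different `phi` (for example a GELU implementation) changes only the activation; the ordering of normalization and activation is controlled entirely by `placement`.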

2. Impact on Optimization and Gradient Flow

In Pre-LayerNorm Transformers, gradients to early layers are much larger than to deeper layers, causing a gradient magnitude mismatch and inefficient utilization of parameters. The insertion of an extra LayerNorm after the first feed-forward (linear+GELU) addresses this by:

Downscaling gradients in early FFN layers and upscaling them in later layers.
Aligning the magnitude of gradient updates across all layers, which improves both convergence speed and final validation perplexity.

Empirical evidence from (Shleifer et al., 2021) demonstrates:

The gradient norm band is significantly tightened across layers.
Removing this single FFN LayerNorm (with the other modifications held fixed) raises validation perplexity for a 125M-parameter language model, confirming its unique utility.

3. Effects on Initialization and Predictive Bias

Normalization placement fundamentally alters the distribution of network outputs at initialization. As rigorously studied in (Francazi et al., 16 May 2025):

Norm Before Activation ("Linear → LayerNorm → ReLU") produces prejudiced initializations: initial predictions are biased toward a subset of classes, with the fraction of inputs assigned to each class spread toward the extremes.
Norm After Activation ("Linear → ReLU → LayerNorm") yields neutral initializations: all output logits are symmetrically distributed, with the fraction of inputs assigned to each class tightly centered at $1/C$ (where $C$ is the number of classes), regardless of depth.

This is mathematically guaranteed by LayerNorm's per-example feature centering, which eliminates the mean fluctuation term from the output logit distribution at all depths and collapses the inter-initialization variance.

Empirical validation (Francazi et al., 16 May 2025):

In 20-layer MLPs, "LN after ReLU" produces uniform class distributions at initialization, while "LN before ReLU" amplifies bias with depth.
On CIFAR-10, the neutralization effect persists in real structured datasets.
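The kind of measurement described above can be sketched as a small simulation: draw one random, untrained MLP, feed it random inputs, and record the fraction of inputs assigned to each class under both placements. Everything here is an assumption for illustration (the function name `class_fractions`, Gaussian inputs, He-style weight scaling, no biases, and the small depth and width); it is not the experimental setup of Francazi et al., and a single seed only gives one sample from the distribution over initializations.

```python
import math
import random

def layer_norm(h, eps=1e-5):
    """LayerNorm with gamma = 1 and beta = 0 (values at initialization)."""
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in h]

def relu(a):
    return [max(0.0, v) for v in a]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def random_matrix(rows, cols, rng):
    std = math.sqrt(2.0 / cols)  # He-style scaling (an assumption)
    return [[rng.gauss(0.0, std) for _ in range(cols)] for _ in range(rows)]

def class_fractions(placement, depth=8, width=16, n_classes=2,
                    n_inputs=100, seed=0):
    """Fraction of random inputs assigned to each class by an untrained MLP."""
    rng = random.Random(seed)  # same seed gives the same weights and inputs
    hidden = [random_matrix(width, width, rng) for _ in range(depth)]
    readout = random_matrix(n_classes, width, rng)
    counts = [0] * n_classes
    for _ in range(n_inputs):
        h = [rng.gauss(0.0, 1.0) for _ in range(width)]
        for W in hidden:
            a = matvec(W, h)
            if placement == "ln_before":    # Linear -> LayerNorm -> ReLU
                h = relu(layer_norm(a))
            elif placement == "ln_after":   # Linear -> ReLU -> LayerNorm
                h = layer_norm(relu(a))
            else:
                raise ValueError(f"unknown placement: {placement}")
        logits = matvec(readout, h)
        counts[logits.index(max(logits))] += 1  # argmax prediction
    return [c / n_inputs for c in counts]

frac_before = class_fractions("ln_before")
frac_after = class_fractions("ln_after")
```

To approximate the distributions discussed in the text, one would repeat this over many seeds and look at the histogram of each class fraction rather than a single draw.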

4. Geometric and Bayesian Complexity Costs

LayerNorm imposes strict geometric constraints on the activations it normalizes. By mean-centering, it projects representations onto a codimension-1 hyperplane ($\sum_i h_i = 0$ before the affine scale and shift). This constrains the downstream weight matrix and reduces the Local Learning Coefficient (LLC) (Chun, 28 Mar 2026), or effective Bayesian parameter count, by an exact amount determined by the output dimension of the subsequent weight matrix. No such drop occurs for RMSNorm.

Controlled experiments confirm that this LLC reduction is:

Exact and binary for affine (flat) hyperplanes created by LayerNorm.
Absent for full-dimensional spheres (RMSNorm) or for non-affine, curved surfaces.
Robust across the settings varied in the experiments and in the presence of an affine bias with simplex data.

This structural reduction in model complexity occurs before any training and reflects true reductions in the “observable” directions in parameter space.
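The unobservable directions can be made concrete with a small numerical check. Because a mean-centered vector $h$ satisfies $\sum_i h_i = 0$, adding any constant to every entry of a row of the downstream weight matrix leaves that row's output unchanged, so each output row carries one direction the data cannot see. The sketch below only illustrates this invariance; it does not compute an LLC, and the matrix `W`, the offsets `u`, and the helper names are made up for the example.

```python
import math

def layer_norm(h, eps=1e-5):
    """LayerNorm with gamma = 1 and beta = 0, so the output sums to zero."""
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in h]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

# Mean-centered activations: they lie on the hyperplane sum(h) = 0.
h = layer_norm([0.2, -1.3, 0.7, 2.1])

# Downstream weight matrix with 3 output rows and 4 input columns.
W = [[1.0, 2.0, -1.0, 0.5],
     [0.0, -0.5, 3.0, 1.0],
     [2.0, 1.0, 1.0, -1.0]]

# Shift every entry of row i by an arbitrary constant u[i], i.e. W + u 1^T.
u = [5.0, -2.0, 0.3]
W_shifted = [[w + ui for w in row] for row, ui in zip(W, u)]

y = matvec(W, h)
y_shifted = matvec(W_shifted, h)

# The outputs agree up to floating-point rounding, because u[i] * sum(h) = 0.
max_diff = max(abs(a - b) for a, b in zip(y, y_shifted))
```

With three output rows there are three such free offsets, one per row, which is why the size of the effect scales with the output dimension of the subsequent weight matrix.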

5. Comparison to Standard and Alternative Normalization Placements

Baseline transformer architectures ("Pre-LN" and "Post-LN") insert only one LayerNorm per FFN: either immediately before the sublayer or after the residual addition (Shen, 2022). No LayerNorm is present between the two FFN linear layers ($W_1$ and $W_2$) by default.

FoundationLayerNorm (Shen, 2022): Scales the residual input before the post-FFN LayerNorm, enabling deep (1,000-layer) training, but does not place an extra LayerNorm after the first FFN linear layer. No empirical evidence is shown for gradient norm balancing at the level achieved by NormFormer's FFN LN insertion.
RMSNorm: Projects onto a sphere, preserving the LLC, and does not enforce mean centering.

This highlights the uniqueness of the LayerNorm-after-FFN design used in NormFormer (Shleifer et al., 2021) and the "LN after ReLU" variant (Francazi et al., 16 May 2025) for balancing gradient flow, promoting neutrality, and structurally reducing complexity.
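The geometric difference between the two normalizers can be checked numerically. The sketch below, with hypothetical helper names and a gain-free RMSNorm (the learnable scale is omitted as a simplifying assumption), shows that LayerNorm output is mean-centered while RMSNorm output generally is not, even though both have norm close to $\sqrt{d}$.

```python
import math

def layer_norm(h, eps=1e-5):
    """LayerNorm without affine parameters: center, then rescale."""
    mean = sum(h) / len(h)
    var = sum((v - mean) ** 2 for v in h) / len(h)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in h]

def rms_norm(h, eps=1e-5):
    """RMSNorm without a learnable gain: rescale only, no centering."""
    rms = math.sqrt(sum(v * v for v in h) / len(h) + eps)
    return [v / rms for v in h]

def l2_norm(h):
    return math.sqrt(sum(v * v for v in h))

a = [0.2, -1.3, 0.7, 2.1]   # arbitrary pre-normalization activations, d = 4
ln_out = layer_norm(a)
rn_out = rms_norm(a)

mean_ln = sum(ln_out) / len(ln_out)   # approximately 0: on the hyperplane
mean_rn = sum(rn_out) / len(rn_out)   # generally nonzero: no centering
norm_ln = l2_norm(ln_out)             # approximately sqrt(d) = 2
norm_rn = l2_norm(rn_out)             # approximately sqrt(d) = 2
```

In other words, LayerNorm output sits on the intersection of a sphere and the zero-mean hyperplane, while RMSNorm output is constrained only to the sphere.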

6. Implementation, Initialization, and Practical Training Effects

Implementation of LayerNorm after the first FFN linear layer uses framework-native fused LayerNorm modules with standard parameters ($\epsilon$ for numerical stability, per-feature $\gamma$ and $\beta$). No special initialization or training tricks are needed (Shleifer et al., 2021).

This extra normalization has measurable practical benefits:

Low computational overhead: only a small increase in parameter count.
Faster pretraining convergence: NormFormer reaches the baseline perplexity in 60% of the GPU-time, or achieves 0.27 lower perplexity in fixed compute (Shleifer et al., 2021).
Consistent gains in downstream transfer: the GLUE average score increases for a 125M-parameter model.
Streamlined learning dynamics: bias-neutral initializations avoid two-phase dynamics and accelerate transition to informative gradients (Francazi et al., 16 May 2025).
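The parameter overhead of one extra FFN LayerNorm can be estimated with simple arithmetic: the normalization adds one scale and one shift per hidden feature, while the FFN itself holds two weight matrices and two bias vectors. The dimensions below (`d_model = 768`, `d_ff = 3072`) are assumed values typical of a small transformer, not figures taken from the cited papers, and the count covers only the FFN LayerNorm rather than every NormFormer modification.

```python
def ffn_param_count(d_model, d_ff):
    """Parameters in W1 (d_ff x d_model), b1, W2 (d_model x d_ff), b2."""
    return d_ff * d_model + d_ff + d_model * d_ff + d_model

def layer_norm_param_count(n_features):
    """One scale (gamma) and one shift (beta) per normalized feature."""
    return 2 * n_features

d_model, d_ff = 768, 3072                   # assumed dimensions, for illustration
base = ffn_param_count(d_model, d_ff)       # 4,722,432
extra = layer_norm_param_count(d_ff)        # 6,144 (LN acts on the d_ff hidden units)
relative_overhead = extra / base            # roughly 0.13% of the FFN parameters
```

Under these assumptions the extra normalization is a rounding error next to the FFN weights, which is consistent with the claim of low overhead.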

7. Design Principles and Architectural Guidelines

Key empirically and theoretically supported guidelines (Francazi et al., 16 May 2025, Shleifer et al., 2021):

Use LayerNorm immediately after the nonlinearity (Linear → ReLU/GELU → LN) to guarantee a neutral initial state and balanced convergence.
In transformer FFN blocks, extra LN after the first linear+GELU (NormFormer's scheme) stabilizes gradients at all depths and improves both speed and accuracy.
RMSNorm may be preferred if preservation of full LLC is desired, but does not provide gradient norm balancing at the FFN sublayer.
The geometric cost (LLC penalty) of LayerNorm is structural and should be considered if Bayesian model complexity is critical to the application.
In highly correlated datasets (e.g., MNIST), residual bias after LN is possible; introducing small random input shifts can mitigate this effect.

Summary Table: LayerNorm Placement Effects

Placement Variant	Initialization Bias	Gradient Norm Balancing	LLC Drop
Linear → LayerNorm → ReLU	Prejudiced	No	Yes
Linear → ReLU → LayerNorm	Neutral	Yes	Yes
Linear → RMSNorm (any point)	Neutral	No	None
Linear → ReLU (no normalization)	Prejudiced	No	None
NormFormer: Pre-LN + LN after FFN Linear+Act	Neutral	Yes	Yes

The insertion of LayerNorm after the first feed-forward linear (and activation) is a principled and empirically validated strategy for promoting gradient stability, initialization neutrality, and predictable geometric constraints in deep neural network design (Shleifer et al., 2021, Francazi et al., 16 May 2025, Chun, 28 Mar 2026, Ba et al., 2016).