PreLayerNorm Patch Embedding for ViTs
- PreLayerNorm Patch Embedding is a technique in Vision Transformers that applies LayerNorm before and after patch embedding to stabilize gradients and enhance positional encoding.
- It improves performance by regularizing patch statistics and ensuring scale invariance, resulting in notable accuracy and robustness gains on benchmarks like ImageNet.
- Variants such as Dual PatchNorm and LaPE implement distinct normalization strategies, offering flexible architectural refinements with negligible computational overhead.
PreLayerNorm Patch Embedding in Vision Transformers refers to a family of architectural modifications that insert Layer Normalization (LN) strategically before and/or after the patch embedding and position embedding stages in Vision Transformer (ViT) models. These modifications, including PreLayerNorm stems, Dual PatchNorm, and independent normalization of patch and position embeddings, aim to address optimization stability, robustness to input variations, and the expressive utilization of positional information—while incurring negligible computational or parameter overhead.
1. Architectural Overview
Standard ViT architectures process an image by extracting non-overlapping patches, flattening each patch, and mapping the result via a linear projection to obtain token (patch) embeddings x, followed by the addition of a position embedding p. Conventionally, this sum is either normalized once before entering the transformer encoder (“single-LN”) or not normalized (“no-LN”), depending on the variant.
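As a concrete reference point, this standard stem can be sketched in NumPy; all sizes and variable names here are illustrative (a toy 32×32 RGB image with 8×8 patches), not taken from any specific codebase:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy configuration (illustrative): 32x32 RGB image, 8x8 patches, d = 64.
H = W = 32
P = 8                     # patch side length
C = 3                     # channels
d = 64                    # embedding dimension
N = (H // P) * (W // P)   # number of patches (16)

image = rng.standard_normal((H, W, C))

# Patchify: split into non-overlapping PxP patches and flatten each one.
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
x = patches.reshape(N, P * P * C)          # (N, P*P*C)

# Linear projection to token embeddings, then add a position embedding.
W_p = rng.standard_normal((P * P * C, d)) * 0.02
b_p = np.zeros(d)
pos = rng.standard_normal((N, d)) * 0.02   # learnable in practice

tokens = x @ W_p + b_p + pos               # (N, d) input to the encoder
```

The PreLayerNorm variants below differ only in where LayerNorms are inserted into this pipeline.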
PreLayerNorm Patch Embedding modifies this workflow by introducing one or more LayerNorms before, after, or in between patch-projection and position embeddings. Key variants include:
- PreLayerNorm on Patch Embeddings: LayerNorm applied to either raw or projected patch vectors before position embedding addition (Kim et al., 2021).
- Dual PatchNorm: Two LayerNorms—one before the linear patch embedding, one after but before position embedding addition (Kumar et al., 2023).
- Independent Normalization for Patch and Position Embeddings (LaPE): Separate LayerNorms for the token embeddings x and the position embedding p, summed as input to each encoder layer, with their own learnable scales and biases (Yu et al., 2022).
These variants can be composed; for example, applying Dual PatchNorm at the stem and independent LayerNorms in each encoder layer.
2. Mathematical Formalism and Implementation
The core of PreLayerNorm Patch Embedding is the application and placement of LayerNorm, defined as LN(x) = γ ⊙ (x − μ)/σ + β, where μ and σ are the mean and standard deviation of x computed over the embedding dimension, and γ, β are learnable scale and shift parameters.
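A minimal NumPy sketch of this per-token LayerNorm (names and shapes are illustrative):

```python
import numpy as np

def layer_norm(x, gamma=None, beta=None, eps=1e-6):
    """LN(x) = gamma * (x - mu) / sigma + beta over the last (embedding) axis."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    x_hat = (x - mu) / sigma
    if gamma is not None:
        x_hat = x_hat * gamma
    if beta is not None:
        x_hat = x_hat + beta
    return x_hat

# Each row (token) is standardized independently of the others.
x = np.random.default_rng(0).standard_normal((4, 16))
y = layer_norm(x)
```

With γ = 1 and β = 0 (the usual initialization), every token comes out with zero mean and unit standard deviation along the embedding dimension.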
Key workflows:
- Patchify → Pre-LN → Dense → Post-LN → Add Position Embedding (Kumar et al., 2023):
```
x = patchify(image)   # B x N x P
x = LayerNorm(x)      # LN₀: pre-embedding
x = Dense(x)          # Wₚ x + bₚ
x = LayerNorm(x)      # LN₁: post-embedding
x = x + position_embedding
```
- Layer-adaptive Position Embedding (LaPE) (Yu et al., 2022):
x^l ← LN_x^l(x^l) + LN_p^l(p), with independently learned scales and shifts (γ_x^l, β_x^l) and (γ_p^l, β_p^l) for tokens and position respectively, at each encoder layer l.
- PreLayerNorm on Raw Patch Vectors (Kim et al., 2021): z = LN(x) + p, i.e. LayerNorm is applied to the patch embeddings before the position embedding is added.
The LaPE operator can be used for absolute or relative position encodings, with the independence of normalization enabling dynamic adaptation across layers.
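As an illustrative NumPy sketch (toy sizes, hypothetical names), LaPE's independent normalization of tokens and positions at a single encoder layer might look like:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    return gamma * (x - mu) / sigma + beta

rng = np.random.default_rng(0)
N, d = 16, 64
x_l = rng.standard_normal((N, d))   # token embeddings entering layer l
p = rng.standard_normal((N, d))     # shared position embedding

# Independent affine parameters per layer (gamma = 1, beta = 0 at init).
gamma_x, beta_x = np.ones(d), np.zeros(d)
gamma_p, beta_p = np.ones(d), np.zeros(d)

# LaPE: normalize tokens and positions separately, then sum.
layer_input = layer_norm(x_l, gamma_x, beta_x) + layer_norm(p, gamma_p, beta_p)
```

Because each layer owns its (γ, β) pairs, the relative contribution of positional information can adapt with depth instead of being fixed by a single shared normalization.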
3. Theoretical Motivation and Analytical Properties
Several issues are addressed by PreLayerNorm Patch Embedding:
- Scale-Invariance: ViTs degrade under input contrast variations because the position embedding, added at a fixed scale before normalization, can become negligible as the patch embedding scale varies. PreLayerNorm absorbs global scale and shift before the position embedding is added, achieving exact invariance to channel-wise linear rescaling and bias a·x + b: LN(a·x + b) + p = LN(x) + p, which does not hold for LN(x + p) (Kim et al., 2021).
- Regularization of Patch Statistics: Applying a LayerNorm before patch projection standardizes raw patch statistics, providing more uniform, well-conditioned inputs to the linear mapping (Kumar et al., 2023).
- Stabilization of Gradients: Dual PatchNorm suppresses extreme gradient values at the stem—gradient norms at the patch embedding layer are brought in line with the rest of the ViT backbone, improving optimization (Kumar et al., 2023).
- Expressivity of Position Embedding: Single-LN approaches force a shared affine transform on two semantically distinct signals (patch content, position info), which can diminish the effect of positional bias as shown by reparameterizations. Empirical cosine-similarity analyses show that single-LN often restricts learned PEs to monotonic or degenerate directions, whereas independent LayerNorms create richer, more hierarchical, and more global positional correlations as depth increases (Yu et al., 2022).
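The scale-invariance property above can be checked numerically; a small NumPy sketch (hypothetical names, γ = 1, β = 0 for simplicity):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    return (x - mu) / sigma

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 64))   # patch embeddings
p = rng.standard_normal((16, 64))   # position embedding
a, b = 1.5, 0.3                     # rescale and shift (e.g. a contrast change)

pre_ln = layer_norm(a * x + b) + p      # PreLayerNorm: position added after LN
single = layer_norm((a * x + b) + p)    # single-LN: position added before LN

# PreLayerNorm is invariant to the rescale (up to eps) ...
assert np.allclose(pre_ln, layer_norm(x) + p, atol=1e-4)
# ... while the single-LN input is not.
assert not np.allclose(single, layer_norm(x + p), atol=1e-4)
```

The rescale a·x + b shifts the per-token mean by b and scales the deviation by a, both of which LayerNorm removes; once p is mixed in before normalization, this cancellation no longer applies.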
4. Comparative Analysis of Schemes
PreLayerNorm variants are contrasted in the following table:
| Scheme | LayerNorm Placement | Key Properties |
|---|---|---|
| Default ViT (Single-LN) | LN(x + p) (pre-attention) | Shared scale for x, p; prone to “washed out” PE; not scale-invariant (Yu et al., 2022, Kim et al., 2021) |
| No-LN / Post-LN | (None or only after residual addition) | Possible instability; less robust to input shifts (Yu et al., 2022) |
| PreLayerNorm (Kim et al., 2021) | LN on patches pre-projection, or pre-pos | Achieves scale invariance; minimal complexity (Kim et al., 2021) |
| Dual PatchNorm (DPN) | LN pre-projection, LN post-projection | Conditions patch statistics, stabilizes gradients (Kumar et al., 2023) |
| LaPE (Two-LN) | LN on x, LN on p, then sum | Adaptive per-layer token/positional contribution, richer PE utilization (Yu et al., 2022) |
Empirical ablations indicate that both pre- and post-projection LNs are necessary for maximal performance gain; omitting either reduces accuracy improvements (Kumar et al., 2023).
5. Experimental Results
PreLayerNorm Patch Embedding variants consistently provide improvements on top-1 accuracy, robustness to corruptions, and training dynamics:
- Dual PatchNorm (DPN) (Kumar et al., 2023):
- ImageNet-1k, ViT-B/16: +0.7 top-1 accuracy
- ViT-B/32: +1.4
- ViT-Ti/16: +1.4
- Ablations show both LNs are essential: pre-only −2.6, post-only −0.5, dual +1.4.
- LaPE (Yu et al., 2022):
- CIFAR-10 (ViT-Lite): +0.94
- CIFAR-100 (CCT): +0.97
- ImageNet-1k (DeiT-Ti, learnable PE): +1.72
- ImageNet-1k (DeiT-Ti, 1D sinusoidal PE): +4.52
- Training/inference overhead is negligible (only a handful of extra LayerNorm parameters per layer).
- Robustness Improvements (Kim et al., 2021):
- Under severe contrast change (factor = 1.5), ViT-L retains substantially more accuracy on Stanford Cars with PreLayerNorm than without it; without PreLayerNorm, effective positional embedding utilization (ECPE) declines sharply.
- Downstream tasks and transfer (Kumar et al., 2023):
- +0.4–1.7 points on ImageNet-21K, JFT-4B 25-shot transfer.
- VTAB and ADE-20K segmentation see consistent, sometimes substantial improvements.
6. Implementation Guidance and Limitations
PreLayerNorm Patch Embedding modifications are straightforward to adopt:
- Code Integration: Add a LayerNorm after patch extraction and/or after patch projection but before positional embedding addition. For LaPE, add independent LN modules for and at each encoder layer. For relative position encoding, apply LN on the learned relative bias tables (Yu et al., 2022).
- Training Protocol: No hyperparameter search or training schedule changes are necessary. Initializations for LayerNorm scales and shifts follow the usual conventions (γ=1, β=0).
- Parameter/Compute Overhead: Dual PatchNorm and LaPE have negligible parameter and compute increases compared to the backbone.
- Best Use Cases: Robustness-critical domains (robotics, medical imaging, surveillance, and adverse lighting), and ViT architectures of varying capacity.
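Putting the guidance above together, a minimal Dual PatchNorm stem can be sketched in NumPy (toy sizes, hypothetical names; an illustrative sketch, not the authors' implementation):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    sigma = np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    return gamma * (x - mu) / sigma + beta

def dual_patchnorm_stem(patches, W_p, b_p, pos, params):
    """Dual PatchNorm stem: LN -> linear projection -> LN -> add positions."""
    g0, b0, g1, b1 = params
    x = layer_norm(patches, g0, b0)    # LN0 on flattened raw patches
    x = x @ W_p + b_p                  # linear patch embedding
    x = layer_norm(x, g1, b1)          # LN1 on projected tokens
    return x + pos                     # position embedding added last

rng = np.random.default_rng(0)
N, D_in, d = 16, 192, 64               # toy sizes (illustrative)
patches = rng.standard_normal((N, D_in))
W_p = rng.standard_normal((D_in, d)) * 0.02
b_p = np.zeros(d)
pos = rng.standard_normal((N, d)) * 0.02

# Standard initialization for both LayerNorms: gamma = 1, beta = 0.
params = (np.ones(D_in), np.zeros(D_in), np.ones(d), np.zeros(d))
tokens = dual_patchnorm_stem(patches, W_p, b_p, pos, params)
```

The two LayerNorms add only 2(D_in + d) parameters here, consistent with the negligible overhead noted above.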
Limitations include no mitigation of nonlinear channel-wise transformations (e.g., saturation), and diminished marginal benefit for extremely prolonged pretraining or large-patch settings. Strong augmentations and other regularization techniques remain orthogonal and beneficial.
7. Significance and Broader Impact
PreLayerNorm Patch Embedding constitutes a minimal yet theoretically grounded refinement to the ViT architecture. By explicitly regularizing patch statistics and/or granting position embeddings an independent normalization path, these modifications improve optimization stability, empirical accuracy, convergence speed, and robustness to image corruptions such as contrast change. The approach is validated across multiple backbones (ViT, DeiT, CCT, CVT, Swin-Tiny, CeiT-Small) and datasets (CIFAR, ImageNet, Stanford Cars/Dogs, VTAB, ADE-20K), with benefits observed consistently and no degradation reported in any regime.
By enabling richer positional representation and ensuring scale-invariance, these techniques align the empirical and theoretical desiderata for ViT systems operating in challenging, real-world conditions. They serve as plug-and-play upgrades for existing codebases, requiring only minor LayerNorm insertions at the embedding stage, and represent a practical best-practice for state-of-the-art vision transformer deployment (Kumar et al., 2023, Yu et al., 2022, Kim et al., 2021).