Normalization-Free Transformers
- Transformers without normalization are architectures that forgo standard normalization layers by employing methods such as Fixup initialization and learnable nonlinear activations.
- They address stability challenges such as exploding gradients and reduce computational overhead, making them suitable for tasks in language, vision, and graphs.
- Empirical findings demonstrate that these approaches can match or even surpass traditional transformer models in efficiency, performance, and interpretability.
Transformers without normalization refers to a broad family of techniques and results challenging the previously canonical role of normalization layers—such as LayerNorm or RMSNorm—in transformer architectures. These approaches demonstrate that, with appropriate initialization, architectural modifications, or novel activation functions, transformers can achieve stable training and competitive or even superior performance in a variety of domains without relying on explicit normalization modules. Developments in this area have delivered both theoretical insights and practical recipes for training, inference, and model analysis.
1. Background and Motivation
Normalization layers in transformers have long been considered indispensable components for stable training, enabling larger learning rates, accelerating convergence, and promoting generalization. LayerNorm, in particular, has become the standard in both language and vision transformers, normalizing each token's activations before or after residual connections. However, normalization incurs additional runtime costs, introduces batch or token-wise dependencies, complicates hardware implementation, and, according to recent works, may not always be essential for performance (1901.09321, 2503.10622, 2507.02559).
The theoretical arguments supporting normalization center on stabilizing activation and gradient statistics, preventing the accumulation of numerical instabilities and ensuring smooth optimization landscapes. Yet, empirical and mathematical investigations now show that careful initialization schemes, learned adaptive rescaling via nonlinearities, or constant-norm projections can provide similar or improved benefits even when normalization layers are omitted.
2. Key Methodologies for Normalization-Free Transformers
Several distinct strategies have been developed to train and deploy transformers without explicit normalization layers:
a. Initialization and Residual Scaling (Fixup Initialization)
Fixup initialization addresses the exploding and vanishing gradient problem in deep residual architectures by rescaling standard initializations so that the magnitude of function changes per stochastic gradient descent (SGD) step is constant, independent of network depth (1901.09321). For a network with $L$ residual blocks and $m$ layers per block, each weight layer inside a residual branch is initialized with a standard scheme and then rescaled by $L^{-1/(2m-2)}$. Additional design choices include initializing the last layer to zero and introducing dedicated scalar biases and multipliers to each branch. These steps ensure that even very deep transformers can be trained without normalization by maintaining stable forward signal and gradient propagation.
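A minimal sketch of this recipe for a single residual MLP block is given below, assuming PyTorch; the module name, block structure, and the choice of $m = 2$ layers per branch are illustrative rather than taken from the reference implementation of (1901.09321).

```python
import torch
import torch.nn as nn

class FixupMLPBlock(nn.Module):
    """Residual MLP block with Fixup-style initialization (no LayerNorm).

    A sketch only: structure and names are illustrative, not the reference
    implementation from arXiv:1901.09321.
    """
    def __init__(self, d_model: int, d_ff: int, num_blocks: int):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff, bias=False)
        self.fc2 = nn.Linear(d_ff, d_model, bias=False)
        # Dedicated scalar biases and a multiplier replace LayerNorm's affine role.
        self.bias1 = nn.Parameter(torch.zeros(1))
        self.bias2 = nn.Parameter(torch.zeros(1))
        self.scale = nn.Parameter(torch.ones(1))

        # Fixup rule with m = 2 layers per residual branch: rescale the standard
        # init of the first layer by L^{-1/(2m-2)} = L^{-1/2}, zero the last layer.
        m = 2
        nn.init.kaiming_normal_(self.fc1.weight)
        self.fc1.weight.data.mul_(num_blocks ** (-1.0 / (2 * m - 2)))
        nn.init.zeros_(self.fc2.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.fc1(x + self.bias1))
        return x + self.scale * self.fc2(h) + self.bias2
```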
b. Nonlinear Elementwise Activation: Dynamic Tanh (DyT)
Dynamic Tanh replaces normalization layers with a learnable, elementwise nonlinearity, $\mathrm{DyT}(x) = \gamma \odot \tanh(\alpha x) + \beta$, where $\alpha$ is a learned scaling parameter and $\gamma$, $\beta$ are per-channel affine parameters (2503.10622). The function mimics the "S-shaped" input–output response observed empirically in LayerNorm activations, effecting a robust squashing of extreme values that mitigates outlier effects and stabilizes activations. DyT does not compute or require batch, token, or feature statistics, making it highly hardware efficient and simple to implement. This mechanism enables normalization-free transformers that match or exceed the performance of their normalized counterparts in vision, language, generative, and self-supervised settings.
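The formula translates almost directly into a drop-in module. The following PyTorch sketch assumes a learnable scalar $\alpha$ and per-channel $\gamma$, $\beta$ as described above; the default initialization of $\alpha$ is an illustrative choice, not a tuned value from (2503.10622).

```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    """Dynamic Tanh: a drop-in replacement for LayerNorm/RMSNorm.

    Sketch of the published formula DyT(x) = gamma * tanh(alpha * x) + beta;
    alpha_init below is illustrative.
    """
    def __init__(self, dim: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((1,), alpha_init))  # learned scalar
        self.gamma = nn.Parameter(torch.ones(dim))                # per-channel scale
        self.beta = nn.Parameter(torch.zeros(dim))                # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # No batch/token statistics: a purely elementwise squashing nonlinearity.
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```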
c. Approximate Normalization via Constant Norm Constraints
The anTransformer method constrains all weight vectors and activations to lie on or near a hypersphere, leveraging the concentration of measure in high dimensions (2505.22014). Rather than rescaling at each step or layer using local statistics, activations are multiplied by a precomputed or expected inverse norm, $\tilde{x} = x \cdot \mathbb{E}[\lVert x \rVert]^{-1}$, and all parameter rows (or columns) are likewise constrained to a fixed norm. Additionally, the LERP correction modifies residual updates to prevent norm growth in deep architectures. These operations yield training stability and efficient scaling laws without the runtime overhead of standard normalization, achieving faster convergence and supporting larger batch sizes.
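The sketch below illustrates the flavor of approximate normalization, not the exact anTransformer recipe: it assumes activations of dimension $d$ concentrate around norm $\sqrt{d}$, uses the fixed scalar $1/\sqrt{d}$ in place of a per-token inverse norm, and renormalizes weight rows to unit norm after each optimizer step. Module and method names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ApproxNormLinear(nn.Module):
    """Hypersphere-style 'approximate normalization' (in the spirit of arXiv:2505.22014).

    Assumptions: activations of dimension d concentrate around norm sqrt(d),
    so a fixed 1/sqrt(d) stands in for a per-token inverse norm; weight rows
    are re-projected to the unit hypersphere after each optimizer step.
    """
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d_out, d_in) / d_in ** 0.5)
        self.inv_expected_norm = 1.0 / d_in ** 0.5  # precomputed, no runtime statistics

    @torch.no_grad()
    def renormalize_(self):
        # Call after each optimizer step to keep rows on the unit hypersphere.
        self.weight.div_(self.weight.norm(dim=1, keepdim=True).clamp_min(1e-8))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the expected inverse norm instead of computing ||x|| per token.
        return F.linear(x * self.inv_expected_norm, self.weight)
```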
d. Adaptive RMS Normalization and Geometry-Preserving Attention for Graphs
For graph learning, adaptive root-mean-square normalization (AdaRMSN) preserves token magnitude while providing stability (2504.12588). The accompanying simplified attention mechanism explicitly incorporates input norm information into the attention computation. This design enables plain transformers to retain information necessary for distinguishing graph structures, which would be lost by standard normalization, and to achieve strong empirical performance across graph datasets.
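Since the exact formulation of (2504.12588) is not reproduced here, the following is only a hypothetical sketch of a norm-adaptive RMS layer: it normalizes by the RMS but re-injects a learned function of the original token norm so magnitude information is not discarded. The module name and the specific gating form are assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveRMSNorm(nn.Module):
    """Hypothetical norm-aware RMS normalization for graph tokens.

    Illustrates the idea of normalizing by the RMS while re-injecting the
    (log-)norm of the input so token magnitude is not erased; this is not
    the AdaRMSN definition from arXiv:2504.12588.
    """
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(dim))
        self.norm_weight = nn.Parameter(torch.zeros(1))  # how much norm info to keep
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = x.pow(2).mean(dim=-1, keepdim=True).clamp_min(self.eps).sqrt()
        normed = x / rms
        # Re-scale by a learned function of the original token norm so that
        # structurally meaningful magnitude differences survive normalization.
        norm_feat = torch.log1p(x.norm(dim=-1, keepdim=True))
        return self.gain * normed * (1.0 + self.norm_weight * norm_feat)
```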
3. Empirical Results and Practical Applications
Normalization-free techniques have been empirically validated in a wide range of settings:
- Vision Recognition and Generation: Vision transformers and ConvNeXt architectures with DyT or Fixup perform on par with, or better than, LayerNorm-equipped baselines on ImageNet-1K classification and generative tasks (e.g., DiT with DyT achieving strong FID scores) (2503.10622).
- Language Modeling: In LLaMA-style LLMs, DyT yields near-identical training loss curves and comparable zero-shot evaluation scores to models with RMSNorm, provided the scaling parameter $\alpha$ is properly initialized (2503.10622).
- Machine Translation: Fixup-initialized transformers achieve BLEU scores indistinguishable from LayerNorm-based models on IWSLT and WMT datasets (1901.09321).
- Self- and Semi-Supervised Learning: Tasks including MAE pretraining in vision, DINO, wav2vec 2.0 speech modeling, and DNA sequence learning show normalization-free transformers matching or surpassing normalized counterparts (2503.10622).
- Trajectory Prediction in Autonomous Driving: DyTTP, a DyT-equipped transformer, combined with snapshot ensembling and cyclical LR scheduling, achieves improved accuracy, inference speed, and robustness on the Argoverse benchmark (2504.05356).
- Graph Learning: The PPGT approach, combining AdaRMSN with norm-aware attention, endows plain transformers with expressivity and empirical results that surpass more elaborate message-passing graph neural networks (2504.12588).
These results underscore the generality and practical impact of normalization-free transformer architectures across modalities.
4. Theoretical Analysis and Training Dynamics
A central theoretical insight is that normalization-free transformers can maintain stable forward and backward signal propagation either by architectural design (e.g., controlling the scaling of residual branches in Fixup) or by leveraging high-dimensional geometries (e.g., compactification onto the hypersphere (2505.22014)). In DyT, the tanh nonlinearity alone is sufficient to bound activations and prevent pathological accumulations.
In approximate normalization approaches, the tight concentration of vector norms in high-dimensional spaces allows for simple scalar reweighting to achieve stability, avoiding the need for elementwise or tokenwise statistics. Scaling laws derived for anGPT show that performance–compute tradeoffs are maintained, and, in some cases, hyperparameter tuning is further simplified, reducing the requirement for learning rate warm-up or weight decay (2505.22014).
The removal or relaxation of normalization also has implications for optimization and efficiency. DyT and approximate normalization yield significant reductions in per-layer latency: with DyT, layer-level latency in LLaMA 7B models drops by more than 50% compared to RMSNorm, while anTransformer introduces negligible runtime overhead at training time and none at inference (2503.10622, 2505.22014).
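A toy microbenchmark along the following lines can be used to make such per-layer comparisons; absolute numbers depend entirely on hardware, tensor shapes, dtype, and kernel fusion, and the shapes chosen here are illustrative.

```python
import time
import torch

# Toy microbenchmark of per-layer cost: RMSNorm-style vs. DyT-style operation.
# Shapes and results are illustrative only.
d = 4096
x = torch.randn(8, 2048, d)                     # (batch, tokens, hidden)
gamma, beta = torch.ones(d), torch.zeros(d)
alpha = torch.tensor(0.5)

def rmsnorm(x):
    # Requires a per-token mean, square root, and division.
    return gamma * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6)

def dyt(x):
    # Purely elementwise: no statistics, no division.
    return gamma * torch.tanh(alpha * x) + beta

def bench(fn, iters=100):
    with torch.no_grad():
        for _ in range(10):                     # warm-up
            fn(x)
        t0 = time.perf_counter()
        for _ in range(iters):
            fn(x)
        return (time.perf_counter() - t0) / iters * 1e3

print(f"RMSNorm: {bench(rmsnorm):.3f} ms/iter, DyT: {bench(dyt):.3f} ms/iter")
```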
5. Interpretability, Robustness, and Scientific Implications
Recent work has demonstrated that normalization removal at inference time, with modest fine-tuning, has little impact on language modeling loss but dramatically clarifies mechanistic interpretability (2507.02559). Removing LN from GPT-2 models using "FakeLN" blocks (frozen standard deviations) allows direct logit attribution methods to exactly quantify the effect of individual components. Confidence neurons, observed to control model output confidence via LN, become inert when normalization is removed, evidencing the causal role of the nonlinearity.
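A sketch of the "FakeLN" idea as described above: freezing the standard deviation to a fixed, input-independent constant makes the layer linear in its input, which is what allows direct logit attribution to become exact. The constant below is a placeholder that would in practice be estimated from a calibration pass, and the module name is assumed.

```python
import torch
import torch.nn as nn

class FakeLayerNorm(nn.Module):
    """'FakeLN'-style block: LayerNorm with a frozen, input-independent std.

    Sketch of the idea attributed to arXiv:2507.02559 in the text; dividing by
    a fixed scalar instead of the per-token std makes the layer linear in x.
    fixed_std is a placeholder to be estimated from calibration data.
    """
    def __init__(self, dim: int, fixed_std: float = 1.0, eps: float = 1e-5):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.bias = nn.Parameter(torch.zeros(dim))
        self.register_buffer("fixed_std", torch.tensor(fixed_std))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Center as usual (a linear operation), but divide by the frozen scalar
        # rather than the per-token standard deviation.
        x = x - x.mean(dim=-1, keepdim=True)
        return self.weight * x / (self.fixed_std + self.eps) + self.bias
```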
Conversely, pruning or perturbing even a minute subset of normalization parameters (notably outlier scaling or bias weights in LayerNorm) can degrade model performance by large margins in BERT-class models (2105.06990). This underscores the delicate interplay between normalization structure and model robustness.
Theoretical work further elucidates that normalization constrains the independence of semantic subspaces critical for certain logical circuits inside transformers. Removing normalization may allow more flexible representation geometries but could reduce out-of-distribution generalization and circuit stability, unless compensated by other design elements (2406.17837).
6. Hardware Efficiency and Practical Deployment
A variety of the discussed approaches confer substantial computational advantages. DyT, UN, and anTransformer methods all substantially reduce on-the-fly computation (by avoiding calculation of means, variances, divisions, or square roots), directly improving GPU throughput and reducing memory overhead (2208.01313, 2503.10622, 2505.22014). In some cases, the removal of LayerNorm from GPT-2 yields minimal increases in validation loss while making the architecture more amenable to interpretability or latency-critical deployments (2507.02559).
Approximate or no-normalization models can also better exploit batch size scaling, requiring fewer gradient noise scale calculations and making training hyperparameter schedules more predictable (2411.00999, 2505.22014).
7. Contemporary Perspectives and Open Directions
Normalizing transformer activations is no longer regarded as a universal prerequisite for stable training or strong generalization. The evolution of this area reflects a deeper understanding of the roles played by architectural components, empirical data about representation geometry, and the requirements for scaling to larger and more diverse datasets and modalities.
Open directions include further unifying statistical and geometric approaches to stabilization, developing theoretical frameworks for signal propagation in the absence of normalization, and deploying normalization-free designs in ever-larger, multitask, or continual learning settings. Refinements in initialization, architectural motifs such as recursive skip connections, or methods for maintaining independence of semantic subspaces may further enable the creation of more transparent, efficient, and adaptable neural architectures.
Normalization-free transformers, by moving beyond legacy constraints, offer new avenues for model design, interpretability, hardware efficiency, and theoretical analysis across domains.