
Transformers without Normalization (2503.10622v2)

Published 13 Mar 2025 in cs.LG, cs.AI, cs.CL, and cs.CV

Abstract: Normalization layers are ubiquitous in modern neural networks and have long been considered essential. This work demonstrates that Transformers without normalization can achieve the same or better performance using a remarkably simple technique. We introduce Dynamic Tanh (DyT), an element-wise operation $\mathrm{DyT}(x) = \tanh(\alpha x)$, as a drop-in replacement for normalization layers in Transformers. DyT is inspired by the observation that layer normalization in Transformers often produces tanh-like, $S$-shaped input-output mappings. By incorporating DyT, Transformers without normalization can match or exceed the performance of their normalized counterparts, mostly without hyperparameter tuning. We validate the effectiveness of Transformers with DyT across diverse settings, ranging from recognition to generation, supervised to self-supervised learning, and computer vision to LLMs. These findings challenge the conventional understanding that normalization layers are indispensable in modern neural networks, and offer new insights into their role in deep networks.

Summary

  • The paper introduces Dynamic Tanh (DyT) as a novel element-wise operation that replaces traditional normalization layers in Transformer architectures.
  • It demonstrates that DyT achieves comparable or improved results across vision, language, speech, and genomics with minimal hyperparameter tuning.
  • The approach simplifies computation by eliminating cross-token statistics while preserving training stability and enhancing throughput.

The work "Transformers without Normalization" (2503.10622) investigates the necessity of normalization layers, such as Layer Normalization (LN) or Root Mean Square Normalization (RMSNorm), within Transformer architectures. It proposes a simple, element-wise operation called Dynamic Tanh (DyT) as a direct replacement, demonstrating that Transformers equipped with DyT can achieve performance comparable or superior to their normalized counterparts across a variety of tasks and modalities.

Motivation: Observed Behavior of Layer Normalization

The development of DyT stems from an empirical analysis of the input-output behavior of Layer Normalization layers within pre-trained, high-performance Transformer models, including Vision Transformer (ViT), wav2vec 2.0, and Diffusion Transformer (DiT). The paper observes that, despite LN operating linearly on each token's feature vector (by standardizing it and applying an affine transformation), the collective effect across all tokens and channels often approximates a scaled hyperbolic tangent (tanh) function. Specifically, the scatter plot of output values against input values for LN layers frequently forms an $S$-shaped curve. This suggests that a primary function of LN in these contexts might be to non-linearly squash activation values, particularly those with large magnitudes, while preserving values near zero in a roughly linear regime. This observation motivates the exploration of whether a simpler, explicit tanh-based function could replicate this effect and suffice for stable training and high performance, thereby eliminating the need for computing activation statistics across tokens.
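As a minimal sketch (not the authors' analysis code) of how this observation might be reproduced, one can hook an LN layer of a pretrained ViT (loaded here via timm, an assumed dependency) and scatter-plot the layer's per-element inputs against its outputs; real ImageNet inputs are needed to see the published curves, and the chosen block index is arbitrary.

import torch
import timm
import matplotlib.pyplot as plt

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

captured = {}

def hook(module, inputs, output):
    # Both tensors have shape (batch, tokens, dim); flatten for a scatter plot
    captured["x"] = inputs[0].detach().flatten()
    captured["y"] = output.detach().flatten()

# Inspect the LN before the MLP of a mid-depth block (index chosen arbitrarily)
model.blocks[6].norm2.register_forward_hook(hook)

# Random input keeps this sketch self-contained; a real image is needed to
# reproduce the paper's S-shaped curves.
with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))

plt.scatter(captured["x"].numpy(), captured["y"].numpy(), s=1, alpha=0.3)
plt.xlabel("LN input")
plt.ylabel("LN output")
plt.title("Input-output mapping of one LN layer (often tanh-like)")
plt.show()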

Dynamic Tanh (DyT) Formulation

Based on the observed behavior of LN, Dynamic Tanh (DyT) is introduced as an element-wise activation function intended to replace normalization layers. Its mathematical formulation is given by:

$$\mathrm{DyT}(\mathbf{x}) = \gamma \odot \tanh(\alpha \cdot \mathbf{x}) + \beta$$

Where:

  • $\mathbf{x}$ is an input tensor element or vector.
  • $\alpha$ is a learnable scalar parameter. This parameter adaptively scales the input before the tanh function is applied. It learns a global scaling factor for the activations entering the layer across all tokens and channels. Analysis indicates that $\alpha$ often learns values correlated with the inverse standard deviation ($1/\text{std}$) of the input activations, effectively performing a form of implicit, collective normalization. Its dynamic, learnable nature allows it to adapt throughout training and across different layers.
  • $\tanh(\cdot)$ is the hyperbolic tangent function, providing the characteristic $S$-shaped non-linearity that squashes large positive or negative values towards +1 or -1, respectively. This mimics the observed squashing effect of LN on outlier activations.
  • $\gamma$ and $\beta$ are learnable per-channel vectors (or scalars, depending on implementation convention), representing the standard scale and shift affine parameters found in conventional normalization layers like LN, BatchNorm, or RMSNorm. They provide channel-wise modulation of the tanh output, preserving representational capacity.

Crucially, DyT operates purely element-wise. Unlike LN or RMSNorm, it does not require calculating statistics (mean, variance, or root mean square) across any dimension (e.g., the feature dimension) of the input tensor during the forward pass. This characteristic simplifies the computation and potentially improves hardware efficiency.
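To make this contrast concrete, the short sketch below (illustrative, not from the paper) compares what LayerNorm computes, namely reductions over the feature dimension, with the purely pointwise DyT map applied to the same tensor.

import torch
import torch.nn as nn

x = torch.randn(2, 4, 8)  # (batch, tokens, dim)

# LayerNorm: requires mean/variance reductions over the last (feature) dimension
ln = nn.LayerNorm(8)
mean = x.mean(dim=-1, keepdim=True)
var = x.var(dim=-1, unbiased=False, keepdim=True)
ln_manual = (x - mean) / torch.sqrt(var + ln.eps) * ln.weight + ln.bias
assert torch.allclose(ln(x), ln_manual, atol=1e-5)

# DyT: the same scalar map is applied to every element; no reduction is needed
alpha, gamma, beta = torch.tensor(0.5), torch.ones(8), torch.zeros(8)
dyt_out = gamma * torch.tanh(alpha * x) + beta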

Implementation as a Drop-in Replacement

DyT is designed to be a straightforward "drop-in" replacement for existing normalization layers within standard Transformer blocks. In a typical Transformer architecture, LN or RMSNorm layers are usually placed:

  1. Before the multi-head self-attention (MHSA) module.
  2. Before the feed-forward network (FFN) module (specifically, before the first linear layer).
  3. After the final Transformer block, before the output projection head (the final normalization layer).

To implement a normalization-free Transformer using DyT, one simply substitutes each LN or RMSNorm layer with a DyT layer at these precise locations. The rest of the architecture, including the MHSA, FFN structure, activation functions (e.g., GELU, SiLU), positional embeddings, and residual connections, remains unchanged.
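As an illustration of this substitution (a sketch under the assumption that the model is built from standard nn.LayerNorm modules, not code from the paper), one could recursively swap each LayerNorm for a DyT layer, using the DyT class defined in the reference implementation below; the resulting model still needs to be trained from scratch, since DyT introduces new parameters rather than reusing LN weights.

import torch.nn as nn

def replace_layernorm_with_dyt(module: nn.Module, alpha_init: float = 0.5) -> nn.Module:
    """Recursively swap every nn.LayerNorm in `module` for a DyT layer
    (DyT as defined in the reference implementation below)."""
    for name, child in module.named_children():
        if isinstance(child, nn.LayerNorm):
            setattr(module, name, DyT(child.normalized_shape[-1], alpha_init=alpha_init))
        else:
            replace_layernorm_with_dyt(child, alpha_init)
    return module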

A significant practical advantage highlighted is that the original hyperparameters used for training the baseline normalized models often transfer directly to the DyT variants without requiring extensive tuning. This includes learning rates, weight decay, optimizers (e.g., AdamW), and schedulers. However, specific tuning, particularly of the initialization of the learnable scalar $\alpha$ (denoted $\alpha_0$), can sometimes yield further improvements, as demonstrated in the context of LLMs. For LLMs, initializing $\alpha$ with smaller values (e.g., 0.1 or 0.2 instead of the default 0.5) for wider models, and potentially using different initial values for attention versus FFN blocks, was found to be beneficial. Additionally, for LLaMA models using DyT, an extra learnable scalar applied element-wise after the token and positional embeddings was added to stabilize initial training dynamics.

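A reference implementation of DyT and its use as a pre-norm layer in a simplified Transformer block follows (the attention and FFN modules are placeholders for illustration, not a faithful reproduction of any specific architecture):
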
import torch
import torch.nn as nn


class DyT(nn.Module):
    def __init__(self, dim, eps=1e-5, alpha_init=0.5):
        """
        dim: feature dimension (size of the last axis of the input)
        eps: unused by DyT; kept only for constructor parity with LN/RMSNorm
        alpha_init: initial value for the learnable scalar alpha
        """
        super().__init__()
        self.dim = dim
        self.eps = eps  # not needed in the forward pass (no variance is computed)

        # Learnable scalar alpha, shared across all tokens and channels
        self.alpha = nn.Parameter(torch.tensor(alpha_init))

        # Learnable per-channel scale (gamma) and shift (beta)
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        """
        Input x: tensor of shape (..., dim)
        """
        # Compute in float32 for stability when training in mixed precision
        input_dtype = x.dtype
        if x.dtype in (torch.float16, torch.bfloat16):
            x = x.float()

        # DyT transformation: gamma * tanh(alpha * x) + beta
        # alpha is a scalar; gamma and beta broadcast over the last dimension
        out = self.gamma * torch.tanh(self.alpha * x) + self.beta

        return out.to(input_dtype)


class TransformerBlock(nn.Module):
    def __init__(self, dim, num_heads, mlp_ratio=4.0, norm_layer=DyT):
        super().__init__()
        self.norm1 = norm_layer(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)  # placeholder attention
        self.norm2 = norm_layer(dim)
        mlp_hidden_dim = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(  # placeholder FFN
            nn.Linear(dim, mlp_hidden_dim),
            nn.GELU(),
            nn.Linear(mlp_hidden_dim, dim),
        )

    def forward(self, x):
        # Pre-normalization structure typical of ViT/LLaMA:
        # the DyT (or LN) layer is applied before each sub-block
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        return x
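
For the LLM setting discussed earlier, where smaller and block-dependent $\alpha_0$ values plus an extra post-embedding scalar were reported to help, the following is a hedged sketch of how such a configuration might be expressed; the specific numbers and the ScaledEmbedding helper are illustrative assumptions consistent with the description above, not the paper's exact published recipe.

import torch
import torch.nn as nn

# Illustrative only: width/block-dependent alpha_0 for DyT in a LLaMA-style model.
# The values (0.2 for attention norms, 0.1 elsewhere in wide models) follow the
# spirit of "smaller alpha_0 for wider models, different for attention vs. FFN",
# not an exact published table.
def make_dyt(dim, is_attention_norm, wide_model=True):
    if wide_model:
        alpha_init = 0.2 if is_attention_norm else 0.1
    else:
        alpha_init = 0.5  # default that worked well for most vision/speech models
    return DyT(dim, alpha_init=alpha_init)  # DyT class from the block above

class ScaledEmbedding(nn.Module):
    """One plausible form of the extra learnable scalar applied after the embeddings."""
    def __init__(self, embedding: nn.Module):
        super().__init__()
        self.embedding = embedding
        self.scale = nn.Parameter(torch.ones(()))  # learnable scalar, initialized to 1

    def forward(self, token_ids):
        return self.scale * self.embedding(token_ids)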

Experimental Validation Across Diverse Settings

The effectiveness of replacing normalization layers with DyT was extensively validated across a wide range of tasks, model architectures, and learning paradigms:

  • Supervised Image Recognition (ImageNet-1K): Using ViT-B/L and ConvNeXt-B/L, DyT achieved comparable or slightly improved top-1 accuracy compared to standard LN baselines (e.g., ViT-B: 83.2% DyT vs 83.1% LN; ViT-L: 85.8% DyT vs 85.7% LN). Training curves closely mirrored those of the LN versions.
  • Self-Supervised Vision Pretraining (ImageNet-1K): With Masked Autoencoders (MAE) using ViT-B/L and DINO using ViT-B, DyT-based models performed on par with LN counterparts in terms of linear probing and fine-tuning accuracy after pretraining (e.g., MAE ViT-L fine-tuning: 86.7% DyT vs 86.7% LN).
  • Generative Vision Models (ImageNet-1K 256x256): In Diffusion Transformers (DiT-B/L/XL), replacing AdaLN-Zero (a conditional variant of LN) with a conditional DyT variant yielded comparable or slightly better Fréchet Inception Distance (FID) scores (e.g., DiT-XL/2: 2.13 FID DyT vs 2.27 FID AdaLN-Zero).
  • LLMs: Pretraining LLaMA models (7B, 13B, 34B, 70B) on 200B tokens from The Pile dataset showed that DyT achieved nearly identical final training loss compared to the standard RMSNorm. Zero-shot evaluations on 15 downstream tasks also showed comparable average performance (e.g., LLaMA 7B: 44.5 avg score DyT vs 44.4 avg score RMSNorm). This required careful α initialization and the addition of a learnable scalar after embeddings.
  • Self-Supervised Speech Pretraining (LibriSpeech): For wav2vec 2.0 Base and Large models, replacing LN with DyT resulted in comparable validation loss during pretraining.
  • Genomics Sequence Modeling: In models like HyenaDNA and Caduceus pretrained on the human genome, DyT achieved performance comparable to LN on downstream tasks from the GenomicBenchmarks suite.

These results consistently show that DyT enables normalization-free training matching strong, established baselines across modalities and scales, often without significant hyperparameter adjustments beyond potential α initialization tuning in specific cases like LLMs.

Analysis, Ablations, and Efficiency

Further analysis provides insights into DyT's behavior and benefits:

  • Computational Efficiency: DyT, being element-wise, avoids the computation and communication overhead associated with calculating statistics (mean, variance, RMS) across the feature dimension required by LN and RMSNorm. Benchmarks on A100 GPUs showed measurable improvements in training throughput (e.g., +2-4% for ViT-L, +6-9% for LLaMA-7B/13B) and inference latency compared to RMSNorm; a rough timing sketch follows this list.
  • Ablation Studies: The importance of both the tanh non-linearity and the learnable scalar α was confirmed through ablations.
    • Replacing tanh(α * x) with just α * x (i.e., removing the squashing) led to training instability or divergence in ViT models.
    • Fixing α to its initial value (e.g., α=0.5) instead of learning it resulted in degraded performance compared to the full DyT model, highlighting the importance of adaptive scaling.
  • Role and Behavior of α: The learned scalar α was observed to correlate strongly with the inverse standard deviation (1/std) of the input activations x across different layers and training stages. This suggests α implicitly captures activation magnitude information, allowing tanh(α * x) to adaptively normalize the input to the tanh function's effective range. While the default initialization α_0 = 0.5 proved robust for most vision and speech models, LLMs benefited from tailored initialization strategies based on model width and layer type (attention vs. FFN).
  • Comparison to Other Methods: DyT was compared against other normalization-free techniques like Fixup, SkipInit, and σReparam on ViT-B ImageNet classification. DyT significantly outperformed these alternatives, demonstrating superior stability and performance in this context.
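
As a complement to the efficiency point above, the following is a minimal micro-benchmark sketch (not the paper's benchmarking setup) that times a single RMSNorm layer against the DyT layer defined in the code block earlier; nn.RMSNorm is assumed to be available (recent PyTorch versions), and results will vary with hardware, precision, and kernel fusion.

import time
import torch
import torch.nn as nn

# Requires the DyT class defined earlier; nn.RMSNorm needs a recent PyTorch (2.4+).
device = "cuda" if torch.cuda.is_available() else "cpu"
dim, tokens, batch = 4096, 2048, 8
x = torch.randn(batch, tokens, dim, device=device)

rmsnorm = nn.RMSNorm(dim).to(device)
dyt = DyT(dim).to(device)

@torch.no_grad()
def avg_latency(layer, iters=100):
    # Warm-up, then average the forward latency over `iters` passes
    for _ in range(10):
        layer(x)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        layer(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

print(f"RMSNorm: {avg_latency(rmsnorm) * 1e3:.3f} ms  DyT: {avg_latency(dyt) * 1e3:.3f} ms")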

Conclusion

The research on Transformers without normalization via Dynamic Tanh (2503.10622) challenges the assumption that normalization layers are indispensable components of modern Transformer architectures. By empirically motivating and introducing the simple, element-wise DyT operation ($\mathrm{DyT}(\mathbf{x}) = \gamma \odot \tanh(\alpha \cdot \mathbf{x}) + \beta$), the work demonstrates that it can effectively replace LN or RMSNorm layers. Extensive experiments confirm that DyT-equipped Transformers achieve comparable or sometimes slightly better performance than their normalized counterparts across diverse applications, including vision, language, speech, and genomics, often with minimal hyperparameter tuning. Additionally, DyT offers potential computational efficiency benefits due to its simpler, element-wise nature. These findings suggest DyT is a viable and potentially advantageous alternative for building and training Transformer models.
