Dynamic Tanh Normalization in Neural Networks
- Dynamic Tanh Normalization Module is an elementwise, bounded activation-based method that approximates LayerNorm through a learnable tanh scaling function.
- It delivers improved computational throughput, reduced latency, and enhanced hardware efficiency in models such as Transformers, vision networks, and language models.
- Careful tuning of the scaling parameter is crucial to balance speed and stability, preventing saturation and vanishing gradients in deep architectures.
The Dynamic Tanh Normalization Module, frequently referred to in modern literature as “Dynamic Tanh” or DyT, is an elementwise, bounded activation-based module introduced to replace statistical normalization operations—most notably Layer Normalization (LN)—in deep neural network architectures such as Transformers. In contrast to normalization layers that require mean and variance reductions over input data, DyT achieves normalization-like effects via a learnable, data-driven scaling of a squashing function, enabling improved computational throughput, lower latency, and hardware efficiency, while maintaining or exceeding the downstream performance of traditional normalizers in a variety of domains including computer vision, LLMs, and generative modeling (Zhu et al., 13 Mar 2025, Stollenwerk, 27 Mar 2025, Byun et al., 26 Dec 2025).
1. Theoretical Foundation and Derivation
Dynamic Tanh Normalization is derived as an approximation to Layer Normalization by analyzing the functional relationship between the two. Standard LN computes

$$\mathrm{LN}(x) = \gamma \odot \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta,$$

where $\mu$ and $\sigma^2$ denote the mean and variance, and normalization is performed across the feature/channel dimension of size $N$. Writing $y = (x - \mu)/\sigma$ for a single element (with $\epsilon$ omitted for clarity), the elementwise derivative with respect to the input is

$$\frac{\partial y}{\partial x} = \frac{1}{N\sigma}\left(N - 1 - y^2\right).$$

Approximating $\sigma$ as a constant leads to an ODE whose solution is a scaled hyperbolic tangent:

$$y(x) = \sqrt{N-1}\,\tanh\!\left(\frac{\sqrt{N-1}}{N\sigma}\,x + c\right).$$

In practice, the inner coefficient is learned as $\alpha$ and the outer factor $\sqrt{N-1}$ is typically absorbed into the affine gain, yielding the formulation

$$\mathrm{DyT}(x) = \gamma \odot \tanh(\alpha x) + \beta,$$

which constitutes the core of the DyT module (Stollenwerk, 27 Mar 2025). This reveals that DyT approximates LN under the assumption of locally constant variance and mean.
Dropping the constant-$\sigma$ approximation and rigorously integrating the ODE yields the Dynamic Inverse Square Root Unit (DyISRU), an analytic, elementwise mapping that more faithfully mirrors LN:

$$\mathrm{DyISRU}(x) = \gamma \odot \frac{x}{\sqrt{\alpha + x^2}} + \beta,$$

with $\alpha$ as a trainable or channel-dependent parameter (Stollenwerk, 27 Mar 2025).
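The elementwise LN derivative used in this derivation, $\partial y/\partial x = (N - 1 - y^2)/(N\sigma)$, is easy to verify numerically. The following NumPy sketch (where `ln` is plain per-vector standardization without the affine terms) compares the analytic expression against a central finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 8
x = rng.normal(size=N)

def ln(v):
    # Standardize across the feature dimension (population std, matching
    # sigma^2 = mean((v - mu)^2) in the derivation; epsilon omitted).
    return (v - v.mean()) / v.std()

y = ln(x)
sigma = x.std()

# Analytic elementwise derivative: (N - 1 - y_i^2) / (N * sigma)
analytic = (N - 1 - y**2) / (N * sigma)

# Central finite-difference estimate of dy_i / dx_i
eps = 1e-6
numeric = np.empty(N)
for i in range(N):
    xp = x.copy(); xp[i] += eps
    xm = x.copy(); xm[i] -= eps
    numeric[i] = (ln(xp)[i] - ln(xm)[i]) / (2 * eps)

max_err = float(np.max(np.abs(analytic - numeric)))
print(max_err)  # agreement up to finite-difference error
```

The two agree to within finite-difference precision, confirming that the tanh-shaped solution follows from treating $\sigma$ as locally constant in this expression.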
2. Formal Definition and Implementation
The standard DyT module is implemented as

$$\mathrm{DyT}(x) = \gamma \odot \tanh(\alpha x) + \beta,$$

with
- $\alpha$ a trainable scaling parameter,
- $\gamma$, $\beta$ trainable gain and bias (affine) parameters,
- $\odot$ indicating channelwise multiplication (Zhu et al., 13 Mar 2025, Byun et al., 26 Dec 2025).
Initialization and variants:
- $\alpha$ can be global (per-layer), per-channel, or per-head in multi-head attention,
- Layer-specific tuning of $\alpha_0$ is necessary in deeper transformer regimes (e.g., $\alpha_0$ is reduced as width or depth increases for LLMs),
- The default $\alpha_0 = 0.5$ is robust for vision/speech domains, while LLMs require lower values ($\alpha_0$ decreasing with increasing model size).
A PyTorch-style implementation is as follows:
```python
import torch
import torch.nn as nn

class DyT(nn.Module):
    def __init__(self, dim, alpha_init=0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1) * alpha_init)
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x):
        # Elementwise squashing with a learnable scale, then affine transform
        z = torch.tanh(self.alpha * x)
        return self.gamma * z + self.beta
```
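As a dependency-free sanity check, the forward pass can be mirrored in NumPy. Unlike LN, whose outputs are unbounded, DyT's outputs are confined to an interval determined by the affine parameters, since $|\tanh| < 1$:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, batch = 16, 4
alpha, gamma, beta = 0.5, np.ones(dim), np.zeros(dim)

x = 100.0 * rng.normal(size=(batch, dim))  # deliberately extreme inputs
y = gamma * np.tanh(alpha * x) + beta      # DyT forward pass

print(y.shape)                             # (4, 16)
bounded = bool((np.abs(y) < np.abs(gamma) + np.abs(beta) + 1e-12).all())
print(bounded)                             # True: |tanh| < 1 bounds the output
```

This boundedness is what removes the need for reduction operations, but it is also the root of the saturation issues discussed in Section 5.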
3. Integration in Modern Neural Architectures
In deep architectures such as transformers, DyT is designed as a drop-in replacement for all instances of LN (and RMSNorm in some contexts). The integration is elementwise and preserves the residual block structure:
- Pre-norm regime:
```python
y1 = x + Attention(DyT(x))
y2 = y1 + MLP(DyT(y1))
```
- Post-norm and final-layer variants are supported analogously.
- No other modifications to optimizer, learning rate schedule, or residual scaling are required.
- Embedding-scale reparameterization (a learnable scale applied to the embedding output) is recommended for strong stability in LLMs (Zhu et al., 13 Mar 2025, Byun et al., 26 Dec 2025).
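The pre-norm wiring can be sketched end-to-end in NumPy. The single matrix multiplies below are hypothetical stand-ins for real `Attention` and `MLP` sublayers, used only to show the residual structure around DyT:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 8

def dyt(x, alpha=0.5):
    # DyT with gamma = 1, beta = 0 for brevity
    return np.tanh(alpha * x)

W_attn = rng.normal(scale=0.1, size=(dim, dim))  # stand-in for Attention
W_mlp = rng.normal(scale=0.1, size=(dim, dim))   # stand-in for MLP

x = rng.normal(size=(4, dim))
y1 = x + dyt(x) @ W_attn   # y1 = x + Attention(DyT(x))
y2 = y1 + dyt(y1) @ W_mlp  # y2 = y1 + MLP(DyT(y1))
print(y2.shape)            # (4, 8)
```

Note that only the normalization call changes relative to a standard pre-norm block; the residual paths carry raw, unsquashed activations exactly as with LN.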
In generative adversarial networks (GANs), Dynamic Tanh Normalization can be composed as “BN→tanh” or “BN→clip” at the generator output to pre-shape the distribution, with empirically verified acceleration of early convergence and improved histogram matching (Mullery et al., 2018).
4. Empirical Performance and Computational Efficiency
Extensive empirical results demonstrate the efficacy of DyT as a normalization alternative:
- On ImageNet-1k, replacing LN with DyT in ViT-B improves accuracy (82.3%→82.5%); similar improvements or parity are observed for ConvNeXt and large ViT variants.
- In LLaMA LLMs, DyT matches LN performance in zero-shot metrics and training loss (e.g., 7B: zero-shot 0.513, loss 1.59 with LN vs. 0.513, 1.60 with DyT).
- Self-supervised (MAE/DINO), diffusion (DiT), wav2vec2.0, and genomics tasks also exhibit robust transferability (Zhu et al., 13 Mar 2025).
- Training/inference speed improves substantially due to removal of reduction operations:
- In LLaMA-7B, inference is measurably faster for the full model, with larger relative speedups within the normalization layers themselves.
- DyT is the fastest among tested normalization or normalization-free schemes, though stability at extreme scale can be an issue (Byun et al., 26 Dec 2025).
Ablations indicate that:
- A bounded, saturating nonlinearity is essential: replacing tanh with the identity causes training collapse, while substitutes such as hardtanh or sigmoid remain stable but reduce performance.
- The learnable scaling parameter $\alpha$ is crucial: omitting it yields noncompetitive models (Zhu et al., 13 Mar 2025).
5. Stability, Limitations, and Alternatives
While DyT provides significant efficiency gains, it introduces certain challenges:
- Lack of explicit mean/variance control: When activations drift, the pre-activations $\alpha x$ can enter the saturating regime of $\tanh$, leading to vanishing gradients and potential training collapse in deep or high-learning-rate scenarios ("curse of depth").
- Fragility increases with model width or depth; careful tuning of $\alpha_0$ or the learning rate is required (Byun et al., 26 Dec 2025).
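The failure mode follows directly from the DyT gradient, $\alpha\,(1 - \tanh^2(\alpha x))$: once activations drift into the saturated region, the elementwise gradient collapses by many orders of magnitude. A minimal illustration:

```python
import numpy as np

def dyt_grad(x, alpha):
    # d/dx [tanh(alpha * x)] = alpha * (1 - tanh(alpha * x)**2)
    t = np.tanh(alpha * x)
    return alpha * (1.0 - t**2)

alpha = 0.5
print(dyt_grad(1.0, alpha))   # ~0.39: healthy gradient in the linear regime
print(dyt_grad(20.0, alpha))  # ~4e-9: vanishing gradient once saturated
```

LN avoids this by re-centering and re-scaling every step, which is precisely the reduction work DyT removes; hence the emphasis on keeping $\alpha x$ inside the non-saturating range.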
Alternatives such as Bounded Hyperbolic Tanh (BHyT) (Byun et al., 26 Dec 2025) and HoloNorm (Yongueng et al., 13 Nov 2025) have been proposed to address these shortcomings:
- BHyT dynamically rescales inputs into a bounded interval using per-block statistics, ensuring most activations remain in the non-saturating range via Chebyshev-bound guarantees. It combines a single variance reduction per block with a fast variance approximation, achieving both theoretical stability and high throughput in LLM pretraining.
- HoloNorm retains global geometry by mapping vectors into the open unit ball, preserving orthogonality and direction and overcoming the spurious correlations and distortion induced by componentwise tanh normalization (Yongueng et al., 13 Nov 2025).
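The Chebyshev-style guarantee cited for BHyT can be illustrated in isolation. The sketch below shows the principle only (a plain variance-based rescaling), not the BHyT algorithm itself, whose exact per-block rescaling and variance approximation differ:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_t(df=3, size=100_000)  # heavy-tailed stand-in for activations

k = 3.0
z = (x - x.mean()) / (k * x.std())      # rescale by k standard deviations
frac_saturating = float(np.mean(np.abs(z) > 1.0))
print(frac_saturating)                  # well below the Chebyshev bound 1/k**2
```

By Chebyshev's inequality, at most $1/k^2$ of the rescaled activations can exceed magnitude 1 (the region where tanh begins to saturate), regardless of the activation distribution.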
6. Best Practices and Practical Considerations
A set of empirically validated guidelines for deploying DyT includes:
- Initialization: set $\alpha_0 \approx 1/\sigma_x$ for stability, where $\sigma_x$ is the pre-normalization standard deviation; for GANs or image generators, optionally pre-match the affine parameters to the target output distribution's mean and std (Mullery et al., 2018).
- Learning dynamics: It is beneficial to freeze $\alpha$ (or the corresponding parameter in DyISRU) early in training, then unfreeze it. Monitor for collapse by ensuring these parameters remain in a nontrivial range.
- Model width/depth scaling: Lower $\alpha_0$ is required as model size increases. If instability or divergence is observed, reducing either $\alpha_0$ or the learning rate is a first remedy (Zhu et al., 13 Mar 2025).
- Limitations: DyT is not an effective drop-in replacement for batch normalization in spatial convolutional networks (e.g., ResNet-50, VGG19), where it yields marked accuracy drops.
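The initialization rule can be checked numerically: with $\alpha_0 = 1/\sigma_x$, $\tanh(\alpha_0 x)$ tracks the standardized (LN-style) output in the bulk of the activation distribution, diverging only in the tanh's saturating tails. A NumPy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=2.0, size=10_000)

mu, sigma = x.mean(), x.std()
ln_out = (x - mu) / sigma        # LN-style standardization (gamma=1, beta=0)
dyt_out = np.tanh(x / sigma)     # DyT with alpha_0 = 1 / sigma

bulk = np.abs(x - mu) < sigma    # the non-saturating bulk of the distribution
err = float(np.max(np.abs(ln_out[bulk] - dyt_out[bulk])))
print(err)                       # bounded by |z - tanh(z)| at |z| = 1, roughly 0.25
```

The gap grows only in the tails, where DyT's squashing deliberately departs from LN, which is why a too-large $\alpha_0$ pushes typical activations into saturation.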
DyT and its extensions can be plugged into any neural architecture with LN or RMSNorm as a direct drop-in, and their efficiency-stability trade-offs make them well-suited for large-scale applications where hardware throughput and regularization are critical.
7. Comparative Summary
| Method | Reduction Ops | Theoretical Stability | Efficiency | Preserves Geometry |
|---|---|---|---|---|
| LayerNorm | 2 per block | High | Moderate | Yes |
| RMSNorm | 1 per sublayer | Moderate | Moderate | Yes |
| DyT | 0 | Low (deep/LLMs) | Highest | No (componentwise) |
| DyISRU | 0 | Approx. equal to LN | High | No |
| BHyT | 1 per block | High | High | No (componentwise) |
| HoloNorm | 0 | High | High | Yes (norm-aware) |
BHyT is the practical recommendation for large-depth, high-throughput scenarios demanding both speed and numerical stability, while HoloNorm addresses geometric deficiencies of tanh-based normalizers (Yongueng et al., 13 Nov 2025, Byun et al., 26 Dec 2025, Stollenwerk, 27 Mar 2025).
The Dynamic Tanh Normalization Module represents a central element in the current shift toward normalization-free or normalization-efficient network designs, with systematic analysis showing that its formal derivation from LN enables flexible trade-offs in accuracy, speed, and numerical robustness, all substantiated in contemporary empirical literature (Zhu et al., 13 Mar 2025, Stollenwerk, 27 Mar 2025, Byun et al., 26 Dec 2025, Yongueng et al., 13 Nov 2025, Mullery et al., 2018).