RescaleNorm in Transformer Models

Updated 22 June 2026

RescaleNorm is a family of normalization techniques based on RMSNorm that scales activations to improve transformer model efficiency.
Static variants like SLaNC use precomputed scaling factors to mitigate FP16 numerical issues while reducing computational overhead.
Dynamic variants such as SeeDNorm adapt scaling per input, enhancing convergence, zero-shot accuracy, and overall model performance.

RescaleNorm denotes a family of normalization techniques, particularly Root Mean Square Normalization (RMSNorm), that normalize activations by scaling them according to their root mean square, with or without recentering. In transformer architectures, especially Pre-LN variants, RescaleNorm provides a computationally efficient alternative to Layer Normalization (LayerNorm), preserving many of its desirable properties while reducing both arithmetic complexity and training and inference cost. Recent work has further extended RescaleNorm to statically and dynamically rescaled schemes for hardware efficiency and adaptive representational capacity.

1. Mathematical Formulation of RescaleNorm

Let $x \in \mathbb{R}^d$ be the input vector and $\varepsilon > 0$ a smoothing constant. RMSNorm divides $x$ by the root mean square of its entries:

$\mathrm{RMSNorm}(x) = \frac{x}{\sqrt{ \|x\|_2^2/d + \varepsilon }}$

Component-wise,

$\mathrm{RMSNorm}(x)_i = \frac{x_i}{ \sqrt{ \frac{1}{d} \sum_{j=1}^d x_j^2 + \varepsilon } }$

Unlike LayerNorm, RMSNorm omits centering ( $x - \mu(x)$ ) and focuses solely on scale normalization. In Pre-LN Transformers, the zero mean is often made redundant due to subsequent operations; thus, RescaleNorm reduces computational overhead without loss of model expressivity when recentering is properly handled (Jiang et al., 2023).
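The formula above can be illustrated with a minimal sketch in plain Python. The function name `rms_norm` and the default `eps` value are choices made for this example, not taken from any of the cited papers, and learnable gain parameters are left out.

```python
import math

def rms_norm(x, eps=1e-6):
    """Divide x by the root mean square of its entries (no mean subtraction)."""
    d = len(x)
    mean_square = sum(v * v for v in x) / d
    scale = math.sqrt(mean_square + eps)
    return [v / scale for v in x]

x = [1.0, -2.0, 3.0, -4.0]
y = rms_norm(x)

# The mean square of the output is close to 1 (exactly 1 only when eps = 0).
output_mean_square = sum(v * v for v in y) / len(y)
```

Because every entry is divided by the same positive scalar, the direction of `x` and the signs of its entries are preserved; only the magnitude changes.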

2. Equivalence and Conversion: LayerNorm, RMSNorm, and CRMSNorm

LayerNorm for $x \in \mathbb{R}^d$ is defined as

$\mathrm{LayerNorm}(x) = \frac{x - \mu(x)}{\sqrt{ \operatorname{Var}(x) + \varepsilon }}$

with $\mu(x) = \frac{1}{d} \sum x_i$ and $\operatorname{Var}(x) = \frac{1}{d}\|x\|_2^2 - \mu(x)^2$ . RMSNorm is equivalent to LayerNorm on zero-mean inputs.
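The equivalence on zero-mean inputs follows directly from the two definitions. Setting $\mu(x) = 0$ gives:

$\operatorname{Var}(x) = \frac{1}{d}\|x\|_2^2 - 0^2 = \frac{1}{d}\|x\|_2^2$

$\mathrm{LayerNorm}(x) = \frac{x - 0}{\sqrt{ \|x\|_2^2/d + \varepsilon }} = \mathrm{RMSNorm}(x)$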

Pre-LN Transformers can be systematically converted to Pre-RMSNorm by:

Recentering block inputs: $x \leftarrow x - \mu(x)$
Modifying residual-projection parameters to maintain overall function
Replacing all LayerNorm layers with RMSNorm on recentered vectors
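The conversion steps above can be checked numerically with a small sketch. The helper `recenter_linear` is hypothetical: it shows one way to reparametrize a linear layer so that its outputs are always zero-mean, which is assumed here to be the intent of the second step rather than quoted from Jiang et al. Affine gain and bias parameters of the normalization layers are omitted.

```python
import math

def layer_norm(x, eps=1e-6):
    """LayerNorm without affine parameters: recenter, then divide by the std."""
    d = len(x)
    mu = sum(x) / d
    var = sum(v * v for v in x) / d - mu * mu
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    """RMSNorm: divide by the root mean square, no recentering."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]

def recenter(x):
    """Subtract the mean so the vector sums to zero."""
    mu = sum(x) / len(x)
    return [v - mu for v in x]

def linear(W, b, x):
    """y = W x + b, with W stored as a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def recenter_linear(W, b):
    """Hypothetical helper: remove the mean over the output dimension from
    every column of W and from b, so that W x + b is zero-mean for any x."""
    out_dim = len(W)
    col_means = [sum(row[j] for row in W) / out_dim for j in range(len(W[0]))]
    W_c = [[row[j] - col_means[j] for j in range(len(row))] for row in W]
    b_mean = sum(b) / out_dim
    return W_c, [v - b_mean for v in b]

x = [0.5, 2.0, -1.0, 3.5]

# Steps 1 and 3: LayerNorm(x) matches RMSNorm applied to the recentered input.
diff_norm = max(abs(a - c) for a, c in zip(layer_norm(x), rms_norm(recenter(x))))

# Step 2: the reparametrized layer emits the recentered output directly.
W = [[1.0, 2.0, 0.0, -1.0], [0.5, -1.5, 2.0, 1.0],
     [3.0, 0.0, 1.0, 2.0], [-2.0, 1.0, 1.0, 0.5]]
b = [0.1, -0.2, 0.3, 0.4]
W_c, b_c = recenter_linear(W, b)
y_c = linear(W_c, b_c, x)
diff_linear = max(abs(a - c) for a, c in zip(y_c, recenter(linear(W, b, x))))
```

Both differences stay at floating-point rounding level, which is the sense in which the converted network computes the same function.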

Compressed RMSNorm (CRMSNorm) further exploits the fact that zero-mean vectors lie in a $(d-1)$-dimensional subspace, losslessly compressing activations by discarding one coordinate and restoring it when needed. For a compressed vector $x \in \mathbb{R}^{d-1}$, the CRMSNorm formula becomes:

$\mathrm{CRMSNorm}(x) = \frac{x}{\sqrt{ \left( \sum_{i=1}^{d-1} x_i^2 + \left( \sum_{i=1}^{d-1} x_i \right)^2 \right) / d + \varepsilon }}$

Theoretical results establish arithmetic equivalence: for any Pre-LN Transformer, there exists a Pre-RMSNorm and Pre-CRMSNorm counterpart realizing the same network function (Jiang et al., 2023).
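A minimal sketch of the compression idea follows. It assumes that the discarded coordinate is the last one and that it is recovered as minus the sum of the remaining coordinates, which is what the zero-mean property implies; the function names are chosen for this example only.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMSNorm on the full d-dimensional vector."""
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]

def compress(x_zero_mean):
    """Drop the last coordinate of a zero-mean vector."""
    return x_zero_mean[:-1]

def decompress(x_c):
    """Restore the dropped coordinate: it equals minus the sum of the rest."""
    return x_c + [-sum(x_c)]

def crms_norm(x_c, eps=1e-6):
    """Normalize a compressed (d-1)-dimensional vector as if it were full."""
    d = len(x_c) + 1
    total = sum(x_c)
    # Squared norm of the full vector: kept coordinates plus the dropped one.
    squared_norm = sum(v * v for v in x_c) + total * total
    scale = math.sqrt(squared_norm / d + eps)
    return [v / scale for v in x_c]

x0 = [-0.75, 0.75, -2.25, 2.25]          # zero-mean input, d = 4
full = rms_norm(x0)                      # normalize, all 4 coordinates
restored = decompress(crms_norm(compress(x0)))
max_diff = max(abs(a - c) for a, c in zip(full, restored))
```

Normalizing the compressed vector and then restoring the missing coordinate gives the same result as normalizing the full vector, up to rounding.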

3. Static and Dynamic RescaleNorm Variants

Static RescaleNorm (SLaNC): To address numerical issues of LayerNorm (underflow or overflow in low-precision FP16 hardware due to variance calculation), Static LayerNorm Calibration (SLaNC) computes a static rescaling factor based solely on the preceding linear weights:

The static scale (denoted $s$ here) is computed offline from the preceding linear weights. For standard MLPs it is derived from a norm of the relevant weight matrix; for attention and Llama-style MLPs, analogous operator norms are used.
At inference, inputs are divided by $s$, $\varepsilon$ is rescaled as $\varepsilon / s^2$ so that the output is unchanged, and LayerNorm proceeds unmodified.

This procedure keeps the sum of squares within the numerically safe range of FP16, with no accuracy loss and no runtime overhead beyond one division and the precomputed $\varepsilon / s^2$ (Salmani et al., 2024).
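The effect of a static input scale can be illustrated with a small sketch. Several things here are assumptions made for illustration: the scale `s` is a hand-picked value rather than the weight-derived SLaNC factor, the smoothing constant is rescaled by `1 / s**2` because that is what keeps the LayerNorm output mathematically unchanged, and all arithmetic runs in ordinary Python floats, with the FP16 limit used only as a threshold for comparison.

```python
import math

FP16_MAX = 65504.0  # largest finite value representable in IEEE half precision

def layer_norm(x, eps):
    """LayerNorm without affine parameters."""
    d = len(x)
    mu = sum(x) / d
    var = sum(v * v for v in x) / d - mu * mu
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def sum_of_squares(x):
    return sum(v * v for v in x)

x = [300.0, -250.0, 120.0, -150.0]
eps = 1e-5
s = 16.0  # hypothetical hand-picked scale, NOT the SLaNC formula

x_scaled = [v / s for v in x]
eps_scaled = eps / (s * s)  # rescaled smoothing constant (assumed form)

# The raw sum of squares exceeds the FP16 range; the scaled one does not.
overflows_before = sum_of_squares(x) > FP16_MAX
overflows_after = sum_of_squares(x_scaled) > FP16_MAX

# The normalized output is the same with and without the static scale.
max_diff = max(abs(a - c) for a, c in
               zip(layer_norm(x, eps), layer_norm(x_scaled, eps_scaled)))
```

The point of the sketch is that dividing the input by a constant changes the intermediate magnitudes but not the result, provided the smoothing constant is adjusted accordingly.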

Dynamic RescaleNorm (SeeDNorm): SeeDNorm introduces a data-dependent scaling factor that is computed from the input itself through learnable parameters, with an optional bias term. The dynamic scaling adapts per input, preserving original input norm information and facilitating improved generalization under distribution shift and in zero-shot scenarios (Cai et al., 26 Oct 2025).

4. Computational and Empirical Impact

Replacing LayerNorm with RescaleNorm in transformers yields substantial efficiency gains:

RMSNorm and CRMSNorm remove the mean computation and subtraction steps, reducing the FLOPs spent in normalization (Jiang et al., 2023).
In empirical evaluations, Pre-RMSNorm and Pre-CRMSNorm transformers demonstrate a wall-clock speedup on A100 GPUs across both inference and training in ViT and GPT-style models.
SLaNC allows exact-FP32 normalization on FP16 hardware without numerical instability, maintaining baseline perplexity on Wikitext-2 in LLaMA models and strictly confining squared sum ranges within hardware limits (Salmani et al., 2024).
SeeDNorm, despite minimal parameter and compute overhead, achieves lower perplexity and higher zero-shot accuracy in LLM training benchmarks and boosts top-1 accuracy in vision models (e.g., on ViT-B relative to LayerNorm) (Cai et al., 26 Oct 2025). Its dynamic scaling also accelerates convergence in long-horizon training tasks.

5. Theoretical Guarantees, Practical Implementation, and Limitations

The arithmetic equivalence theorems assert that Pre-LN, Pre-RMSNorm, and Pre-CRMSNorm variants are functionally identical in both inference and SGD-based training, so no fine-tuning is needed to switch between them (Jiang et al., 2023). SLaNC mathematically guarantees that, after static rescaling, the LayerNorm variance never exceeds safe FP16 ranges regardless of input, eliminating subnormal/overflow issues with formal norm bounds (Salmani et al., 2024).

Implementation practices include:

Compile-time calculation of the static scales and high-precision computation of the rescaled $\varepsilon$ for SLaNC
Recentering inputs and reparametrizing residual-path linear weights for Pre-RMSNorm convertibility
Weight-regularization and controlled gating initialization for SeeDNorm stability, with Multi-Head variants further improving convergence in deep vision models (Cai et al., 26 Oct 2025)

Practical limits include sensitivity to static weight norms (extreme scales may lead to underflow or precision loss in SLaNC) and, for SeeDNorm, the need for proper regularization to prevent gradient explosion or training instability. The additional $O(d)$ parameter cost of SeeDNorm is negligible compared to the $O(d^2)$ cost of fully-connected layers.
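As a rough worked example of that cost comparison, assume a hidden width of $d = 4096$ (a hypothetical value, not taken from the cited papers) and a single square $d \times d$ fully-connected layer:

$d = 4096 \;\Rightarrow\; d^2 = 16{,}777{,}216, \qquad \frac{d}{d^2} = \frac{1}{4096} \approx 0.024\%$

Under that assumption, each additional per-feature vector of length $d$ adds about 0.024% of the parameters of one such weight matrix.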

6. Extensions, Comparisons, and Context within the Normalization Landscape

RescaleNorm variants can be viewed as part of a larger trend in neural network normalization, where mean-centering, scale-normalization, and adaptability are separated into distinct, individually efficient modules. Related techniques include LayerNorm, Dynamic Tanh (DyT), and emerging dynamic schemes that address loss of magnitude information or destabilization of gradients.

There is no consensus on universal superiority of one normalization scheme over another; for example, RMSNorm is used in some modern LLMs while LayerNorm persists elsewhere. The equivalences proved for Pre-RMSNorm and CRMSNorm show that one can always substitute them for Pre-LN Transformers without loss of functionality or accuracy. In scenarios requiring precise numerical guarantees (e.g., quantized inference) or adaptive representational power (e.g., zero-shot generalization), static and dynamic RescaleNorm variants provide principled avenues for optimization (Jiang et al., 2023, Salmani et al., 2024, Cai et al., 26 Oct 2025).

References

Jiang et al. (2023). Pre-RMSNorm and Pre-CRMSNorm Transformers: Equivalent and Efficient Pre-LN Transformers.

Salmani et al. (2024). SLaNC: Static LayerNorm Calibration.

Cai et al. (2025). SeeDNorm: Self-Rescaled Dynamic Normalization.
