Equivariant RMS Layer Normalization

Updated 27 August 2025
  • The paper highlights that ERMS LN reformulates layer normalization to enforce scale and translation equivariance, enabling consistent invariance under input transformations.
  • It employs a geometric projection onto hyperellipsoids and an RMS-based approach to preserve symmetry properties while reducing computational overhead.
  • Empirical evaluations in image and language models confirm that ERMS LN delivers efficiency gains and robust performance compared to traditional normalization methods.

Equivariant Root-Mean-Square Layer Normalization (ERMS LN) is an architectural and algorithmic strategy for enforcing scale- and translation-equivariant normalization in neural networks, with particular emphasis on layerwise normalization operations that preserve essential symmetry properties. ERMS LN generalizes LayerNorm and RMSNorm by treating normalization as a geometric and algebraic transformation with explicit invariance under scaling, shifting, and (in some settings) translation of the input or feature maps. This approach has implications for theoretical analysis, model robustness, efficiency, and downstream performance in various neural architectures.

1. Mathematical Formulation and Geometric Foundation

ERMS LN is founded on the principle that normalization layers should commute with scaling and/or shift transformations: for input $x$, scalar $\lambda > 0$, and constant $\mu$, it is desirable that $f(\lambda x + \mu) = \lambda f(x) + \mu$ (Herbreteau et al., 2023). Standard RMSNorm computes:

$$y_i = \frac{x_i}{\text{RMS}(x)} \cdot g_i + b_i, \quad \text{where} \quad \text{RMS}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2}$$

LayerNorm, in contrast, also removes the mean before normalizing:

$$y = \sqrt{d}\,\frac{x - (\hat{\mathbf{1}}\cdot x)\,\hat{\mathbf{1}}}{\|x - (\hat{\mathbf{1}}\cdot x)\,\hat{\mathbf{1}}\|_2}$$

ERMS LN further interprets normalization as mapping vectors onto the interior of an $(N-1)$-dimensional hyperellipsoid embedded in an $N$-dimensional space (Riechers, 7 May 2024, Gupta et al., 19 Sep 2024). The key operations are:

  1. Projection onto a subspace (removing mean, when translation equivariance is required).
  2. Nonlinear normalization using an equivariant norm, typically the RMS or $L^p$ norm.
  3. Affine transformation by learned gain and bias.

The formulation ensures that for any vector $x\in\mathbb{R}^n$, ERMS LN commutes with scaling: $\text{RMS}(\lambda x) = \lambda\,\text{RMS}(x)$ for $\lambda > 0$, so normalization is scale-equivariant (Zhang et al., 2019).
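
To make these operations concrete, the following NumPy sketch (an illustration, not an implementation from any of the cited papers) centers the input, divides by the RMS of the centered activations, applies a channelwise affine map, and numerically checks both the scale-equivariance of the RMS statistic and the resulting invariance of the normalized output under positive rescaling and uniform shifts.

```python
import numpy as np

def erms_ln(x, gain, bias, eps=1e-6):
    """Hypothetical ERMS-LN-style normalization sketch:
    1) project out the mean, 2) divide by the RMS of the centered
    vector, 3) apply a learned channelwise affine transform."""
    centered = x - x.mean(axis=-1, keepdims=True)   # remove the uniform component
    rms = np.sqrt((centered ** 2).mean(axis=-1, keepdims=True) + eps)
    return centered / rms * gain + bias

rng = np.random.default_rng(0)
n = 8
x = rng.normal(size=n)
gain, bias = rng.normal(size=n), rng.normal(size=n)
lam, mu = 3.7, -1.2

# RMS itself is scale-equivariant: RMS(lam * x) = lam * RMS(x) for lam > 0.
rms = lambda v: np.sqrt((v ** 2).mean())
assert np.isclose(rms(lam * x), lam * rms(x))

# The normalized output is therefore invariant to positive rescaling and,
# thanks to the mean projection, to uniform shifts of the input.
y = erms_ln(x, gain, bias)
assert np.allclose(erms_ln(lam * x + mu, gain, bias), y, atol=1e-5)
print("scale/shift checks passed")
```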

2. Equivariance Properties and Theoretical Guarantees

The primary target of ERMS LN is scale equivariance, but full normalization-equivariance (to both scaling and shifting) is achievable with proper architectural constraints. This requires:

  • Affine parameters applied channelwise, not spatially, for shift equivariance (Scanvic et al., 26 May 2025).
  • Norm (RMS) statistic computed over the dimensions subject to transformation (e.g., spatial dimensions $H, W$ for images).

The conditions for shift and continuous translation equivariance are summarized below:

| Condition | Shift Equivariance | Continuous Translation Equivariance |
| --- | --- | --- |
| Affine parameters only channelwise | ✔️ | ✔️ |
| Norm computed over spatial dimensions |  | ✔️ |

This is critical in vision: BatchNorm and LayerNorm-AF (the alias-free variant) exhibit translation equivariance when the normalization statistics are computed over the spatial dimensions (Scanvic et al., 26 May 2025).
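
As an illustration of these requirements (shapes and names here are hypothetical, not taken from the cited papers), the sketch below normalizes an image tensor by the per-channel RMS over the spatial dimensions, applies a channelwise affine transform, and checks that the result commutes with an integer circular shift of the image; the continuous, alias-free case analyzed by Scanvic et al. requires additional care not shown here.

```python
import numpy as np

def spatial_rms_norm(x, gain, bias, eps=1e-6):
    """Per-channel RMS over spatial dims (H, W), channelwise affine.
    x: (C, H, W); gain, bias: (C,)."""
    rms = np.sqrt((x ** 2).mean(axis=(1, 2), keepdims=True) + eps)
    return x / rms * gain[:, None, None] + bias[:, None, None]

rng = np.random.default_rng(0)
C, H, W = 4, 16, 16
x = rng.normal(size=(C, H, W))
gain, bias = rng.normal(size=C), rng.normal(size=C)

def shift_img(img):
    # circular pixel shift along the spatial axes
    return np.roll(img, (3, -5), axis=(1, 2))

# Normalizing then shifting equals shifting then normalizing: the RMS over
# (H, W) is unchanged by a circular shift, and the affine parameters do not
# depend on spatial position.
assert np.allclose(spatial_rms_norm(shift_img(x), gain, bias),
                   shift_img(spatial_rms_norm(x, gain, bias)))
print("shift-equivariance check passed")
```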

3. RMSNorm: Efficiency, Re-Scaling Equivariance, and Implicit Adaptivity

RMSNorm omits mean subtraction, making it more efficient than LayerNorm and reducing running time by 7% to 64% in real models (Zhang et al., 2019, Jiang et al., 2023, Graef et al., 12 Jul 2024). Its mathematical structure:

$$\text{RMS}(x) = \sqrt{\frac{1}{n}\sum_{i=1}^n x_i^2}$$

guarantees linearity under positive scaling, i.e., $\text{RMS}(\alpha x) = \alpha\,\text{RMS}(x)$ for $\alpha > 0$, so the normalized output is invariant to rescaling of the input. The gradient analysis shows that scaling the weight matrix $W$ yields an inverse-proportional change in its gradient, conferring implicit learning-rate adaptation (Zhang et al., 2019).

Mechanistic evidence indicates that in LLMs, hidden vectors are nearly orthogonal to the uniform direction; thus, the mean removal performed by LayerNorm is largely redundant, and RMSNorm regulates norms without sacrificing representational quality (Gupta et al., 19 Sep 2024).
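
A quick numerical illustration of this point (a toy check, not an experiment from the cited work): when a hidden vector has only a tiny component along the uniform direction, LayerNorm and RMSNorm produce nearly identical outputs, and they coincide exactly for zero-mean inputs.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    return x / np.sqrt((x ** 2).mean() + eps)

def layer_norm(x, eps=1e-6):
    c = x - x.mean()
    return c / np.sqrt((c ** 2).mean() + eps)

rng = np.random.default_rng(0)
n = 4096
x = rng.normal(size=n)
x -= x.mean()                # exactly zero-mean: the two coincide
assert np.allclose(layer_norm(x), rms_norm(x))

x_llm = x + 1e-3             # tiny uniform component, as observed in LLM hidden states
gap = np.abs(layer_norm(x_llm) - rms_norm(x_llm)).max()
print(f"max |LayerNorm - RMSNorm| with near-zero mean: {gap:.2e}")  # on the order of 1e-3
```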

4. Practical Architectural Implications and Algorithmic Strategies

In transformer and convolutional architectures, ERMS LN enables several practical strategies:

  • Removal of mean subtraction when prior or architectural guarantees provide zero-mean features (Jiang et al., 2023). Pre-LN transformer branches are naturally zero-mean; LayerNorm can be replaced by RMSNorm with no loss of expressivity, confirming their arithmetic equivalence.
  • Lossless compression (CRMSNorm): one dimension can be omitted for zero-mean vectors, further improving efficiency (Jiang et al., 2023).
  • FlashNorm technique: merges normalization weights into the next linear layer and defers normalization, optimizing both kernel launch overhead and parallelism for LLMs (Graef et al., 12 Jul 2024); the weight-folding step is sketched below.
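
The gain-folding part of this idea can be demonstrated in a few lines (a simplified sketch under the assumption of a gain-only RMSNorm followed by a bias-free linear layer; the kernel-level deferral of FlashNorm itself is not reproduced here): because the elementwise gain $g$ acts diagonally, it can be absorbed into the columns of the next weight matrix ahead of time.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 64, 32
x = rng.normal(size=d_in)
g = rng.normal(size=d_in)            # RMSNorm gain (no bias assumed)
W = rng.normal(size=(d_out, d_in))   # next linear layer, no bias assumed

rms = np.sqrt((x ** 2).mean())

# Unfolded: normalize, scale by the gain, then apply the linear layer.
y_unfolded = W @ (x / rms * g)

# Folded: pre-multiply the columns of W by the gain once, offline,
# so that at run time only the division by the RMS remains.
W_folded = W * g                     # broadcasts g over the columns of W
y_folded = W_folded @ (x / rms)

assert np.allclose(y_unfolded, y_folded)
print("gain folding check passed")
```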

Additionally, normalization-equivariant architectures are achieved by constraining convolutional layers (filter weights summing to one) and replacing pointwise activations by channelwise sort pooling, ensuring $f(\lambda x+\mu) = \lambda f(x)+\mu$ throughout the network (Herbreteau et al., 2023). These architectural designs render all normalization steps compatible with input scaling and shifting, thus improving conditioning and generalization.
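
A toy analogue of such a block (a fully connected layer standing in for a convolution, and a simple pairwise sort standing in for the channelwise sort pooling of Herbreteau et al.; all names here are illustrative) shows why the two constraints compose into a normalization-equivariant map satisfying $f(\lambda x+\mu)=\lambda f(x)+\mu$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8

# Linear layer whose rows sum to one: an affine combination of the inputs,
# so W(x + mu*1) = Wx + mu*1 and W(lam*x) = lam*Wx.
W = rng.uniform(size=(n, n))
W = W / W.sum(axis=1, keepdims=True)

def sort_pool(x):
    """Pairwise sort pooling (stand-in for a pointwise activation):
    sorting commutes with any increasing affine map of its inputs."""
    pairs = x.reshape(-1, 2)
    return np.sort(pairs, axis=1).reshape(-1)

f = lambda x: sort_pool(W @ x)

x = rng.normal(size=n)
lam, mu = 2.5, 0.7
assert np.allclose(f(lam * x + mu), lam * f(x) + mu)
print("normalization-equivariance check passed")
```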

5. Geometric Interpretation and Sphere-Based Normalization

LayerNorm and its equivariant extensions are interpreted geometrically as projection onto a sphere in the subspace orthogonal to the uniform vector (Sun et al., 2020, Riechers, 7 May 2024, Gupta et al., 19 Sep 2024). Specifically:

  • LayerNorm: removes the mean, normalizes the variance, then scales so the output lies on an $(N-1)$-dimensional sphere or hyperellipsoid.
  • RMSNorm: normalizes with respect to the RMS, which is proportional to the $L^2$ norm; when the mean is zero, it is strictly equivalent to LayerNorm (Zhang et al., 2019).
  • ERMS LN: if built on the RMS of centered activations, it inherits translation and scaling equivariance, mapping outputs onto the sphere $S_{n-2}(\sqrt{n})$ with directionality and scale kept separate (Sun et al., 2020).

Eigen-decomposition of the gain matrix reveals which directions are stretched or squashed by normalization; principled extensions may enforce invariance/equivariance under symmetry groups by aligning normalization axes (Riechers, 7 May 2024).
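
The sphere picture can be verified directly (a small numerical check using the un-affined LayerNorm formula from Section 1): the normalized vector is orthogonal to the uniform direction and has Euclidean norm $\sqrt{n}$, i.e., it lies on $S_{n-2}(\sqrt{n})$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 512
x = rng.normal(size=n) * 3.0 + 1.5          # arbitrary scale and offset
one_hat = np.ones(n) / np.sqrt(n)           # unit vector along the uniform direction

centered = x - (one_hat @ x) * one_hat      # project out the uniform component
y = np.sqrt(n) * centered / np.linalg.norm(centered)

assert np.isclose(y @ np.ones(n), 0.0, atol=1e-9)   # orthogonal to the uniform vector
assert np.isclose(np.linalg.norm(y), np.sqrt(n))    # radius sqrt(n)
print("output lies on S_{n-2}(sqrt(n))")
```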

6. Empirical Performance and Robustness

Experiments show that ERMS LN and related normalization layers maintain or exceed the performance of LayerNorm and BatchNorm, especially in resource-constrained settings (Zhang et al., 2019, Jiang et al., 2023):

  • CIFAR-10, ImageNet (ResNet-18): ENorm (energy normalization—a batch-independent, weight balancing approach) improves test accuracy for small batches vs BatchNorm/GroupNorm (Stock et al., 2019).
  • Transformers: RMSNorm (and Pre-RMSNorm, Pre-CRMSNorm) achieves comparable accuracy to LayerNorm in ViT and GPT, with 1–10% speedup (Jiang et al., 2023).
  • Image Denoising: Normalization-equivariant architectures generalize better across noise levels (Herbreteau et al., 2023). The RMS-type normalization ensures that outputs “track” input transformations.

The use of $L^p$ alternatives (beyond $p=2$ for RMS) in normalization and adaptive optimizers yields further control over outliers and heavy-tailed distributions, providing a mechanism for more robust normalization (Daley et al., 2021).
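
To illustrate why the choice of $p$ matters (a toy comparison; the exact $L^p$ statistics used by Daley et al. may differ), the sketch below replaces the RMS with the generalized power mean $\big(\tfrac{1}{n}\sum_i |x_i|^p\big)^{1/p}$ and shows that a single large outlier inflates the $p=2$ statistic far more than the $p=1$ one.

```python
import numpy as np

def lp_stat(x, p):
    """Generalized power-mean statistic; p=2 recovers the RMS."""
    return (np.abs(x) ** p).mean() ** (1.0 / p)

rng = np.random.default_rng(0)
x = rng.normal(size=1024)
x_outlier = x.copy()
x_outlier[0] = 100.0                  # inject a single heavy-tailed outlier

for p in (1, 2):
    before, after = lp_stat(x, p), lp_stat(x_outlier, p)
    print(f"p={p}: statistic grows by {after / before:.2f}x with one outlier")
# The p=2 (RMS) statistic is inflated much more strongly, so the corresponding
# normalized activations shrink more in response to a single outlier.
```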

7. Current Limitations and Design Constraints

Critical constraints for ERMS LN and related equivariant normalizations include:

  • Affine transform restriction: Channelwise-only affine parameters must be enforced for equivariance across spatial dimensions; spatially-dependent affine parameters break translation equivariance (Scanvic et al., 26 May 2025).
  • Dimension of normalization: The norm statistic (RMS/variance) must be computed over the correct dimensions (e.g., spatial for images) to avoid aliasing and loss of equivariance under continuous translation (Scanvic et al., 26 May 2025).
  • Irreversibility of mean subtraction: the uniform component removed by LayerNorm cannot be recovered after normalization; when hidden vectors are already nearly orthogonal to $\mathbf{1}$, this subtraction discards essentially nothing and only adds computation (Gupta et al., 19 Sep 2024).

Empirical results reinforce these requirements—only normalization strategies meeting them exhibit measured shift and translation equivariance on real data.
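
The first of these constraints can be seen in a small counter-example (hypothetical shapes, mirroring the spatial-RMS sketch in Section 2): replacing the channelwise gain with a spatially varying, per-pixel gain breaks shift equivariance even for simple integer pixel shifts.

```python
import numpy as np

rng = np.random.default_rng(0)
C, H, W = 4, 16, 16
x = rng.normal(size=(C, H, W))
gain_pix = rng.normal(size=(C, H, W))   # spatially dependent affine parameters

def norm_per_pixel_gain(x, gain, eps=1e-6):
    rms = np.sqrt((x ** 2).mean(axis=(1, 2), keepdims=True) + eps)
    return x / rms * gain               # gain now depends on spatial position

def shift_img(img):
    return np.roll(img, (3, -5), axis=(1, 2))

# Shifting the input and shifting the output no longer agree: the per-pixel
# gain is anchored to absolute positions, so equivariance is lost.
lhs = norm_per_pixel_gain(shift_img(x), gain_pix)
rhs = shift_img(norm_per_pixel_gain(x, gain_pix))
print("max deviation with spatial gain:", np.abs(lhs - rhs).max())  # clearly non-zero
```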


In summary, Equivariant Root-Mean-Square Layer Normalization represents a class of normalization procedures in neural networks designed around scale and translation equivariance. By restructuring the computation and imposing exact architectural constraints, ERMS LN achieves robust, efficient, and theoretically principled normalization with competitive empirical performance. The approach is applicable in modern deep architectures, especially where symmetry preservation and computational resources are key considerations, and it provides a blueprint for future research in normalization theory and practice.