
Group Normalization (GN)

Updated 17 May 2026
  • Group Normalization (GN) is a feature normalization technique that partitions activations into groups, enabling consistent model behavior regardless of batch size.
  • GN enhances training stability and performance in settings with small batches, outperforming Batch Norm in various vision, biomedical, and multimodal applications.
  • Its flexible implementation allows adjustment of group counts and hybrid approaches to balance deterministic normalization with the stochastic regularization of BN.

Group Normalization (GN) is a feature normalization technique for deep neural networks, designed to address the limitations of batch-dependent normalization layers such as Batch Normalization (BN). GN partitions the feature channels of activations into groups and computes normalization statistics (mean and variance) independently within each group, for each sample. Because no statistic depends on other samples in the batch, model behavior is consistent regardless of batch size. The method provides stable training and inference in scenarios with small batch sizes or significant data heterogeneity, and has been applied across vision, biomedical, generative, and multimodal learning tasks.

1. Formal Definition and Mechanism

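Consider an activation tensor $x \in \mathbb{R}^{N \times C \times H \times W}$, where $N$ is the batch size, $C$ is the number of channels, and $H \times W$ is the spatial resolution. GN divides the $C$ channels into $G$ disjoint groups, each containing $C/G$ channels. For each group $g$ and sample $n$, the mean $\mu_{n,g}$ and variance $\sigma_{n,g}^2$ are computed over all elements belonging to the group (across its $C/G$ channels and all spatial locations):

$$\mu_{n,g} = \frac{1}{m} \sum_{i \in S_{n,g}} x_i$$

$$\sigma_{n,g}^2 = \frac{1}{m} \sum_{i \in S_{n,g}} \left(x_i - \mu_{n,g}\right)^2$$

where $m = (C/G) \cdot H \cdot W$ and $S_{n,g}$ indexes all elements in group $g$ for sample $n$.

Normalization is then applied to each element $x_i$ with $i \in S_{n,g}$, using a small constant $\epsilon$ for numerical stability:

$$\hat{x}_i = \frac{x_i - \mu_{n,g}}{\sqrt{\sigma_{n,g}^2 + \epsilon}}$$

A learned per-channel scale $\gamma_c$ and bias $\beta_c$, indexed by the channel $c$ that contains element $i$, restore representational flexibility:

$$y_i = \gamma_c \, \hat{x}_i + \beta_c$$

When $G = 1$, GN reduces to Layer Normalization; when $G = C$, it becomes Instance Normalization (Wu et al., 2018).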
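The definition above can be sketched in a few lines of NumPy. This is a minimal reference sketch, not an optimized or framework implementation; the function name `group_norm` and the default `eps=1e-5` are illustrative assumptions. It also checks the two special cases numerically: one group per sample (Layer Norm) and one group per channel (Instance Norm).

```python
import numpy as np

def group_norm(x, num_groups, gamma=None, beta=None, eps=1e-5):
    """Group Normalization for an array of shape (N, C, H, W)."""
    n, c, h, w = x.shape
    if c % num_groups != 0:
        raise ValueError("C must be divisible by the number of groups")
    # Collapse each group's C/G channels and all spatial positions into one axis.
    grouped = x.reshape(n, num_groups, -1)
    mean = grouped.mean(axis=2, keepdims=True)
    var = grouped.var(axis=2, keepdims=True)  # biased variance (divides by m)
    x_hat = ((grouped - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)
    # Per-channel affine parameters, broadcast over N, H, W.
    gamma = np.ones(c) if gamma is None else gamma
    beta = np.zeros(c) if beta is None else beta
    return gamma.reshape(1, c, 1, 1) * x_hat + beta.reshape(1, c, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(2, 8, 4, 4))
y = group_norm(x, num_groups=4)

# G = 1: statistics over (C, H, W) per sample, i.e. Layer Norm.
layer_norm = (x - x.mean(axis=(1, 2, 3), keepdims=True)) / np.sqrt(
    x.var(axis=(1, 2, 3), keepdims=True) + 1e-5)
# G = C: statistics over (H, W) per sample and channel, i.e. Instance Norm.
instance_norm = (x - x.mean(axis=(2, 3), keepdims=True)) / np.sqrt(
    x.var(axis=(2, 3), keepdims=True) + 1e-5)

print(np.allclose(group_norm(x, 1), layer_norm))
print(np.allclose(group_norm(x, 8), instance_norm))
```

The reshape relies on the channel axis being laid out contiguously, so that each group consists of `C/G` consecutive channels.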

2. Comparison with Other Normalization Methods

The principal distinction between GN and BN is the axis over which normalization statistics are computed and the resulting batch-size dependency:

| Normalization Method | Normalized Axes | Batch Dependence |
|---|---|---|
| Batch Norm (BN) | (N, H, W) per channel | Yes (performance varies with batch size) |
| Layer Norm (LN) | (C, H, W) per sample | No |
| Instance Norm (IN) | (H, W) per sample, per channel | No |
| Group Norm (GN) | (C/G, H, W) per group, per sample | No |

BN’s reliance on batch statistics introduces instability in small-batch regimes, because the per-channel mean and variance become noisy estimates when few samples contribute to them. GN’s per-sample, per-group estimation keeps accuracy largely stable across batch sizes (Wu et al., 2018, Habib et al., 2024). Unlike BN, GN does not require accumulation or synchronization of running statistics between training and inference.
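The batch-size dependence can be checked directly. The sketch below, using illustrative helper names and omitting affine parameters, normalizes the same sample once inside a batch of eight and once on its own. GN produces the same output in both cases, while training-mode BN does not.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    # Statistics over (C/G, H, W) for each sample and group.
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, -1)
    g = (g - g.mean(axis=2, keepdims=True)) / np.sqrt(g.var(axis=2, keepdims=True) + eps)
    return g.reshape(n, c, h, w)

def batch_norm_train(x, eps=1e-5):
    # Training-mode BN: statistics over (N, H, W) for each channel.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.normal(size=(8, 16, 5, 5))
single = batch[:1]  # the first sample on its own

gn_same = np.allclose(group_norm(batch, 4)[:1], group_norm(single, 4))
bn_same = np.allclose(batch_norm_train(batch)[:1], batch_norm_train(single))
print(gn_same, bn_same)  # True False
```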

3. Theoretical Perspectives and Spherical Normalization Framework

GN can be expressed within the "spherical normalization" framework, which interprets a group’s pre-activations as a vector in $\mathbb{R}^{m}$, with $m = (C/G) \cdot H \cdot W$, projected onto a sphere of radius $\sqrt{m}$, removing the scale and mean (Sun et al., 2020). The affine parameters $(\gamma, \beta)$ then "re-embed" the standardized vector. GN is invariant to groupwise scaling and shifting of pre-activations:

  • For any $a > 0$ and $b \in \mathbb{R}$,

$$\mathrm{GN}(a\,x + b\,\mathbf{1}) = \mathrm{GN}(x)$$

where $x$ is the group’s pre-activation vector and $\mathbf{1}$ is the all-ones vector.

This invariance constrains optimization to a compact manifold (sphere), leading to scale-invariant gradients and, if unchecked, monotonic weight norm growth during training. This phenomenon can increase adversarial vulnerability unless regularized with weight decay (Sun et al., 2020).
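The invariance and the sphere interpretation can be verified numerically for a single group. This is a small sketch under the assumption that $\epsilon = 0$ and that the biased variance is used; with $\epsilon > 0$ both properties hold only approximately. The names `standardize`, `a`, and `b` are illustrative.

```python
import numpy as np

def standardize(v, eps=0.0):
    # Remove the mean and scale of one group's flattened pre-activations.
    return (v - v.mean()) / np.sqrt(v.var() + eps)

rng = np.random.default_rng(0)
m = 48                      # e.g. (C/G) * H * W = 3 * 4 * 4
v = rng.normal(size=m)
a, b = 3.7, -1.2            # arbitrary positive scale and shift

z = standardize(v)
z_scaled = standardize(a * v + b)
norm = np.linalg.norm(z)

print(np.allclose(z, z_scaled))       # rescaling and shifting change nothing
print(np.isclose(norm, np.sqrt(m)))   # the result lies on a sphere of radius sqrt(m)
```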

4. Empirical Results and Use Cases

GN was demonstrated to outperform or match BN in multiple settings, particularly when batch sizes are small or variable (Wu et al., 2018, Habib et al., 2024):

  • ImageNet Classification (ResNet-50): With batch size 2, GN achieves 24.1% top-1 error, while BN degrades to 34.7%. For standard batch sizes ($N = 32$), GN remains within 0.5% of BN.
  • COCO Detection/Segmentation: GN achieves higher $\mathrm{AP}^{\text{bbox}}$ (40.8 vs. 38.6) and $\mathrm{AP}^{\text{mask}}$ (36.1 vs. 34.5) compared to BN in Mask R-CNN, especially when the per-GPU batch size $N$ is very small.
  • Medical Imaging: In U-Net training for 2D biomedical semantic segmentation, fine-grained grouping (e.g., GN with $G = C$, equivalent to IN) yields the highest Dice coefficients, suggesting improved generalization over BN and LN in the presence of significant data heterogeneity and small batches (Zhou et al., 2018).
  • Few-Shot and Conditional Learning: Conditional GN, in which the affine parameters become functions of auxiliary conditioning variables, supports systematic generalization and domain-shift robustness in visual question answering and meta-learning benchmarks (Michalski et al., 2019).
  • Multimodal and RL Settings: Difficulty-aware GN (Durian) uses sample difficulty metrics (visual entropy, reasoning uncertainty) for regrouping, improving stability in multimodal reinforcement learning and yielding substantial gains over standard group-relative normalization (Li et al., 25 Feb 2026).

5. Practical Implementation and Hyperparameter Selection

Implementing GN typically involves selecting an appropriate group count $G$. The consensus is:

  • Use $G = 32$ by default for $C \geq 32$.
  • If $C < 32$, select $G$ as a divisor of $C$, for example a small divisor of $C$ or $G = 1$ (Layer Norm).
  • GN’s computational overhead is minimal (within 5% of BN), requiring only group-wise mean and variance computations plus per-channel affine transformation (Wu et al., 2018, Habib et al., 2024).
  • GN models often tolerate higher learning rates than their BN counterparts.
  • GN does not use or maintain population statistics for inference; the same normalization is applied at train and test time (Habib et al., 2024).
  • Framework support: PyTorch provides torch.nn.GroupNorm(num_groups=G, num_channels=C).
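Because the group count must divide the channel count, a small helper is convenient when layers have irregular widths. The function below is a hypothetical convenience, not part of any framework: it returns the largest divisor of the channel count that does not exceed a preferred group count (32 by default, following the guideline above).

```python
def choose_num_groups(num_channels, preferred=32):
    """Return the largest divisor of num_channels that is <= preferred."""
    if num_channels < 1 or preferred < 1:
        raise ValueError("num_channels and preferred must be positive")
    # Walk downwards from the preferred count; 1 always divides, so this returns.
    for g in range(min(preferred, num_channels), 0, -1):
        if num_channels % g == 0:
            return g

for c in (64, 48, 24, 7):
    print(c, choose_num_groups(c))
# 64 32
# 48 24
# 24 24
# 7 7
```

Note that for layers narrower than the preferred count this rule selects $G = C$, which is Instance Norm; passing a smaller `preferred` (for example `preferred=1` for Layer Norm) overrides that. The result can be passed as `num_groups` to torch.nn.GroupNorm, which requires `num_channels` to be divisible by `num_groups`.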

6. Limitations and Hybrid Approaches

While GN resolves batch-size dependence, several limitations exist:

  • Lack of BN’s Stochastic Regularization: GN’s deterministic computation can lead to less regularization compared to BN’s batch-statistics-induced noise, which is particularly beneficial in generative models (e.g., GANs) (Michalski et al., 2019).
  • Training Instability and Sensitivity: GN exhibits greater sensitivity to injected noise and weight-decay regularization, and does not consistently stabilize optimization throughout all training phases. Specifically, early-phase “loss landscape” flatness and gradient predictiveness can lag behind BN, with GN only providing a smoothing effect in the training mid-stage (Gunawan et al., 2022).
  • Hybrid Normalization: To address such instabilities, GN + BN hybrid layers (e.g., “GN-first sequential,” where BN is applied after GN and outputs are fused by a learned gate) combine the batch-size invariance of GN with BN’s regularization. These hybrids improve robustness and reduce performance variance over batch size, yielding accuracy gains on diverse datasets (Gunawan et al., 2022).
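One way to picture a "GN-first sequential" hybrid is sketched below. The exact fusion rule used by Gunawan et al. is not given here, so everything in this block is an assumption made for illustration: the function name `gated_gn_bn`, the scalar gate parameter `gate_logit`, and the convex combination through a sigmoid. Affine parameters and BN running statistics are omitted.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, -1)
    g = (g - g.mean(axis=2, keepdims=True)) / np.sqrt(g.var(axis=2, keepdims=True) + eps)
    return g.reshape(n, c, h, w)

def batch_norm_train(x, eps=1e-5):
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gated_gn_bn(x, num_groups, gate_logit):
    """Hypothetical hybrid: BN is applied to the GN output, then a gate mixes both."""
    gn_out = group_norm(x, num_groups)
    bn_out = batch_norm_train(gn_out)
    gate = 1.0 / (1.0 + np.exp(-gate_logit))  # sigmoid keeps the gate in (0, 1)
    # gate near 1 -> pure GN (batch-independent); gate near 0 -> GN followed by BN.
    return gate * gn_out + (1.0 - gate) * bn_out

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 3, 3))
mixed = gated_gn_bn(x, num_groups=4, gate_logit=0.0)  # equal mix of both paths
```

In a trained network `gate_logit` would be a learnable parameter, letting the model choose how much batch-induced noise to admit.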

7. Extensions and Specialized Variants

GN’s framework is adaptable beyond the standard vision domain:

  • Difficulty-Aware GN: For reinforcement learning and multimodal LLMs, GN can be made difficulty-aware by regrouping samples via perceptual or reasoning metrics and normalizing within groups of homogeneous difficulty. This addresses collapse of variance when dealing with bimodal or highly polarized samples and stabilizes policy optimization (Li et al., 25 Feb 2026).
  • Conditional GN (CGN): CGN conditions the affine parameters $(\gamma, \beta)$ on external meta-data (e.g., task embedding or class label), supporting domain adaptation, generalization to novel conditions, and controlled modulation of feature statistics (Michalski et al., 2019).
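Conditional GN can be sketched by replacing the fixed per-channel scale and bias with functions of a conditioning vector. The linear projections `w_gamma` and `w_beta`, the conditioning input `z`, and the `1 +` offset on the scale are assumptions made for this illustration; the cited work may parameterize the conditioning differently.

```python
import numpy as np

def conditional_group_norm(x, z, w_gamma, w_beta, num_groups, eps=1e-5):
    """x: (N, C, H, W); z: (N, D) conditioning vectors; w_gamma, w_beta: (D, C)."""
    n, c, h, w = x.shape
    g = x.reshape(n, num_groups, -1)
    x_hat = ((g - g.mean(axis=2, keepdims=True))
             / np.sqrt(g.var(axis=2, keepdims=True) + eps)).reshape(n, c, h, w)
    gamma = 1.0 + z @ w_gamma  # (N, C); zero weights give the identity scale
    beta = z @ w_beta          # (N, C)
    # Each sample gets its own per-channel scale and bias.
    return gamma[:, :, None, None] * x_hat + beta[:, :, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 4, 4))
z = rng.normal(size=(2, 5))            # e.g. a task or class embedding
w_gamma = 0.1 * rng.normal(size=(5, 8))
w_beta = 0.1 * rng.normal(size=(5, 8))

out = conditional_group_norm(x, z, w_gamma, w_beta, num_groups=4)
```

With both projection matrices set to zero the layer reduces to plain GN without affine parameters, which makes the conditioning easy to ablate.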

References

  • Wu & He, "Group Normalization" (Wu et al., 2018)
  • Sun et al., "New Interpretations of Normalization Methods in Deep Learning" (Sun et al., 2020)
  • Zhou et al., "Normalization in Training U-Net for 2D Biomedical Semantic Segmentation" (Zhou et al., 2018)
  • "Exploring the Efficacy of Group-Normalization in Deep Learning Models for Alzheimer's Disease Classification" (Habib et al., 2024)
  • "Understanding and Improving Group Normalization" (Gunawan et al., 2022)
  • "An Empirical Study of Batch Normalization and Group Normalization in Conditional Computation" (Michalski et al., 2019)
  • "Enhancing Multi-Modal LLMs Reasoning via Difficulty-Aware Group Normalization" (Li et al., 25 Feb 2026)
