
Global Response Normalization (GRN)

Updated 21 May 2026
  • Global Response Normalization (GRN) is a channel-wise normalization technique that aggregates, normalizes, and reweights global activations to enforce inter-channel competition and prevent feature collapse.
  • GRN is integrated into ConvNeXt V2, replacing the LayerScale component, and contributes gains of up to 1.3 percentage points in ImageNet-1K top-1 accuracy, with larger gains on COCO and ADE20K.
  • Adding only two per-channel learnable vectors and negligible computational overhead, GRN addresses channel co-adaptation during masked autoencoder pre-training and restores feature diversity.

Global Response Normalization (GRN) is a response normalization method introduced in "ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders" to improve channel diversity and feature representation in convolutional neural networks, particularly under masked autoencoder (MAE) pre-training regimes. GRN operates by aggregating, normalizing, and reweighting global channel activations, addressing collapse issues and enhancing competitive channel dynamics in the ConvNeXt V2 architecture (Woo et al., 2023).

1. Motivation and Problem Setting

During fully convolutional masked autoencoder (FCMAE) pre-training of ConvNeXt architectures, it was observed that MLP layers suffered severe feature collapse: numerous channels became inactive (all zeros) or saturated, and the average pairwise cosine distance between channel vectors approached zero. The underlying cause is the lack of explicit mechanisms in the masked reconstruction loss to promote inter-channel competition or orthogonalization, leading to channel co-adaptation and diminished representational capacity. GRN was motivated by the biological principle of lateral inhibition, aiming to enforce competition among feature channels. This ensures differentiated responses, thus restoring feature diversity and supporting transferability in large-scale recognition benchmarks (Woo et al., 2023).
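The collapse diagnostic mentioned above can be sketched as follows. The function name and the exact form of the distance, (1 − cos)/2 averaged over all channel pairs, are assumptions made for illustration; the text only states that an average pairwise cosine distance between channel vectors was measured.

```python
import numpy as np

def mean_pairwise_cosine_distance(x, eps=1e-12):
    """x: activations of shape (H, W, C).

    Returns the mean of (1 - cos(X_i, X_j)) / 2 over all channel pairs
    (hypothetical definition, chosen for illustration).
    """
    h, w, c = x.shape
    flat = x.reshape(h * w, c).T                       # (C, H*W): one row per channel
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.maximum(norms, eps)               # unit-length channel vectors
    cos = unit @ unit.T                                # (C, C) cosine similarities
    return float(np.mean((1.0 - cos) / 2.0))

rng = np.random.default_rng(0)

# Collapsed features: every channel carries the same spatial map.
base = rng.random((8, 8))
collapsed = np.tile(base[:, :, None], (1, 1, 16))

# Diverse features: channels are independent random maps.
diverse = rng.standard_normal((8, 8, 16))

d_collapsed = mean_pairwise_cosine_distance(collapsed)
d_diverse = mean_pairwise_cosine_distance(diverse)
```

Under this definition, redundant channels drive the value toward zero, while independent channels keep it well away from zero.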

2. Formal Definition

Let the input to the GRN layer be an activation tensor $X \in \mathbb{R}^{H \times W \times C}$, where $X_i \in \mathbb{R}^{H \times W}$ denotes the spatial activation map of channel $i$:

  1. Global Aggregation: For each channel $i$, compute the global $L_2$-norm:

$$\mathcal{G}(X) = \left(\|X_1\|_2,\,\ldots,\,\|X_C\|_2\right) \in \mathbb{R}^C$$

  2. Divisive Normalization: Normalize each channel's global response by the sum over all channels:

$$\mathcal{N}(\mathcal{G}(X)_i) = \frac{\|X_i\|_2}{\sum_{j=1}^C \|X_j\|_2 + \varepsilon}$$

Typically, $\varepsilon = 10^{-6}$ is used for numerical stability.

  3. Feature Calibration: Each channel is scaled by its normalized response:

$$\widetilde{X}_i = X_i \times \mathcal{N}(\mathcal{G}(X)_i)$$

  4. Affine Scaling and Residual Addition: The final GRN output adds a learnable affine transformation and a residual skip connection:

$$Y_i = \gamma_i \widetilde{X}_i + \beta_i + X_i$$

Here, $\gamma_i$ and $\beta_i$ are learnable and initialized to zero, so GRN initially acts as the identity.

This formulation keeps normalization and feature calibration deterministic: every statistic is computed from the current input alone, so the layer stays computationally simple.
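As an illustrative worked example (the numbers are made up, not taken from the paper), consider two channels over a $1 \times 2$ spatial grid. The example ignores $\varepsilon$ and uses $\gamma_i = 1$, $\beta_i = 0$ purely for illustration, rather than the zero initialization:

$$
\begin{aligned}
X_1 &= (3,\ 4), \quad \|X_1\|_2 = 5, \qquad X_2 = (9,\ 12), \quad \|X_2\|_2 = 15 \\
\mathcal{N}(\mathcal{G}(X)_1) &= \tfrac{5}{5 + 15} = 0.25, \qquad \mathcal{N}(\mathcal{G}(X)_2) = \tfrac{15}{5 + 15} = 0.75 \\
\widetilde{X}_1 &= 0.25 \times (3,\ 4) = (0.75,\ 1), \qquad \widetilde{X}_2 = 0.75 \times (9,\ 12) = (6.75,\ 9) \\
Y_1 &= \widetilde{X}_1 + X_1 = (3.75,\ 5), \qquad Y_2 = \widetilde{X}_2 + X_2 = (15.75,\ 21)
\end{aligned}
$$

The channel with the larger global response is scaled by 1.75 and the weaker one by 1.25, which is the competitive effect the divisive normalization is meant to produce.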

3. Integration in ConvNeXt V2 Architectures

In ConvNeXt V2, the forward block structure is modified to accommodate GRN and remove the pre-existing LayerScale component. The canonical flow within a ConvNeXt V2 block is:

  1. Depthwise convolution (DW-Conv)
  2. LayerNorm (LN)
  3. 1x1 pointwise convolution to expand channels
  4. GELU nonlinearity
  5. Global Response Normalization (GRN)
  6. 1x1 pointwise convolution projecting back to original channel dimension
  7. Residual connection

GRN is placed immediately after the combination of pointwise expansion and nonlinearity. It is applied in both MAE pre-training and all downstream fine-tuning stages, with ablations confirming that its removal at any point causes notable accuracy drops (Woo et al., 2023).
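The block flow listed above can be sketched for a single sample in NumPy. This is a simplified illustration, not the reference implementation: the 7x7 kernel size, the 4x expansion ratio, the parameter names in the `params` dictionary, and the omission of biases and LayerNorm affine terms are all assumptions made here.

```python
import numpy as np

def depthwise_conv(x, w):
    """Naive 'same'-padded depthwise convolution. x: (H, W, C), w: (k, k, C), k odd."""
    h, wd, _ = x.shape
    k = w.shape[0]
    p = k // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            out += xp[i:i + h, j:j + wd, :] * w[i, j, :]
    return out

def layer_norm(x, eps=1e-6):
    """Normalize over the channel axis (affine terms omitted)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def grn(x, gamma, beta, eps=1e-6):
    """GRN on one sample of shape (H, W, C), using sum-based divisive normalization."""
    g = np.sqrt((x ** 2).sum(axis=(0, 1), keepdims=True))   # (1, 1, C)
    n = g / (g.sum(axis=-1, keepdims=True) + eps)
    return gamma * (x * n) + beta + x

def convnext_v2_block(x, p):
    y = depthwise_conv(x, p["dw"])         # 1. depthwise convolution
    y = layer_norm(y)                      # 2. LayerNorm
    y = y @ p["w_expand"]                  # 3. 1x1 expansion: C -> 4C
    y = gelu(y)                            # 4. GELU
    y = grn(y, p["gamma"], p["beta"])      # 5. GRN on the expanded channels
    y = y @ p["w_project"]                 # 6. 1x1 projection: 4C -> C
    return x + y                           # 7. residual connection

rng = np.random.default_rng(0)
C = 8
params = {
    "dw": rng.standard_normal((7, 7, C)) * 0.1,
    "w_expand": rng.standard_normal((C, 4 * C)) * 0.1,
    "gamma": np.zeros(4 * C),
    "beta": np.zeros(4 * C),
    "w_project": rng.standard_normal((4 * C, C)) * 0.1,
}
x = rng.standard_normal((6, 6, C))
out = convnext_v2_block(x, params)
```

Note that GRN sees the expanded 4C-dimensional features, so its two parameter vectors have length 4C in this sketch.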

4. Algorithmic Outline and Hyperparameters

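The GRN operation is efficiently implemented, requiring no extra convolutions or multi-layer perceptrons. In PyTorch-style terms it amounts to three tensor operations per forward pass: an L2 norm over the spatial dimensions, a division by the sum over channels, and a scaled residual addition. The relevant settings are: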

  • $\varepsilon$ is set to $10^{-6}$ for stability.
  • $\gamma$ and $\beta$ are learnable and channelwise, initialized at zero, so GRN is an identity map at initialization.
  • No running averages or momentum statistics are maintained; the normalization statistics are computed deterministically from each input, identically during training and inference.
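The outline above can be sketched in NumPy as follows. This is a minimal sketch, not the authors' code: the class name `GRN`, the (N, H, W, C) layout, and the use of NumPy in place of PyTorch are assumptions made for illustration. It follows the sum-based normalization of Section 2; a variant that divides by the channel mean instead differs only by a constant factor of C, which the learnable scale can absorb.

```python
import numpy as np

class GRN:
    """Sketch of a GRN layer for inputs of shape (N, H, W, C)."""

    def __init__(self, channels, eps=1e-6):
        # gamma and beta start at zero, so the layer is the identity at init.
        self.gamma = np.zeros(channels)
        self.beta = np.zeros(channels)
        self.eps = eps

    def __call__(self, x):
        # Global aggregation: L2 norm of each channel map, separately per sample.
        g = np.sqrt(np.sum(x ** 2, axis=(1, 2), keepdims=True))    # (N, 1, 1, C)
        # Divisive normalization by the sum over channels.
        n = g / (np.sum(g, axis=-1, keepdims=True) + self.eps)     # (N, 1, 1, C)
        # Feature calibration, affine scaling, and residual addition.
        return self.gamma * (x * n) + self.beta + x

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 4, 4, 8))

layer = GRN(channels=8)
identity_at_init = np.allclose(layer(x), x)

# With a non-zero gamma, changing one sample leaves the others untouched,
# because no statistic is shared across the batch axis in this sketch.
layer.gamma = np.ones(8)
x_changed = x.copy()
x_changed[1] *= 10.0
sample0_unaffected = np.allclose(layer(x)[0], layer(x_changed)[0])
```

Because no state is carried between calls, the same input always yields the same output, in contrast to layers that track running statistics.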

5. Comparison with Alternative Normalization and Gating Methods

Empirical comparisons on ImageNet-1K (ConvNeXt-Base + FCMAE, 800 epochs) demonstrate GRN’s effectiveness relative to classical normalization and gating mechanisms:

| Method | Top-1 Acc. (%) | Additional Parameters |
|---|---|---|
| Baseline (no GRN) | 83.7 | 0 |
| Local Response Norm | 83.2 | 0 |
| BatchNorm | 80.5 | 0 |
| LayerNorm | 83.8 | 0 |
| SE (Hu et al.) | 84.4 | +20M |
| CBAM (Woo et al.) | 84.5 | +20M |
| GRN | 84.6 | 0 |

GRN surpasses LayerNorm by 0.8 points, BatchNorm by 4.1, and non-learned LRN by 1.4, while achieving similar or higher accuracy than SE and CBAM attention modules at zero parameter overhead (Woo et al., 2023).
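The gaps quoted above follow directly from the table. A short check, using only the accuracies listed there (the dictionary keys are shortened labels chosen for this example):

```python
# Top-1 accuracies (%) copied from the comparison table.
top1 = {
    "Baseline": 83.7,
    "LRN": 83.2,
    "BatchNorm": 80.5,
    "LayerNorm": 83.8,
    "SE": 84.4,
    "CBAM": 84.5,
    "GRN": 84.6,
}

# Gap of GRN over every other method, rounded to one decimal place.
gaps = {
    name: round(top1["GRN"] - acc, 1)
    for name, acc in top1.items()
    if name != "GRN"
}

for name, gap in gaps.items():
    print(f"GRN vs {name}: +{gap:.1f}")
```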

6. Computational Considerations

GRN is highly efficient:

  • Adds only two length-$C$ vectors ($\gamma$, $\beta$) as learnable parameters.
  • No extra convolutions or network depth.
  • Single global norm, sum, and channel-wise multiplication per forward pass.
  • No meaningful increase in FLOPs or memory compared to ConvNeXt V1; practical runtime cost is negligible (implemented in approximately three code lines).
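To put the parameter cost in perspective, the count can be estimated with a few lines of arithmetic. The stage widths and depths below are the commonly cited ConvNeXt-Base configuration and the expansion ratio of 4 is the usual ConvNeXt default; none of these appear in the text above, so treat them as assumptions.

```python
# Assumed ConvNeXt-Base configuration (not stated in this article).
dims = (128, 256, 512, 1024)   # channel width per stage
depths = (3, 3, 27, 3)         # number of blocks per stage
expansion = 4                  # GRN acts on the expanded (4x) channels

# GRN: gamma and beta, each of length expansion * width, in every block.
grn_params = sum(depth * 2 * expansion * width
                 for depth, width in zip(depths, dims))

# LayerScale (removed in V2): one scale vector of length `width` per block.
layerscale_params = sum(depth * width for depth, width in zip(depths, dims))

net_added = grn_params - layerscale_params
print(grn_params, layerscale_params, net_added)
```

Under these assumptions GRN contributes on the order of 0.14M parameters, a small fraction of a model with tens of millions of parameters, which is consistent with the "0" reported at the table's granularity.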

7. Empirical Performance Impact

GRN yields consistent and significant accuracy improvements across major computer vision benchmarks when co-designed with FCMAE pre-training in ConvNeXt V2:

  • ImageNet-1K (Base):
    • V1 + FCMAE: 83.7%
    • V2 + FCMAE: 84.6% (+0.9)
  • Co-design ablation (Table 5):
    • V1 supervised (300 ep): 83.8%
    • V1 + FCMAE: 83.7%
    • V2 supervised: 84.3%
    • V2 + FCMAE: 84.6% (+0.8 over V1 supervised)
  • ConvNeXt Large:
    • V1 supervised: 84.3%
    • V2 + FCMAE: 85.6% (+1.3)
  • COCO Object Detection (Mask R-CNN):
    • V1-B sup: 50.3 box-AP
    • V2-B + FCMAE: 52.9 (+2.6)
  • ADE20K Semantic Segmentation (UPerNet):
    • V1-B sup: 49.9 mIoU
    • V2-B + FCMAE: 52.1 (+2.2)
    • V2-H + FCMAE: 55.0 mIoU

GRN is therefore critical for reviving channel diversity during masked autoencoder pre-training. In the figures listed above, it is associated with improvements of roughly 0.8 to 1.3 percentage points on ImageNet and larger gains on transfer tasks (Woo et al., 2023).


In summary, Global Response Normalization is a simple, low-overhead channel-wise normalization layer central to the ConvNeXt V2 design. Through deterministic aggregation, normalization, and feature calibration, it enforces inter-channel competition, counteracts feature collapse under MAE pre-training, and improves accuracy at negligible computational and parameter cost (Woo et al., 2023).
