3D Global Response Normalization (GRN)
- 3D Global Response Normalization (GRN) is a channel-wise technique that computes the global L₂-norm of feature channels to calibrate activations in volumetric and structured data.
- It adaptively reweights activations using learnable scale (γ) and bias (β) parameters to mitigate channel dominance and promote balanced feature utilization.
- Integrated in architectures like MedNeXt-v2 and Flex-GCN, GRN has been shown to improve performance in 3D medical imaging and human pose estimation benchmarks.
3D Global Response Normalization (GRN) is a channel-wise normalization and gating technique specifically designed for volumetric and structured data, such as 3D medical images and 3D geometric features in human pose estimation. GRN replaces or augments conventional normalization layers by aggregating global per-channel responses and applying learnable affine calibration. Its core mechanism is based on computing the global L₂-norm of each feature channel across all spatial or graph nodes, normalizing these responses to limit channel dominance, and adaptively reweighting channel activations. This yields improved representation diversity, robust information propagation in deep networks, and consistent performance gains in state-of-the-art architectures for 3D applications (Shahjahan et al., 2024, Roy et al., 19 Dec 2025).
1. Formal Definition and Mathematical Formulation
3D Global Response Normalization operates on an input activation tensor (for 3D volumetric data) or a node-feature matrix (for structured graphs). The core computations are as follows:
Volumetric (3D CNN) Formulation (Roy et al., 19 Dec 2025):
- Compute per-sample, per-channel L₂-norm over the spatial domain:

$$G_c = \left( \sum_{x,y,z} X_{c,x,y,z}^{2} \right)^{1/2}$$

- Compute channel-sum for normalization:

$$S = \sum_{c'=1}^{C} G_{c'}$$

- Compute normalized response:

$$N_c = \frac{G_c}{S + \epsilon}$$

- Apply learnable channel re-scaling and shift with residual skip:

$$Y_c = \gamma_c \,(X_c \cdot N_c) + \beta_c + X_c$$

where $\gamma_c$, $\beta_c$ are learnable per-channel scale and bias, and $\epsilon$ is a small stabilization constant.
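As a concrete check, the four steps above can be traced numerically on a toy tensor (a NumPy sketch; the shapes, ε value, and zero initialization are illustrative, not taken from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)
C, D, H, W = 4, 2, 2, 2
X = rng.standard_normal((C, D, H, W))
eps = 1e-6

# Step 1: per-channel L2-norm over the spatial domain
G = np.sqrt((X ** 2).sum(axis=(1, 2, 3)))   # shape (C,)

# Step 2: channel-sum for normalization
S = G.sum()

# Step 3: normalized response (sums to ~1 across channels)
N = G / (S + eps)                           # shape (C,)

# Step 4: learnable re-scaling/shift with residual skip;
# gamma = beta = 0 makes the whole operation an identity map
gamma = np.zeros(C)
beta = np.zeros(C)
Y = gamma[:, None, None, None] * (X * N[:, None, None, None]) \
    + beta[:, None, None, None] + X

assert np.allclose(Y, X)  # identity at zero initialization
```

Because of the residual skip, the module only departs from the identity once γ and β move away from zero during training.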
Structured Graph Formulation (Shahjahan et al., 2024):
- Compute per-channel magnitude for node features:

$$G_c = \lVert H_{:,c} \rVert_2 = \left( \sum_{n=1}^{N} H_{n,c}^{2} \right)^{1/2}$$

- Compute mean across channels:

$$\mu = \frac{1}{C} \sum_{c=1}^{C} G_c$$

- Normalize and calibrate:

$$\hat{H}_{:,c} = \gamma_c \left( H_{:,c} \cdot \frac{G_c}{\mu + \epsilon} \right) + \beta_c + H_{:,c}$$
This process delivers adaptive, global self-gating on a per-channel basis, encouraging balanced channel utilization.
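A small numeric example (NumPy sketch; node/channel counts are illustrative) makes the self-gating concrete: a channel whose global norm exceeds the channel mean receives a multiplicative gate above 1, while below-average channels are attenuated relative to it:

```python
import numpy as np

N_nodes, C = 17, 3            # e.g. 17 skeleton joints, 3 feature channels
H = np.ones((N_nodes, C))
H[:, 0] *= 3.0                # channel 0 carries a much larger global response
eps = 1e-6

G = np.linalg.norm(H, axis=0)   # per-channel L2-norm over nodes, shape (C,)
mu = G.mean()                   # mean across channels
gate = G / (mu + eps)           # per-channel gating factor

# dominant channel gets gate > 1, the others gate < 1
assert gate[0] > 1.0 > gate[1]
```

The learnable γ and β then decide, per channel, how strongly this relative gating is applied on top of the residual path.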
2. Distinction from Classical Normalization Layers
GRN is fundamentally distinct from standard normalization strategies such as BatchNorm, LayerNorm, and InstanceNorm:
| Method | Normalizes over | Statistic Type | Learnable Params |
|---|---|---|---|
| BatchNorm | Batch × spatial | Mean/variance (zero mean, unit variance) | Scale + shift |
| LayerNorm | Channel × spatial | Mean/variance (zero mean, unit variance) | Scale + shift |
| InstanceNorm | Spatial | Mean/variance (zero mean, unit variance) | Scale + shift |
| 3D GRN | Channel (global response) | L₂-norm | Channel-wise scale γ + shift β |
GRN does not subtract channel means nor divide by per-channel variances. Instead, it normalizes per-channel global magnitude and reweights features accordingly. No batch statistics or momentum are used, and the operation serves as a channel-response limiter promoting more uniform information flow and reducing channel collapse (i.e., dead or saturated channels), which is particularly pertinent in deep, high-capacity expansions (Roy et al., 19 Dec 2025).
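The distinction is easy to verify numerically: mean/variance standardization (InstanceNorm-style) forces each channel to zero mean and unit variance, whereas GRN-style response normalization rescales each channel multiplicatively and does not center it (a NumPy sketch; the shapes and ε are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((2, 64)) + 5.0   # two channels, 64 elements, nonzero mean
eps = 1e-6

# InstanceNorm-style: per-channel standardization -> zero mean, unit variance
inst = (X - X.mean(axis=1, keepdims=True)) / (X.std(axis=1, keepdims=True) + eps)

# GRN-style: per-channel multiplicative rescaling by relative global L2 response
G = np.linalg.norm(X, axis=1)
grn = X * (G / (G.mean() + eps))[:, None]

assert np.allclose(inst.mean(axis=1), 0.0, atol=1e-6)  # centered
assert grn.mean(axis=1).min() > 1.0                    # channel means survive
```

This is why GRN can sit alongside a conventional normalization layer rather than replace it: it constrains relative channel magnitude, not first- and second-order statistics.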
3. Placement within Network Architectures
MedNeXt-v2 Block (3D CNN)
In MedNeXt-v2 (Roy et al., 19 Dec 2025), GRN is integrated after the activation function following the channel expansion:
- Depthwise 3×3×3 convolution
- InstanceNorm3D
- Pointwise 1×1×1 expansion convolution (widening to R·C channels, for expansion ratio R)
- GELU activation
- 3D GRN
- Pointwise 1×1×1 compression convolution (back to the base channel count C)
- Residual addition
GRN is applied once per block, immediately after the feature dimension expansion, ensuring effective channel competition before recompression.
Flex-GCN Pipeline (Graph Data)
In Flex-GCN for 3D human pose estimation (Shahjahan et al., 2024), GRN is positioned after all graph-convolutional residual blocks and before the final “lifting” layer that outputs 3D joint predictions:
- Input 2D joint positions
- Initial Flexible Graph Convolution + GELU
- 4 stacked residual Flex-GConv blocks (each with 3 Flex-GConvs and LayerNorm/GELU)
- Global Response Normalization (GRN)
- Final Flex-GConv (“lifting” to 3D output)
Placement after the residual stack allows GRN to adaptively amplify or attenuate global features before decoding or regression to target outputs.
4. Computational Complexity and Implementation
The computational cost of 3D GRN is lightweight, incurring only O(N·C) operations for N nodes and C channels (graph variant), or O(D·H·W·C) for a D×H×W×C volumetric feature map. The memory overhead is limited to the per-channel parameters (γ, β) and intermediate per-channel statistics.
PyTorch-style pseudo-implementations for both settings:
Graph variant (Shahjahan et al., 2024):
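The original code listing did not survive extraction; the following NumPy sketch implements the graph formulation from Section 1 (the function name, ε default, and zero initialization are assumptions; a PyTorch version would swap the `np` reductions for their `torch` equivalents):

```python
import numpy as np

def grn_graph(H, gamma, beta, eps=1e-6):
    """Global Response Normalization for node features H of shape (N_nodes, C)."""
    G = np.linalg.norm(H, axis=0)        # (C,) per-channel L2-norm over nodes
    mu = G.mean()                        # scalar mean across channels
    gate = G / (mu + eps)                # (C,) normalized response
    return gamma * (H * gate) + beta + H # affine calibration + residual skip

# Zero-initialized gamma/beta make GRN start as the identity.
H = np.random.default_rng(2).standard_normal((17, 8))
out = grn_graph(H, gamma=np.zeros(8), beta=np.zeros(8))
assert np.allclose(out, H)
```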
3D variant (Roy et al., 19 Dec 2025):
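Likewise for the volumetric variant, a NumPy sketch following the Section 1 formulas (layout (B, C, D, H, W), ε default, and zero initialization are assumptions; in PyTorch the reductions would use `keepdim=True` tensor ops in the same way):

```python
import numpy as np

def grn_3d(X, gamma, beta, eps=1e-6):
    """GRN for a volumetric feature map X of shape (B, C, D, H, W)."""
    # per-sample, per-channel L2-norm over the spatial domain
    G = np.sqrt((X ** 2).sum(axis=(2, 3, 4), keepdims=True))  # (B, C, 1, 1, 1)
    # channel-sum for normalization
    S = G.sum(axis=1, keepdims=True)                          # (B, 1, 1, 1, 1)
    N = G / (S + eps)                                         # normalized response
    g = gamma.reshape(1, -1, 1, 1, 1)
    b = beta.reshape(1, -1, 1, 1, 1)
    return g * (X * N) + b + X                                # calibration + residual

B, C = 2, 16
X = np.random.default_rng(3).standard_normal((B, C, 4, 4, 4))
out = grn_3d(X, gamma=np.zeros(C), beta=np.zeros(C))
assert np.allclose(out, X)   # identity at zero initialization
```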
No running statistics, non-linearities, or momenta are employed internally.
5. Hyperparameter Choices and Initialization
- Stabilization constant ε: typically 10⁻⁶ or 10⁻⁵, to avoid division by zero.
- Learnable channel-wise scale γ: initialized to zero.
- Learnable channel-wise bias β: initialized to zero, so that with the residual skip the module begins as an identity mapping.
- GRN is instantiated once per block after expansion/GELU (MedNeXt-v2), or after the residual stack (Flex-GCN).
- No additional clipping or nonlinear gating is employed unless explicitly desired, though clamping the normalized response N or applying sigmoidal gating is possible.
GRN does not replace preceding normalization (e.g., InstanceNorm may still be present) but acts as a dedicated channel-response calibrator.
6. Empirical Impact and Effectiveness
Quantitative ablation studies and cross-architecture benchmarks confirm that 3D GRN provides measurable improvements in accuracy and robustness:
- In Flex-GCN (Shahjahan et al., 2024), GRN yields a 5.1% relative reduction in MPJPE (46.9 mm vs. 49.4 mm) on Human3.6M (Protocol 1), a 1.3% improvement for PA-MPJPE (38.6 mm vs. 39.1 mm), and a 2–3% gain in PCK and AUC on MPI-INF-3DHP. On occlusion-heavy motions, GRN reduces errors by up to 8%.
- In MedNeXt-v2 (Roy et al., 19 Dec 2025), the sole change from v1 to v2 is the insertion of 3D GRN, resulting in a 0.29 percentage point mean Dice gain over four benchmarks (BTCV, AMOS, KiTS, ACDC), and a consistent reduction in “dead” or saturated channels in early feature maps (assessed visually).
GRN’s gating mechanism is particularly effective for promoting global co-occurrence patterns in pose estimation and for reinforcing channel diversity in high-capacity volumetric segmentation models, improving convergence, robustness under occlusion/ambiguity, and representation quality.
7. Practical Significance and Recommendations
3D Global Response Normalization is a generic, lightweight module for channel competition and calibration in 3D models. Its minimal computational and memory footprint, independence from batch statistics, and compatibility with both graph and volumetric convolutional backbones make it suitable for deployment in large-scale supervised learning regimes, particularly where channel imbalance and overfitting of deep expansions are concerns.
Recommended best practices include: placement immediately after an expansion/convolutional activation in every block; initializing the scale and bias to zero so the module starts as the identity; maintaining a small ε; and avoiding interference with other normalization layers reliant on running moments. Empirical evidence supports its routine inclusion for improved training stability and consistent downstream performance gains in both pose estimation and medical segmentation contexts (Shahjahan et al., 2024, Roy et al., 19 Dec 2025).