ConvNeXt V2: Scalable ConvNet with GRN & FCMAE
- ConvNeXt V2 is a family of convolutional neural networks that integrates a Global Response Normalization layer and a fully convolutional masked autoencoder to achieve scalable performance across diverse applications.
- The GRN layer enhances inter-channel competition by recalibrating features and preventing collapse during training, outperforming traditional normalization methods.
- FCMAE pretraining enables efficient masked autoencoding by reconstructing missing patches, leading to superior accuracy in both discriminative and generative vision tasks.
ConvNeXt V2 denotes a family of convolutional neural network (ConvNet) architectures that integrate architectural innovations and self-supervised learning advancements, establishing state-of-the-art performance in both discriminative and generative vision tasks. Developed by Facebook AI Research, ConvNeXt V2 extends the core ConvNeXt paradigm by co-designing the backbone with a Global Response Normalization (GRN) layer and coupling it with a fully convolutional masked autoencoder (FCMAE) scheme, enabling competitive scaling from edge to large-scale setups. The ConvNeXt V2 framework has further inspired 3D variants for volumetric data analysis in domains such as medical image segmentation.
1. Architectural Foundations and Modifications
ConvNeXt V2 is built upon the macro-architecture of ConvNeXt V1, structured as a multi-stage ConvNet with hierarchical downsampling. Each stage comprises patch-embedding or downsampling convolutions with stride 2, followed by repeated ConvNeXt blocks. The standard block sequence includes depthwise convolutions, normalization, pointwise MLP expansion (expansion ratio of 4), GELU activation, and a residual connection. While V1 utilized LayerScale for per-channel learned scaling, V2 replaces this with the GRN layer, finding the former redundant once GRN is present.
The ConvNeXt V2 block can be denoted as follows:
- Depthwise convolution ($7 \times 7$ kernel in V2-2D; volumetric kernels in 3D extensions)
- Channel normalization (LayerNorm in 2D, InstanceNorm in certain 3D applications)
- Pointwise expansion MLP (expansion ratio 4)
- GELU activation
- MLP projection back to original channel dimension
- Global Response Normalization (GRN) layer
- Residual skip connection
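The block sequence above can be composed end-to-end. The following NumPy forward pass is an illustrative sketch, not the official implementation; shapes, parameter names, and initializations are assumptions:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    # normalize over the channel (last) dimension
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def depthwise_conv(x, w):
    # x: (H, W, C), w: (k, k, C); stride 1, zero "same" padding
    k = w.shape[0]
    p = k // 2
    H, W, _ = x.shape
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = (xp[i:i + k, j:j + k] * w).sum(axis=(0, 1))
    return out

def grn(x, gamma, beta, eps=1e-6):
    # Global Response Normalization over an (H, W, C) feature map
    gx = np.sqrt((x**2).sum(axis=(0, 1), keepdims=True))   # per-channel L2 norm
    nx = gx / (gx.mean(axis=-1, keepdims=True) + eps)      # divisive normalization
    return gamma * (x * nx) + beta + x                     # recalibrate + residual

def convnext_v2_block(x, p):
    """One ConvNeXt V2 block: dwconv -> LN -> MLP(4x) -> GELU -> GRN -> proj -> skip."""
    shortcut = x
    x = depthwise_conv(x, p["dw"])        # 7x7 depthwise convolution
    x = layer_norm(x)                     # channel-wise LayerNorm
    x = x @ p["w1"] + p["b1"]             # pointwise expansion, C -> 4C
    x = gelu(x)
    x = grn(x, p["gamma"], p["beta"])     # GRN (replaces V1's LayerScale)
    x = x @ p["w2"] + p["b2"]             # pointwise projection, 4C -> C
    return shortcut + x                   # residual connection
```

Note that GRN sits on the expanded 4C features, after the activation and before the down-projection, which is where the paper places it.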
Model family scaling is realized by adjusting channel widths and block counts, covering variants from Atto (3.7M params) to Huge (659M params) in 2D, and by multi-axis “compound” scaling in 3D (Woo et al., 2023, Roy et al., 19 Dec 2025).
2. Global Response Normalization (GRN)
GRN is a central innovation that provides inter-channel lateral inhibition to prevent channel domination and collapse during training. Formally, for a feature tensor $X \in \mathbb{R}^{H \times W \times C}$ with per-channel spatial slices $X_i$, the GRN layer computes:
- Per-channel global response: $G(X)_i = \|X_i\|_2$, the L2 norm aggregated over the spatial dimensions
- Divisive normalization across channels: $\mathcal{N}(G(X)_i) = G(X)_i \,/\, \big(\tfrac{1}{C}\sum_{j=1}^{C} G(X)_j + \epsilon\big)$
- Re-calibration and residual update: $X_i \leftarrow \gamma \cdot X_i \cdot \mathcal{N}(G(X)_i) + \beta + X_i$
with $\gamma$ and $\beta$ as learnable per-channel scale/bias parameters, initialized to zero so that GRN starts as an identity map (Woo et al., 2023). This mechanism is parameter-efficient, outperforming both batch normalization and gating (SE, CBAM) approaches with minimal parameter overhead. Analogous 3D formulations are used for volumetric data, preventing feature collapse when expansion ratios are high (Roy et al., 19 Dec 2025).
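The three steps above can be written as a minimal NumPy sketch (NHWC layout assumed; `gamma`/`beta` zero-initialized as in the paper):

```python
import numpy as np

def grn(x, gamma, beta, eps=1e-6):
    """Global Response Normalization for NHWC feature maps.

    gamma / beta are the learnable per-channel scale / bias; with both
    initialized to zero the layer starts out as an identity map.
    """
    gx = np.sqrt((x**2).sum(axis=(1, 2), keepdims=True))   # (N,1,1,C): spatial L2 norm
    nx = gx / (gx.mean(axis=-1, keepdims=True) + eps)      # divisive norm across channels
    return gamma * (x * nx) + beta + x                     # recalibration + residual
```

Because `nx` is above 1 for channels with above-average response and below 1 otherwise, strong channels are amplified relative to weak ones, which is the lateral-inhibition effect described above.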
3. Fully-Convolutional Masked Autoencoder (FCMAE) Framework
ConvNeXt V2 is co-designed with a fully convolutional masked autoencoding pretraining protocol, extending masked autoencoder (MAE) techniques to ConvNets:
- Inputs are decomposed into non-overlapping patches (e.g., $32 \times 32$ for images), with 60% of the patches randomly masked.
- The encoder processes only visible pixels via submanifold sparse convolution, avoiding mask tokens or positional encodings.
- The decoder, a single ConvNeXt block, reconstructs masked patches from the encoded features and mask tokens.
- The loss is applied only to masked patches: $\mathcal{L} = \frac{1}{|M|} \sum_{i \in M} \|\hat{x}_i - x_i\|_2^2$, a mean squared error between reconstructed and target patches, where $M$ denotes the set of masked patch indices.
Compared to transformer-based MAEs, FCMAE offers higher efficiency (1.7x speedup with single-block convolutional decoder) (Woo et al., 2023).
Key ablations indicate that a masking ratio of 0.6 optimizes performance; both higher and lower masking rates degrade learning. FCMAE also obviates the need for tokenization and is robust to the absence of positional encodings.
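The random masking and masked-only loss can be illustrated with a small sketch; the patch-grid granularity and helper names here are assumptions, not the official FCMAE code:

```python
import numpy as np

def random_patch_mask(h, w, ratio=0.6, rng=None):
    """Boolean mask over an h x w patch grid; True marks a masked patch."""
    rng = rng or np.random.default_rng()
    n = h * w
    n_masked = int(round(ratio * n))
    mask = np.zeros(n, dtype=bool)
    mask[rng.permutation(n)[:n_masked]] = True
    return mask.reshape(h, w)

def masked_reconstruction_loss(pred, target, mask):
    """MSE restricted to masked patches; pred/target: (h, w, d) patch features."""
    per_patch = ((pred - target) ** 2).mean(axis=-1)   # per-patch squared error
    return (per_patch * mask).sum() / mask.sum()       # average over masked only
```

The key property, mirroring the formula above, is that errors on visible patches contribute nothing to the loss, so the encoder is never rewarded for trivially copying visible content.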
4. Scaling Strategies and Model Variants
The ConvNeXt V2 family achieves scalable performance from edge devices to large-scale tasks. Scaling is performed by jointly increasing channel widths, block depths, and spatial context. In 2D, variants such as Atto, Femto, Pico, Nano, Tiny, Base, Large, and Huge correspond to parameter counts from 3.7M to 659M, with top-1 ImageNet-1K accuracy spanning from 76.7% to 85.8%. With intermediate ImageNet-22K pretraining and enlarged input resolution (e.g., $512 \times 512$), the Huge model achieves up to 88.9% top-1 accuracy (Woo et al., 2023).
In the 3D domain, scaling extends along three orthogonal axes: depth (number of layers), width (channel base), and context (patch size). For instance, “compound” scaling for 3D medical segmentation uses the base patch size for pretraining and an enlarged (e.g., 1.5×) patch context for fine-tuning, yielding further accuracy gains efficiently (Roy et al., 19 Dec 2025).
| Model Variant | Params (M) | Top-1 IN-1K (%) | ADE20K mIoU (%) |
|---|---|---|---|
| Atto | 3.7 | 76.7 | – |
| Base | 89 | 84.6 | 52.1 |
| Large | 198 | 85.6 | 53.7 |
| Huge | 659 | 85.8–88.9* | 55.0 |
*With intermediate ImageNet-22K pretraining and increased resolution.
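The 2D variants in the table differ only in stage depths and channel widths. A compact sketch of the family configuration follows; the exact values are assumed to match the official `facebookresearch/ConvNeXt-V2` repository and should be treated as an assumption of this sketch:

```python
# Stage depths and channel widths for the 2D ConvNeXt V2 family
# (assumed to follow the official facebookresearch/ConvNeXt-V2 repo).
CONVNEXT_V2_CONFIGS = {
    "atto":  {"depths": (2, 2, 6, 2),  "dims": (40, 80, 160, 320)},
    "femto": {"depths": (2, 2, 6, 2),  "dims": (48, 96, 192, 384)},
    "pico":  {"depths": (2, 2, 6, 2),  "dims": (64, 128, 256, 512)},
    "nano":  {"depths": (2, 2, 8, 2),  "dims": (80, 160, 320, 640)},
    "tiny":  {"depths": (3, 3, 9, 3),  "dims": (96, 192, 384, 768)},
    "base":  {"depths": (3, 3, 27, 3), "dims": (128, 256, 512, 1024)},
    "large": {"depths": (3, 3, 27, 3), "dims": (192, 384, 768, 1536)},
    "huge":  {"depths": (3, 3, 27, 3), "dims": (352, 704, 1408, 2816)},
}

def widths_double_per_stage(dims):
    """Each stage doubles the channel width of the previous one."""
    return all(b == 2 * a for a, b in zip(dims, dims[1:]))
```

Scaling the family is thus a matter of picking a base width and a depth schedule; the hierarchical 4-stage layout and per-stage width doubling are shared across all variants.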
5. Empirical Evaluation and Analysis
Joint architectural and algorithmic advances yield distinctive empirical outcomes. Co-designing ConvNeXt V2 with FCMAE secures >0.8% top-1 accuracy improvements over either innovation alone. Ablations show that the inclusion of GRN delivers measurable gains: for 2D, 84.6% with GRN vs. 83.8% without; for 3D, a +0.29 mean-DSC improvement before pretraining. Comparisons demonstrate that ConvNeXt V2 avoids feature-space collapse and supports high channel diversity, which correlates with superior downstream transfer (Woo et al., 2023, Roy et al., 19 Dec 2025).
In medical semantic segmentation benchmarks, 3D ConvNeXt V2 backbones (MedNeXt-V2) outperform previous backbones and transformer hybrids. For example, MedNeXt-V2 (Patch × 1.5) attains mean DSC 83.70 and NSD 81.77, outperforming next-best CADS by +1.15 DSC and +1.46 NSD; even the unscaled “Base” variant delivers +0.48 DSC over all public pretrained competitors. Improvements are directly attributed to GRN, expanded micro-architecture, and the compound scaling protocol (Roy et al., 19 Dec 2025).
6. Training Protocols and Practical Recommendations
Training ConvNeXt V2 models, both in 2D and 3D settings, utilizes AdamW optimization, cosine learning rate scheduling, and warm-up phases for large models. Deep supervision and data augmentation strategies adjust based on model size. For FCMAE, pretraining typically employs 800 to 1600 epochs on ImageNet-1K; fine-tuning protocols are adapted according to compute constraints and target tasks.
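The warmup-plus-cosine schedule mentioned above can be sketched as follows; the step granularity and the minimum-LR floor are illustrative assumptions rather than the published recipe:

```python
import math

def lr_at_step(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warmup followed by cosine decay to min_lr."""
    if step < warmup_steps:
        # linear ramp from 0 to base_lr over the warmup phase
        return base_lr * step / max(1, warmup_steps)
    # cosine decay from base_lr down to min_lr over the remaining steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

In practice this would be evaluated once per iteration (or per epoch) and fed to the AdamW optimizer's learning-rate parameter.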
For resource-limited environments, smaller variants (Atto, Femto) deliver strong performance with modest computational requirements. On standard accelerators, Tiny and Base offer favorable tradeoffs. Large-scale and high-accuracy scenarios utilize the Large and Huge variants, with intermediate 22K-image pretraining and increased spatial context during fine-tuning recommended for optimal performance (Woo et al., 2023, Roy et al., 19 Dec 2025).
In volumetric domains, practical scaling should select a base configuration (e.g., 52 layers, a moderate channel base, and a small base patch), benchmark several backbones on diverse small-scale datasets, and scale depth, width, or context as dictated by GPU memory budget and downstream task complexity. Employing large-patch context at fine-tuning time is efficient and advantageous; pretraining on task-relevant modalities is generally unnecessary if full fine-tuning will be applied, as demonstrated in medical image segmentation (Roy et al., 19 Dec 2025).
7. Extensions and Broader Implications
MedNeXt-V2 operationalizes ConvNeXt V2 principles for 3D medical image analysis, inserting a 3D GRN module in every block and deploying compound scaling for large-scale volumetric data. The study establishes that backbone strength at initialization is predictive of downstream transfer, challenging the convention of focusing solely on dataset size. In both 2D and 3D, ConvNeXt V2’s design enables robust and scalable training, supporting state-of-the-art transfer across multiple visual domains. Modular open-sourced implementations facilitate reproducibility and adaptation (see official repositories for both ConvNeXt V2 and MedNeXt-V2) (Woo et al., 2023, Roy et al., 19 Dec 2025).