ConvNeXt V2: Scalable ConvNet with GRN & FCMAE
- ConvNeXt V2 is a family of convolutional neural networks that integrates a Global Response Normalization layer and a fully convolutional masked autoencoder to achieve scalable performance across diverse applications.
- The GRN layer enhances inter-channel competition by recalibrating features and preventing collapse during training, outperforming traditional normalization methods.
- FCMAE pretraining enables efficient masked autoencoding by reconstructing missing patches, leading to superior accuracy in both discriminative and generative vision tasks.
ConvNeXt V2 denotes a family of convolutional neural network (ConvNet) architectures that integrate architectural innovations and self-supervised learning advancements, establishing state-of-the-art performance in both discriminative and generative vision tasks. Developed by Facebook AI Research, ConvNeXt V2 extends the core ConvNeXt paradigm by co-designing the backbone with a Global Response Normalization (GRN) layer and coupling it with a fully convolutional masked autoencoder (FCMAE) scheme, enabling competitive scaling from edge to large-scale setups. The ConvNeXt V2 framework has further inspired 3D variants for volumetric data analysis in domains such as medical image segmentation.
1. Architectural Foundations and Modifications
ConvNeXt V2 is built upon the macro-architecture of ConvNeXt V1, structured as a multi-stage ConvNet with hierarchical downsampling. Each stage comprises patch-embedding or downsampling convolutions with stride 2, followed by repeated ConvNeXt blocks. The standard block sequence includes depthwise convolutions, normalization, pointwise MLP expansion (expansion ratio of 4), GELU activation, and a residual connection. While V1 utilized LayerScale for per-channel learned scaling, V2 replaces this with the GRN layer, finding the former redundant once GRN is present.
The ConvNeXt V2 block can be denoted as follows:
- Depthwise convolution ($7 \times 7$ kernel in V2-2D; volumetric kernels in 3D extensions)
- Channel normalization (LayerNorm in 2D, InstanceNorm in certain 3D applications)
- Pointwise expansion MLP (expansion ratio 4)
- GELU activation
- MLP projection back to original channel dimension
- Global Response Normalization (GRN) layer
- Residual skip connection
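The block sequence above can be composed end-to-end. The following NumPy forward pass is an illustrative sketch, not the official implementation; shapes, parameter names, and initializations are assumptions:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    # normalize over the channel (last) dimension
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def depthwise_conv(x, w):
    # x: (H, W, C), w: (k, k, C); stride 1, zero "same" padding
    k = w.shape[0]
    p = k // 2
    H, W, _ = x.shape
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.empty_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = (xp[i:i + k, j:j + k] * w).sum(axis=(0, 1))
    return out

def grn(x, gamma, beta, eps=1e-6):
    # Global Response Normalization over an (H, W, C) feature map
    gx = np.sqrt((x**2).sum(axis=(0, 1), keepdims=True))   # per-channel L2 norm
    nx = gx / (gx.mean(axis=-1, keepdims=True) + eps)      # divisive normalization
    return gamma * (x * nx) + beta + x                     # recalibrate + residual

def convnext_v2_block(x, p):
    """One ConvNeXt V2 block: dwconv -> LN -> MLP(4x) -> GELU -> GRN -> proj -> skip."""
    shortcut = x
    x = depthwise_conv(x, p["dw"])        # 7x7 depthwise convolution
    x = layer_norm(x)                     # channel-wise LayerNorm
    x = x @ p["w1"] + p["b1"]             # pointwise expansion, C -> 4C
    x = gelu(x)
    x = grn(x, p["gamma"], p["beta"])     # GRN (replaces V1's LayerScale)
    x = x @ p["w2"] + p["b2"]             # pointwise projection, 4C -> C
    return shortcut + x                   # residual connection
```

Note that GRN sits on the expanded 4C features, after the activation and before the down-projection, which is where the paper places it.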
Model family scaling is realized by adjusting channel widths and block counts, covering variants from Atto (3.7M params) to Huge (659M params) in 2D, and by multi-axis “compound” scaling in 3D (Woo et al., 2023, Roy et al., 19 Dec 2025).
2. Global Response Normalization (GRN)
GRN is a central innovation that provides inter-channel lateral inhibition to prevent channel domination and collapse during training. Formally, for a feature tensor $X \in \mathbb{R}^{H \times W \times C}$ with per-channel spatial slices $X_i$, the GRN layer computes:
- Per-channel global response: $G(X)_i = \|X_i\|_2$, the L2 norm aggregated over the spatial dimensions
- Divisive normalization across channels: $\mathcal{N}(G(X)_i) = G(X)_i \,/\, \big(\tfrac{1}{C}\sum_{j=1}^{C} G(X)_j + \epsilon\big)$
- Re-calibration and residual update: $X_i \leftarrow \gamma \cdot X_i \cdot \mathcal{N}(G(X)_i) + \beta + X_i$
with $\gamma$ and $\beta$ as learnable per-channel scale/bias parameters, initialized to zero so that GRN starts as an identity map (Woo et al., 2023). This mechanism is parameter-efficient, outperforming both batch normalization and gating (SE, CBAM) approaches with minimal parameter overhead. Analogous 3D formulations are used for volumetric data, preventing feature collapse when expansion ratios are high (Roy et al., 19 Dec 2025).
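The three steps above can be written as a minimal NumPy sketch (NHWC layout assumed; `gamma`/`beta` zero-initialized as in the paper):

```python
import numpy as np

def grn(x, gamma, beta, eps=1e-6):
    """Global Response Normalization for NHWC feature maps.

    gamma / beta are the learnable per-channel scale / bias; with both
    initialized to zero the layer starts out as an identity map.
    """
    gx = np.sqrt((x**2).sum(axis=(1, 2), keepdims=True))   # (N,1,1,C): spatial L2 norm
    nx = gx / (gx.mean(axis=-1, keepdims=True) + eps)      # divisive norm across channels
    return gamma * (x * nx) + beta + x                     # recalibration + residual
```

Because `nx` is above 1 for channels with above-average response and below 1 otherwise, strong channels are amplified relative to weak ones, which is the lateral-inhibition effect described above.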
3. Fully-Convolutional Masked Autoencoder (FCMAE) Framework
ConvNeXt V2 is co-designed with a fully convolutional masked autoencoding pretraining protocol, extending masked autoencoder (MAE) techniques to ConvNets:
- Inputs are decomposed into non-overlapping patches (e.g., $32 \times 32$ for images), with 60% of the patches randomly masked.
- The encoder processes only visible pixels via submanifold sparse convolution, avoiding mask tokens or positional encodings.
- The decoder, a single ConvNeXt block, reconstructs masked patches from the encoded features and mask tokens.
- The loss is applied only to masked patches: $\mathcal{L} = \frac{1}{|M|} \sum_{i \in M} \|\hat{x}_i - x_i\|_2^2$, a mean squared error between reconstructed and target patches, where $M$ denotes the set of masked patch indices.
Compared to transformer-based MAEs, FCMAE offers higher efficiency (1.7x speedup with single-block convolutional decoder) (Woo et al., 2023).
Key ablations indicate that a masking ratio of 0.6 optimizes performance; both higher and lower masking rates degrade learning. FCMAE also obviates the need for tokenization and is robust to the absence of positional encodings.
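The random masking and masked-only loss can be illustrated with a small sketch; the patch-grid granularity and helper names here are assumptions, not the official FCMAE code:

```python
import numpy as np

def random_patch_mask(h, w, ratio=0.6, rng=None):
    """Boolean mask over an h x w patch grid; True marks a masked patch."""
    rng = rng or np.random.default_rng()
    n = h * w
    n_masked = int(round(ratio * n))
    mask = np.zeros(n, dtype=bool)
    mask[rng.permutation(n)[:n_masked]] = True
    return mask.reshape(h, w)

def masked_reconstruction_loss(pred, target, mask):
    """MSE restricted to masked patches; pred/target: (h, w, d) patch features."""
    per_patch = ((pred - target) ** 2).mean(axis=-1)   # per-patch squared error
    return (per_patch * mask).sum() / mask.sum()       # average over masked only
```

The key property, mirroring the formula above, is that errors on visible patches contribute nothing to the loss, so the encoder is never rewarded for trivially copying visible content.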
4. Scaling Strategies and Model Variants
The ConvNeXt V2 family achieves scalable performance from edge devices to large-scale tasks. Scaling is performed by jointly increasing channel widths, block depths, and spatial context. In 2D, variants such as Atto, Femto, Pico, Nano, Tiny, Base, Large, and Huge correspond to parameter counts from 3.7M to 659M, with top-1 ImageNet-1K accuracy spanning from 76.7% to 85.8%. With intermediate ImageNet-22K pretraining and enlarged input resolution (e.g., $512 \times 512$), the Huge model achieves up to 88.9% top-1 accuracy (Woo et al., 2023).
In the 3D domain, scaling extends along three orthogonal axes: depth (number of layers), width (channel base), and context (patch size). For instance, “compound” scaling for 3D medical segmentation uses the base patch size for pretraining and an enlarged (e.g., 1.5×) patch context for fine-tuning, yielding further accuracy gains efficiently (Roy et al., 19 Dec 2025).
| Model Variant | Params (M) | Top-1 IN-1K (%) | ADE20K mIoU (%) |
|---|---|---|---|
| Atto | 3.7 | 76.7 | – |
| Base | 89 | 84.6 | 52.1 |
| Large | 198 | 85.6 | 53.7 |
| Huge | 659 | 85.8–88.9* | 55.0 |
*With intermediate ImageNet-22K pretraining and increased resolution.
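The 2D variants in the table differ only in stage depths and channel widths. A compact sketch of the family configuration follows; the exact values are assumed to match the official `facebookresearch/ConvNeXt-V2` repository and should be treated as an assumption of this sketch:

```python
# Stage depths and channel widths for the 2D ConvNeXt V2 family
# (assumed to follow the official facebookresearch/ConvNeXt-V2 repo).
CONVNEXT_V2_CONFIGS = {
    "atto":  {"depths": (2, 2, 6, 2),  "dims": (40, 80, 160, 320)},
    "femto": {"depths": (2, 2, 6, 2),  "dims": (48, 96, 192, 384)},
    "pico":  {"depths": (2, 2, 6, 2),  "dims": (64, 128, 256, 512)},
    "nano":  {"depths": (2, 2, 8, 2),  "dims": (80, 160, 320, 640)},
    "tiny":  {"depths": (3, 3, 9, 3),  "dims": (96, 192, 384, 768)},
    "base":  {"depths": (3, 3, 27, 3), "dims": (128, 256, 512, 1024)},
    "large": {"depths": (3, 3, 27, 3), "dims": (192, 384, 768, 1536)},
    "huge":  {"depths": (3, 3, 27, 3), "dims": (352, 704, 1408, 2816)},
}

def widths_double_per_stage(dims):
    """Each stage doubles the channel width of the previous one."""
    return all(b == 2 * a for a, b in zip(dims, dims[1:]))
```

Scaling the family is thus a matter of picking a base width and a depth schedule; the hierarchical 4-stage layout and per-stage width doubling are shared across all variants.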
5. Empirical Evaluation and Analysis
Joint architectural and algorithmic advances yield distinctive empirical outcomes. Co-designing ConvNeXt V2 with FCMAE secures >0.8% top-1 accuracy improvements over either innovation alone. Ablations show that the inclusion of GRN delivers measurable gains: for 2D, 84.6% with GRN vs. 83.8% without; for 3D, a +0.29 mean-DSC improvement before pretraining. Comparisons demonstrate that ConvNeXt V2 avoids feature-space collapse and supports high channel diversity, which correlates with superior downstream transfer (Woo et al., 2023, Roy et al., 19 Dec 2025).
In medical semantic segmentation benchmarks, 3D ConvNeXt V2 backbones (MedNeXt-V2) outperform previous backbones and transformer hybrids. For example, MedNeXt-V2 (Patch × 1.5) attains mean DSC 83.70 and NSD 81.77, outperforming next-best CADS by +1.15 DSC and +1.46 NSD; even the unscaled “Base” variant delivers +0.48 DSC over all public pretrained competitors. Improvements are directly attributed to GRN, expanded micro-architecture, and the compound scaling protocol (Roy et al., 19 Dec 2025).
6. Training Protocols and Practical Recommendations
Training ConvNeXt V2 models, both in 2D and 3D settings, utilizes AdamW optimization, cosine learning rate scheduling, and warm-up phases for large models. Deep supervision and data augmentation strategies adjust based on model size. For FCMAE, pretraining typically employs 800 to 1600 epochs on ImageNet-1K; fine-tuning protocols are adapted according to compute constraints and target tasks.
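The warmup-plus-cosine schedule mentioned above can be sketched as follows; the step granularity and the minimum-LR floor are illustrative assumptions rather than the published recipe:

```python
import math

def lr_at_step(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warmup followed by cosine decay to min_lr."""
    if step < warmup_steps:
        # linear ramp from 0 to base_lr over the warmup phase
        return base_lr * step / max(1, warmup_steps)
    # cosine decay from base_lr down to min_lr over the remaining steps
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))
```

In practice this would be evaluated once per iteration (or per epoch) and fed to the AdamW optimizer's learning-rate parameter.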
For resource-limited environments, smaller variants (Atto, Femto) deliver strong performance with modest computational requirements. On standard accelerators, Tiny and Base offer favorable tradeoffs. Large-scale and high-accuracy scenarios utilize the Large and Huge variants, with intermediate 22K-image pretraining and increased spatial context during fine-tuning recommended for optimal performance (Woo et al., 2023, Roy et al., 19 Dec 2025).
In volumetric domains, practical scaling should select a base configuration (e.g., 52 layers, a moderate channel base, and a small base patch), benchmark several backbones on diverse small-scale datasets, and scale depth, width, or context as dictated by GPU memory budget and downstream task complexity. Employing large-patch context at fine-tuning time is efficient and advantageous; pretraining on task-relevant modalities is generally unnecessary if full fine-tuning will be applied, as demonstrated in medical image segmentation (Roy et al., 19 Dec 2025).
7. Extensions and Broader Implications
MedNeXt-V2 operationalizes ConvNeXt V2 principles for 3D medical image analysis, inserting a 3D GRN module in every block and deploying compound scaling for large-scale volumetric data. The study establishes that backbone strength at initialization is predictive of downstream transfer, challenging the convention of focusing solely on dataset size. In both 2D and 3D, ConvNeXt V2’s design enables robust and scalable training, supporting state-of-the-art transfer across multiple visual domains. Modular open-sourced implementations facilitate reproducibility and adaptation (see official repositories for both ConvNeXt V2 and MedNeXt-V2) (Woo et al., 2023, Roy et al., 19 Dec 2025).