
ConvNeXt: Evolving CNNs for Vision Tasks

Updated 27 January 2026
  • ConvNeXt is a family of CNN architectures that modernizes traditional models with transformer-inspired design choices and robust training practices.
  • It employs a four-stage hierarchical structure with advanced convolutional blocks, layer normalization, and stochastic depth for efficient performance.
  • Variants like ConvNeXt V2 and E-ConvNeXt demonstrate enhanced accuracy and scalability across image classification, detection, segmentation, and specialized applications.

ConvNeXt is a family of pure convolutional neural network (CNN) architectures developed to match the performance and scalability of vision transformers (ViTs) by systematically adopting and integrating modern architectural and training refinements. It retains the fundamental advantages of convolutional priors while incorporating micro- and macro-level design choices, normalization, and optimization strategies inspired by the transformer paradigm, resulting in a highly competitive backbone for image classification, detection, segmentation, and specialized domains.

1. Architectural Foundations and Modernization

ConvNeXt originated from a comprehensive investigation into which elements of hierarchical ViT architectures confer empirical gains and how these can be ported to a convolutional framework (Liu et al., 2022). Traditional ResNets were incrementally “modernized” by:

  • Adopting hierarchical stage compute ratios and deepening the network (e.g., (3,3,9,3) block configuration, mirroring Swin-T).
  • Introducing a "patchify" stem with a 4×4 stride-4 convolution rather than the original 7×7 stride-2 plus max pooling.
  • Utilizing depthwise convolutions with increased kernel size (e.g., 7×7) and expanding channels, as in ResNeXt.
  • Inserting inverted bottleneck structures—ordering operations as depthwise conv → expansion → activation → projection.
  • Employing a single GELU nonlinearity per block and a single LayerNorm (over channels), placed before the pointwise convolutions.
  • Replacing BatchNorm with LayerNorm and separating downsampling layers.
  • Standardizing optimization recipes (AdamW, strong data augmentation, stochastic depth, label smoothing, EMA).

This modernization enables ConvNeXt to match or exceed Swin Transformer and similar architectures in ImageNet-1K and downstream tasks with efficient, hardware-friendly CNN operations (Liu et al., 2022).
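Two of the macro changes above, the patchify stem and the separated downsampling layers, are easy to see in code. A minimal PyTorch sketch follows (widths match ConvNeXt-T; the `LayerNorm2d` helper is an assumed convenience wrapper, not an official API):

```python
import torch
import torch.nn as nn

class LayerNorm2d(nn.LayerNorm):
    """Channel-wise LayerNorm for (N, C, H, W) tensors."""
    def forward(self, x):
        x = x.permute(0, 2, 3, 1)           # to channels-last (N, H, W, C)
        x = super().forward(x)
        return x.permute(0, 3, 1, 2)        # back to (N, C, H, W)

# "Patchify" stem: non-overlapping 4x4 patches via a 4x4 stride-4 conv,
# replacing ResNet's 7x7 stride-2 conv plus max pooling.
stem = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=4, stride=4),
    LayerNorm2d(96),
)

# Separate downsampling layer between stages: LN then a strided 2x2 conv.
downsample = nn.Sequential(
    LayerNorm2d(96),
    nn.Conv2d(96, 192, kernel_size=2, stride=2),
)

x = torch.randn(1, 3, 224, 224)
print(stem(x).shape)                        # torch.Size([1, 96, 56, 56])
print(downsample(stem(x)).shape)            # torch.Size([1, 192, 28, 28])
```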

2. Core ConvNeXt Block and Mathematical Structure

A ConvNeXt block operates on $x \in \mathbb{R}^{H \times W \times C}$ and consists of:

  1. Depthwise 7×7 convolution: $\mathrm{DW}_7(\cdot)$, padding 3, mixing information spatially within each channel.
  2. LayerNorm: $\mathrm{LN}$, applied over channels at each spatial position.
  3. Pointwise MLP: two 1×1 convolutions: $P_1$ expands the channels ($C \to 4C$), a GELU activation follows, then $P_2$ projects back ($4C \to C$).
  4. Residual and LayerScale: the output is scaled by a learned $\gamma_l$ and added to the input through a DropPath (stochastic depth) operation $\mathrm{DP}$.

Mathematically:

$$x_{l+1} = x_l + \mathrm{DP}\left(\gamma_l \cdot P_2(\mathrm{GELU}(P_1(\mathrm{LN}(\mathrm{DW}_7(x_l)))))\right)$$

This block is the workhorse of all ConvNeXt backbones (Liu et al., 2022).
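The block structure above translates almost line-for-line into PyTorch. The following is a simplified sketch rather than the reference implementation: the pointwise convolutions are written as `nn.Linear` over a channels-last layout, and DropPath is a bare-bones sample-wise version:

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """x + DropPath(gamma * P2(GELU(P1(LN(DW7(x)))))) on (N, C, H, W) input."""
    def __init__(self, dim, layer_scale_init=1e-6, drop_path=0.0):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)             # over channels, channels-last
        self.pwconv1 = nn.Linear(dim, 4 * dim)    # P1: 1x1 conv as Linear
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * dim, dim)    # P2: project back to C
        self.gamma = nn.Parameter(layer_scale_init * torch.ones(dim))  # LayerScale
        self.drop_path = drop_path                # stochastic depth rate

    def forward(self, x):
        shortcut = x
        x = self.dwconv(x)                        # depthwise 7x7
        x = x.permute(0, 2, 3, 1)                 # (N, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = (self.gamma * x).permute(0, 3, 1, 2)  # scale, back to (N, C, H, W)
        if self.training and self.drop_path > 0:  # sample-wise DropPath
            keep = (torch.rand(x.shape[0], 1, 1, 1, device=x.device)
                    >= self.drop_path).float()
            x = x * keep / (1.0 - self.drop_path)
        return shortcut + x
```

With the 1e-6 LayerScale initialization each block starts as a near-identity map, which helps stabilize very deep stacks.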

3. Macro-Architecture, Scaling, and Variants

ConvNeXt uses a four-stage hierarchy, each stage downsampling spatially and increasing channel width, with the following canonical configurations (Liu et al., 2022):

| Model | Channels (C₁, C₂, C₃, C₄) | Blocks (B₁, B₂, B₃, B₄) | #Params | FLOPs @ 224² | Top-1 @ 224 |
|---|---|---|---|---|---|
| ConvNeXt-T | (96, 192, 384, 768) | (3, 3, 9, 3) | 29 M | 4.5 G | 82.1 % |
| ConvNeXt-S | (96, 192, 384, 768) | (3, 3, 27, 3) | 50 M | 8.7 G | 83.1 % |
| ConvNeXt-B | (128, 256, 512, 1024) | (3, 3, 27, 3) | 89 M | 15.4 G | 83.8 % |
| ConvNeXt-L | (192, 384, 768, 1536) | (3, 3, 27, 3) | 198 M | 34.4 G | 84.3 % |
| ConvNeXt-XL | (256, 512, 1024, 2048) | (3, 3, 27, 3) | 350 M | 60.9 G | 87.0 %* |

* XL trained on ImageNet-22K → 1K.

When pre-trained on ImageNet-22K and fine-tuned, top-1 accuracy for B/L/XL reaches 86.8/87.5/87.8%, matching or surpassing Swin Transformer and other strong backbones at equivalent computational budgets. The architecture features a convolutional stem, four ConvNeXt stages with explicit downsampling, and a classification head (Liu et al., 2022).
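As a quick sanity check on the macro structure, the table's stage configurations can be captured in a small dictionary and the per-stage feature-map shapes derived from the stride-4 stem and the stride-2 downsamples. The `feature_shapes` helper below is illustrative, not part of any release:

```python
# Canonical ConvNeXt stage configurations:
# (channels per stage, blocks per stage), as in the table above.
CONVNEXT_CONFIGS = {
    "tiny":   ((96, 192, 384, 768),    (3, 3, 9, 3)),
    "small":  ((96, 192, 384, 768),    (3, 3, 27, 3)),
    "base":   ((128, 256, 512, 1024),  (3, 3, 27, 3)),
    "large":  ((192, 384, 768, 1536),  (3, 3, 27, 3)),
    "xlarge": ((256, 512, 1024, 2048), (3, 3, 27, 3)),
}

def feature_shapes(variant, image_size=224):
    """Width and spatial size of each stage's output feature map."""
    channels, _ = CONVNEXT_CONFIGS[variant]
    shapes = []
    size = image_size // 4              # stride-4 patchify stem
    for c in channels:
        shapes.append((c, size, size))
        size //= 2                      # stride-2 downsampling between stages
    return shapes

print(feature_shapes("tiny"))
# [(96, 56, 56), (192, 28, 28), (384, 14, 14), (768, 7, 7)]
```

The 4×, 8×, 16×, 32× output strides match the feature pyramid expected by standard detection and segmentation heads, which is why ConvNeXt drops into FPN/UperNet pipelines without modification.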

4. Training Procedures and Best Practices

ConvNeXt employs a transformer-inspired training recipe:

  • AdamW optimizer (β₁=0.9, β₂=0.999), base learning rate 4×10⁻³ (cosine decay, 20-epoch warmup), weight decay 0.05.
  • Data augmentations: RandAugment, MixUp, CutMix, Random Erasing.
  • Regularization: label smoothing 0.1, stochastic depth (depth-scaled), LayerScale initialization 1e-6, EMA=0.9999.
  • Fine-tuning and transfer learning: adapted learning rate schedules and strong augmentation persist in downstream tasks.
  • For ImageNet-22K pre-training: same regimen, 90 epochs, no EMA.

This training strategy is decisive for the competitive accuracy and stability observed across tasks (Liu et al., 2022).
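The optimizer and schedule portions of this recipe can be sketched as follows; `lr_at_epoch` is a hypothetical helper implementing linear warmup plus cosine decay with the hyperparameters listed above (real implementations typically step per-iteration and add layer-wise LR decay for fine-tuning):

```python
import math
import torch

def lr_at_epoch(epoch, total_epochs=300, base_lr=4e-3, warmup_epochs=20):
    """Linear warmup for the first 20 epochs, then cosine decay to zero."""
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

model = torch.nn.Linear(10, 10)      # stand-in for a ConvNeXt backbone
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=4e-3, betas=(0.9, 0.999), weight_decay=0.05)

# Per-epoch schedule update (per-iteration in practice):
for epoch in range(3):
    for group in optimizer.param_groups:
        group["lr"] = lr_at_epoch(epoch)
```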

5. Extensions: ConvNeXt V2 and Efficient Lightweight Variants

ConvNeXt V2: Co-design with Self-supervision and Global Response Normalization

ConvNeXt V2 integrates a fully convolutional masked autoencoder (FCMAE) with the core architecture and a novel Global Response Normalization (GRN) layer. GRN, inserted between the block's GELU activation and its final pointwise projection, enforces inter-channel competition and mitigates feature collapse during self-supervised MAE pre-training (Woo et al., 2023). Model scaling ranges from Atto (3.7 M parameters, 76.7 % top-1) to Huge (659 M, 88.9 %). Key advances include:

  • FCMAE: Adapts MAE for pure CNNs using sparse convolutions for efficiency.
  • GRN: Channel-wise divisive normalization with per-channel calibration.
  • Synergistic improvement: Only V2+GRN+FCMAE yields substantial self-supervised gains over supervised V1 (+1.3pp on ImageNet-1K for Large).
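A sketch of GRN following the V2 paper's three-step formulation (global aggregation, divisive normalization, calibration); the zero-initialized affine parameters make the layer an identity at the start of training:

```python
import torch
import torch.nn as nn

class GRN(nn.Module):
    """Global Response Normalization (ConvNeXt V2), for (N, H, W, C) tensors."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.beta = nn.Parameter(torch.zeros(1, 1, 1, dim))
        self.eps = eps

    def forward(self, x):
        # 1. Aggregate: global L2 response per channel over spatial positions.
        gx = torch.norm(x, p=2, dim=(1, 2), keepdim=True)    # (N, 1, 1, C)
        # 2. Normalize: divide by the mean response across channels
        #    (this is the inter-channel competition term).
        nx = gx / (gx.mean(dim=-1, keepdim=True) + self.eps)
        # 3. Calibrate: reweight the input; residual keeps identity at init.
        return self.gamma * (x * nx) + self.beta + x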

Lightweight Variants: E-ConvNeXt

E-ConvNeXt targets edge/mid-range regimes by integrating Cross Stage Partial Network (CSPNet), optimizing the convolutional stem, and replacing LayerScale with efficient channel attention (ESE). Model scaling and performance:

| Model | #Params | GFLOPs | Top-1 Acc. |
|---|---|---|---|
| ConvNeXt-tiny | 28.6 M | 4.47 G | 82.1 % |
| E-ConvNeXt-mini | 7.6 M | 0.93 G | 78.3 % |
| E-ConvNeXt-tiny | 13.2 M | 2.04 G | 80.6 % |
| E-ConvNeXt-small | 19.4 M | 3.12 G | 81.9 % |

The architectural modifications yield substantial reductions in computational and parameter cost (up to 80%) with modest accuracy loss, confirming the effectiveness of CSP-inspired partial computation strategies and lightweight attention (Wang et al., 28 Aug 2025).
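The cited paper's exact ESE formulation is not reproduced here; as an illustration, the effective squeeze-excitation design the acronym usually refers to is a single-FC channel-attention gate with no reduction bottleneck, which can be sketched as:

```python
import torch
import torch.nn as nn

class ESE(nn.Module):
    """Effective squeeze-excitation: single-FC channel attention.
    Illustrative stand-in for the module E-ConvNeXt uses in place of LayerScale."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Conv2d(dim, dim, kernel_size=1)  # one FC, no bottleneck
        self.gate = nn.Hardsigmoid()

    def forward(self, x):                             # x: (N, C, H, W)
        w = x.mean(dim=(2, 3), keepdim=True)          # global average pool
        return x * self.gate(self.fc(w))              # per-channel reweighting
```

Unlike LayerScale's static per-channel factors, this gate recomputes the channel weights from each input, at the cost of one extra 1×1 convolution.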

6. Applications and Empirical Results

ConvNeXt demonstrates strong empirical performance across diverse domains:

  • Image Classification: On ImageNet-1K, ConvNeXt-B matches Swin-B while outperforming RegNetY-16G in both accuracy and throughput (Liu et al., 2022). ConvNeXt V2 further improves accuracy, especially under self-supervised pre-training (Woo et al., 2023).
  • Object Detection and Segmentation: On COCO, ConvNeXt-T/B/XL consistently surpass Swin counterparts for box and mask AP (Liu et al., 2022). UperNet-based segmentation on ADE20K validates transferability for dense prediction.
  • Medical Imaging: ConvNeXt-small achieves 94.33% AUC and 93.36% accuracy for breast cancer detection, outperforming EfficientNetV2-S (92.34% AUC) on RSNA mammography (Hasan, 24 May 2025). In polyp segmentation, ConvNeXt-Base+MPE achieves IoU 0.8163 and Dice 0.8818, outperforming UNet backbones (Mau et al., 2023).
  • Scientific Reconstruction: μ-Net leverages stacked 3D ConvNeXt blocks for cosmic muon tomography, achieving 17.14 dB PSNR at 1,024 muons and outperforming classical PoCA across all dose levels (Lim et al., 2023).

ConvNeXt’s flexible architecture and block design facilitate efficient adaptation for 2D/3D medical, scientific, and industrial use-cases.

7. Large-Kernel and Mixed-Kernel Extensions

Explorations into large-kernel design (InceptionNeXt) show that decomposing the ConvNeXt token-mixing step into parallel branches, a small square convolution, 1×k_b and k_b×1 band convolutions, and an identity mapping, markedly increases training throughput and reduces the memory-bound slowdowns associated with large 7×7 depthwise convolutions on modern hardware (Yu et al., 2023). Notably, InceptionNeXt-T achieves a 57 % training-throughput increase and a 0.2 pp top-1 improvement over ConvNeXt-T (82.3 % vs. 82.1 %). Band kernels are critical for preserving receptive field without excessive parameters or FLOPs.
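This mixed-kernel token mixer amounts to a channel split across four parallel depthwise branches. In the sketch below the branch ratio and band size (k_b = 11) follow the InceptionNeXt defaults, but treat the details as illustrative:

```python
import torch
import torch.nn as nn

class InceptionDWConv2d(nn.Module):
    """Inception-style depthwise token mixer: channels split across an identity
    branch, a small-square branch, and two orthogonal band-convolution branches."""
    def __init__(self, dim, square_kernel=3, band_kernel=11, branch_ratio=0.125):
        super().__init__()
        gc = int(dim * branch_ratio)                    # channels per conv branch
        self.dwconv_hw = nn.Conv2d(gc, gc, square_kernel,
                                   padding=square_kernel // 2, groups=gc)
        self.dwconv_w = nn.Conv2d(gc, gc, (1, band_kernel),
                                  padding=(0, band_kernel // 2), groups=gc)
        self.dwconv_h = nn.Conv2d(gc, gc, (band_kernel, 1),
                                  padding=(band_kernel // 2, 0), groups=gc)
        self.split_sizes = (dim - 3 * gc, gc, gc, gc)   # identity gets the rest

    def forward(self, x):                               # x: (N, C, H, W)
        x_id, x_hw, x_w, x_h = torch.split(x, self.split_sizes, dim=1)
        return torch.cat(
            (x_id, self.dwconv_hw(x_hw), self.dwconv_w(x_w), self.dwconv_h(x_h)),
            dim=1)
```

Leaving most channels on the identity branch is what recovers the throughput: only a fraction of the channels pay for spatial mixing, while the band kernels keep the large effective receptive field.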

This suggests that future ConvNeXt-style designs can adopt mixed-kernel or partial-channel demultiplexing to efficiently cover wider receptive fields, reduce memory-access bottlenecks, and further lower carbon footprint without sacrificing accuracy. Implementation at the kernel level or dynamically allocating branches per channel group represents promising directions (Yu et al., 2023).


In summary, ConvNeXt establishes a paradigm in which convolutional architectures, equipped with transformer-era innovations in block design, normalization, and training, consistently approach or surpass transformer baselines under comparable computational constraints. The architecture's adaptability (V2 self-supervised pre-training, E-ConvNeXt edge variants, large-kernel decompositions) and demonstrated empirical success across domains substantiate its status as a reference architecture for scalable, efficient visual recognition.
