MedNeXt: 3D ConvNet for Medical Segmentation
- MedNeXt is a 3D ConvNet segmentation architecture that adapts ConvNeXt blocks with depthwise separable convolutions and residual connections for efficient medical image analysis.
- It employs a U-Net-style encoder-decoder topology and compound scaling across depth, width, and receptive field to optimize performance from limited data.
- The architecture features iterative UpKern kernel scaling and deep supervision with hybrid loss functions to enhance segmentation of complex lesions.
MedNeXt is a 3D ConvNet segmentation architecture that adapts and scales ConvNeXt blocks for volumetric medical imaging. It integrates depthwise separable convolutions, U-Net-style encoder-decoder topology, and compound scaling strategies, and is designed for high data-efficiency and adaptability across modalities, lesion types, and data quality regimes. MedNeXt, along with its successors (notably MedNeXt-v2), has established itself as a state-of-the-art backbone for medical image segmentation, outperforming both transformer-based and standard CNN architectures under limited and large-scale data settings (Roy et al., 2023, Roy et al., 19 Dec 2025, Jaheen et al., 31 Jul 2025, Musah, 3 Aug 2025).
1. Core Architectural Principles and Block Design
The MedNeXt backbone is composed of residual ConvNeXt-style bottleneck blocks adapted for 3D convolutions. Each block comprises:
- A 3D depthwise convolution (typically 3×3×3, upgraded to 5×5×5 via UpKern in some variants) operating channel-wise.
- Channel expansion and compression via two pointwise (1×1×1) convolutions, with a configurable expansion ratio R.
- Normalization layers situated before or after the depthwise convolution depending on variant: 3D LayerNorm or InstanceNorm are used in place of classic GroupNorm or BatchNorm.
- Nonlinear activation by GELU between the expansion and compression layers.
- A residual pathway that adds the block input to its output.
The typical block operation sequence is:

$$y \;=\; x \;+\; \mathrm{Conv}^{1\times1\times1}_{\downarrow}\Big(\mathrm{GELU}\big(\mathrm{Conv}^{1\times1\times1}_{\uparrow R}\big(\mathrm{Norm}(\mathrm{DWConv}_{k\times k\times k}(x))\big)\big)\Big),$$

with the normalization placed before or after the depthwise convolution depending on the variant.
In MedNeXt-v2, a 3D Global Response Normalization (GRN) is inserted after the nonlinearity, stabilizing training during large channel expansion. GRN computes per-channel norms to prevent feature collapse as expansion ratios increase (Roy et al., 19 Dec 2025).
No explicit attention or gating is employed beyond the residual path; the block instead relies on the locality learned by the depthwise convolution and the capacity provided by wide channel expansion.
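A minimal PyTorch sketch of such a block, including an optional 3D GRN as used in MedNeXt-v2, is given below. This is an illustrative re-implementation, not the reference code: the class names, the per-channel GroupNorm choice, and the default expansion ratio of 4 are assumptions.

```python
import torch
import torch.nn as nn


class GRN3d(nn.Module):
    """Global Response Normalization for 3D feature maps (ConvNeXt-v2 style, illustrative)."""

    def __init__(self, channels: int, eps: float = 1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1, channels, 1, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1, 1))
        self.eps = eps

    def forward(self, x):
        # Per-channel L2 norm over the spatial dimensions (D, H, W).
        gx = x.pow(2).sum(dim=(2, 3, 4), keepdim=True).sqrt()
        # Divisive normalization across channels, then learnable scale/shift plus residual.
        nx = gx / (gx.mean(dim=1, keepdim=True) + self.eps)
        return self.gamma * (x * nx) + self.beta + x


class MedNeXtStyleBlock(nn.Module):
    """Residual block: depthwise conv -> norm -> 1x1x1 expansion -> GELU -> (GRN) -> 1x1x1 compression."""

    def __init__(self, channels: int, expansion: int = 4, kernel_size: int = 3, use_grn: bool = False):
        super().__init__()
        self.dwconv = nn.Conv3d(channels, channels, kernel_size,
                                padding=kernel_size // 2, groups=channels)
        # Per-channel (InstanceNorm-like) normalization; the actual norm choice varies by variant.
        self.norm = nn.GroupNorm(num_groups=channels, num_channels=channels)
        self.expand = nn.Conv3d(channels, channels * expansion, kernel_size=1)
        self.act = nn.GELU()
        self.grn = GRN3d(channels * expansion) if use_grn else nn.Identity()
        self.compress = nn.Conv3d(channels * expansion, channels, kernel_size=1)

    def forward(self, x):
        residual = x
        x = self.dwconv(x)
        x = self.norm(x)
        x = self.expand(x)
        x = self.act(x)
        x = self.grn(x)          # active only in v2-style blocks
        x = self.compress(x)
        return x + residual
```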
2. Encoder–Decoder Topology and Compound Scaling
MedNeXt follows a U-Net-like macro-architecture with symmetric encoder and decoder trajectories:
- Encoder: Sequential downsampling stages, each halving spatial resolution while doubling channel width, composed of MedNeXt blocks.
- Bottleneck: The deepest layer with the largest channel width (up to 1024 in large variants).
- Decoder: Mirrors the encoder, upsampling via transposed convolutions or interpolation + convs; includes skip connections for spatial feature fusion.
The architecture supports compound scaling along three axes:
- Depth: Number of ConvNeXt blocks per stage.
- Width: Channel multiplication factors.
- Receptive field: Depthwise kernel size (3×3×3, enlarged up to 5×5×5 via UpKern).
Scaling is controlled by a single coefficient φ (following the EfficientNet approach):

$$\text{depth} \propto \alpha^{\phi}, \qquad \text{width} \propto \beta^{\phi}, \qquad \text{kernel size} \propto \gamma^{\phi},$$

with the per-axis constants α, β, γ chosen under a fixed compute budget. Selection of φ trades off parameter efficiency against compute and accuracy. MedNeXt-v2 formalizes four model scales (Tiny to Large), with base channels ranging from 32 to 96 and correspondingly larger input patch sizes (Roy et al., 19 Dec 2025). Model details for the base variant:
| Stage | Blocks | Channels | Kernel | Down/Up Sample |
|---|---|---|---|---|
| Stem | – | 64 | 1×1×1 | – |
| Enc 1 | 3 | 64 | 3×3×3 | Conv3 s=2 (64→128) |
| Enc 2 | 3 | 128 | 3×3×3 | Conv3 s=2 (128→256) |
| Enc 3 | 9 | 256 | 3×3×3 | Conv3 s=2 (256→512) |
| Enc 4 | 3 | 512 | 3×3×3 | Conv3 s=2 (512→1024) |
| Enc 5 | 3 | 1024 | 3×3×3 | – |
| Bottleneck | – | 1024 | – | – |
| Decoder* | mirror | – | – | TransConv3D s=2 |
| Head | – | # classes | 1×1×1 | – |
*The decoder stages mirror the encoder in reverse.
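The stage layout in the table can be expressed as a simple configuration object from which the encoder is assembled. The sketch below is schematic and reuses the `MedNeXtStyleBlock` from the earlier snippet; the dictionary layout and `build_encoder` helper are illustrative assumptions, not the reference implementation.

```python
import torch.nn as nn

# Schematic stage plan mirroring the base-variant table above.
BASE_PLAN = {
    "stem": {"channels": 64, "kernel": 1},
    "stages": [
        {"blocks": 3, "channels": 64,   "kernel": 3, "down": True},   # Enc 1 -> 128
        {"blocks": 3, "channels": 128,  "kernel": 3, "down": True},   # Enc 2 -> 256
        {"blocks": 9, "channels": 256,  "kernel": 3, "down": True},   # Enc 3 -> 512
        {"blocks": 3, "channels": 512,  "kernel": 3, "down": True},   # Enc 4 -> 1024
        {"blocks": 3, "channels": 1024, "kernel": 3, "down": False},  # Enc 5 / bottleneck
    ],
}


def build_encoder(plan: dict, in_channels: int = 1) -> nn.Sequential:
    """Assemble stem + encoder stages; each downsampling conv halves resolution and doubles width."""
    layers = [nn.Conv3d(in_channels, plan["stem"]["channels"], kernel_size=plan["stem"]["kernel"])]
    for stage in plan["stages"]:
        layers += [MedNeXtStyleBlock(stage["channels"], kernel_size=stage["kernel"])
                   for _ in range(stage["blocks"])]
        if stage["down"]:
            layers.append(nn.Conv3d(stage["channels"], stage["channels"] * 2,
                                    kernel_size=3, stride=2, padding=1))
    return nn.Sequential(*layers)
```

The decoder would mirror this plan in reverse, replacing the strided convolutions with transposed convolutions and adding skip connections.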
Enlarged regions-of-interest are obtained by cropping and padding to fixed volumetric extents, which provides additional spatial context for tumor boundaries (Jaheen et al., 31 Jul 2025).
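A minimal sketch of such a crop-or-pad step is shown below; the function name and the 160³ default target are illustrative assumptions rather than values from the cited work.

```python
import numpy as np


def crop_or_pad_to(volume: np.ndarray, target=(160, 160, 160), pad_value=0.0) -> np.ndarray:
    """Center-crop and/or symmetrically pad a 3D volume to a fixed extent (illustrative)."""
    out = volume
    for axis, size in enumerate(target):
        current = out.shape[axis]
        if current > size:                        # crop around the center
            start = (current - size) // 2
            out = np.take(out, np.arange(start, start + size), axis=axis)
        elif current < size:                      # pad symmetrically
            before = (size - current) // 2
            pad = [(0, 0)] * out.ndim
            pad[axis] = (before, size - current - before)
            out = np.pad(out, pad, constant_values=pad_value)
    return out
```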
3. Iterative Kernel Scaling (UpKern) and Receptive Field
MedNeXt introduces the UpKern algorithm to expand convolutional kernel sizes without training instability:
- Initial training uses compact 3×3×3 kernels.
- Trained kernels are upsampled by trilinear interpolation to 5×5×5 (and, in principle, larger sizes), enabling larger receptive fields.
- All non-kernel parameters are copied verbatim.
- No from-scratch retraining is needed; the learned local filters are preserved while their spatial reach is extended (Roy et al., 2023, Musah, 3 Aug 2025).
For a four-stage, stride-2 encoder the effective receptive field is approximately $1 + (k-1)(1+2+4+8)$ voxels, so it expands from 31 voxels at k = 3 to 61 voxels at k = 5, roughly doubling the contextual capacity available for segmentation (Musah, 3 Aug 2025).
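A minimal sketch of the UpKern weight transfer is given below; the `upkern_transfer` name and the state-dict handling are assumptions, and the reference implementation may differ in detail.

```python
import torch
import torch.nn.functional as F


def upkern_transfer(small_state: dict, large_model: torch.nn.Module) -> torch.nn.Module:
    """Initialize a large-kernel model from a small-kernel checkpoint (UpKern-style, illustrative).

    Convolution kernels whose spatial shape differs are resized by trilinear interpolation;
    every other parameter is copied verbatim.
    """
    large_state = large_model.state_dict()
    for name, small_param in small_state.items():
        target = large_state[name]
        if small_param.shape == target.shape:
            large_state[name] = small_param.clone()           # copy verbatim
        elif small_param.dim() == 5:                          # Conv3d weight: (out, in, kD, kH, kW)
            large_state[name] = F.interpolate(
                small_param, size=tuple(target.shape[2:]),
                mode="trilinear", align_corners=True,
            )
    large_model.load_state_dict(large_state)
    return large_model
```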
4. Deep Supervision, Loss Functions, and Training Details
MedNeXt and EMedNeXt employ deeply supervised outputs at multiple decoder depths, each supervised by hybrid Dice–Focal or Dice–cross-entropy losses:
- Auxiliary segmentation heads emit predictions at progressively coarser resolutions.
- In EMedNeXt, the total loss aggregates per-scale boundary-aware objectives:

$$\mathcal{L}_{\text{total}} = \sum_{s} w_s\,\mathcal{L}^{(s)}, \qquad \mathcal{L}^{(s)} = \mathcal{L}^{(s)}_{\text{Dice}} + \mathcal{L}^{(s)}_{\text{Focal}} + \lambda\,\mathcal{L}^{(s)}_{\text{boundary}},$$

where $w_s$ weights the prediction at decoder scale $s$ and $\lambda$ balances the boundary term.
Boundary loss emphasizes contour agreement, supporting recovery of small or blurred lesions in low-quality data (Jaheen et al., 31 Jul 2025).
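A schematic PyTorch version of such a deeply supervised objective is sketched below, using Dice plus cross-entropy (one of the two hybrids mentioned above) and omitting the boundary term; the halving scale weights and function names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def soft_dice_loss(logits, target_onehot, eps=1e-5):
    """Soft Dice loss; logits and float one-hot target both shaped (N, C, D, H, W)."""
    probs = torch.softmax(logits, dim=1)
    dims = (0, 2, 3, 4)
    intersection = (probs * target_onehot).sum(dims)
    union = probs.sum(dims) + target_onehot.sum(dims)
    return 1.0 - ((2.0 * intersection + eps) / (union + eps)).mean()


def deep_supervision_loss(multi_scale_logits, target_onehot, scale_weights=None):
    """Aggregate hybrid Dice + cross-entropy losses over decoder outputs at several resolutions."""
    if scale_weights is None:
        # Halve the contribution of each coarser scale (an nnU-Net-style convention).
        scale_weights = [0.5 ** i for i in range(len(multi_scale_logits))]
    total = 0.0
    for w, logits in zip(scale_weights, multi_scale_logits):
        # Downsample the one-hot ground truth to the resolution of this auxiliary head.
        tgt = F.interpolate(target_onehot, size=logits.shape[2:], mode="nearest")
        ce = F.cross_entropy(logits, tgt.argmax(dim=1))
        total = total + w * (soft_dice_loss(logits, tgt) + ce)
    return total / sum(scale_weights)
```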
Training regimens use AdamW (with schedule-free or linear decay), small fixed batch sizes, extensive test-time augmentations (e.g., 7-way flipping), and data normalization strategies (z-score, isotropic resampling). nnU-Net preprocessing and augmentation pipelines are standard.
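The 7-way flipping test-time augmentation corresponds to averaging the original prediction with predictions for the seven non-identity flip combinations along the three spatial axes. A minimal sketch, with a hypothetical function name, assuming the model outputs class logits:

```python
import itertools
import torch


@torch.no_grad()
def flip_tta(model, volume: torch.Tensor) -> torch.Tensor:
    """Average softmax predictions over the identity and all 7 axis-flip combinations.

    `volume` is shaped (N, C, D, H, W); each flip is undone on the output before averaging.
    """
    spatial_dims = (2, 3, 4)
    combos = [axes for r in range(4) for axes in itertools.combinations(spatial_dims, r)]
    accumulated = 0.0
    for axes in combos:                           # () plus 7 non-empty subsets = 8 passes
        flipped = torch.flip(volume, dims=axes) if axes else volume
        pred = torch.softmax(model(flipped), dim=1)
        accumulated = accumulated + (torch.flip(pred, dims=axes) if axes else pred)
    return accumulated / len(combos)
```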
5. Model Ensembling and Postprocessing
Robust inference is achieved by:
- Training multiple checkpoints/architectural variants (e.g., MedNeXtV2 Base=3, Base=5).
- Performing sliding-window inference over overlapping patches (e.g., 50% overlap).
- Aggregating predictions via weighted softmax map averaging:

$$\hat{p}(x) = \frac{\sum_i w_i\,\mathrm{softmax}\big(f_i(x)\big)}{\sum_i w_i},$$

where $f_i$ denotes the $i$-th ensemble member and $w_i$ its weight.
- Postprocessing steps include hard thresholding, 26-connected component filtering by size and mean score, tumor sub-region hierarchy enforcement (ET ⊆ TC ⊆ WT), and final class-priority fusion (Jaheen et al., 31 Jul 2025, Musah, 3 Aug 2025).
These procedures reduce spurious false positives and stabilize predictions in degraded, low-SNR imaging contexts.
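A compact sketch of the ensemble-averaging and hierarchy-enforcement steps follows; the function names, the boolean-mask label convention, and the omission of the connected-component filter are illustrative simplifications.

```python
import numpy as np


def weighted_softmax_average(softmax_maps, weights=None):
    """Weighted average of per-model softmax maps, each shaped (C, D, H, W)."""
    weights = np.ones(len(softmax_maps)) if weights is None else np.asarray(weights, dtype=float)
    stacked = np.stack(softmax_maps, axis=0)                 # (M, C, D, H, W)
    return np.tensordot(weights / weights.sum(), stacked, axes=1)


def enforce_hierarchy(et: np.ndarray, tc: np.ndarray, wt: np.ndarray):
    """Make BraTS sub-region masks nested (ET ⊆ TC ⊆ WT) by propagating labels outward."""
    tc = np.logical_or(tc, et)      # tumor core must contain enhancing tumor
    wt = np.logical_or(wt, tc)      # whole tumor must contain tumor core
    return et, tc, wt
```

Connected-component filtering (e.g., via `scipy.ndimage.label`) would typically run between these two steps, discarding components below a size or mean-score threshold.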
6. Empirical Results and Impact in Medical Segmentation
MedNeXt and its enhanced variants (notably EMedNeXt and MedNeXt-v2) have demonstrated:
- Consistently superior Dice and boundary (e.g., normalized surface Dice) metrics versus nnU-Net and transformer-based competitors across benchmark datasets and tasks (BraTS, AMOS22, KiTS19, BTCV) (Roy et al., 2023, Roy et al., 19 Dec 2025).
- On low-quality, limited-data settings (e.g., sub-Saharan Africa MRI), incremental architectural and training enhancements provide reproducible gains:
| Model/Enhancement | Mean DSC | NSD (0.5 mm) | NSD (1.0 mm) |
|---|---|---|---|
| MedNeXt V1 baseline | 0.839 | 0.395 | — |
| + MedNeXt V2 backbone | 0.873 | 0.472 | — |
| + SSA fine-tuning | 0.884 | 0.518 | — |
| + “Base=3” channels | 0.878 | 0.513 | — |
| + Ensemble models | 0.896 | 0.537 | — |
| + Optimized postprocessing | 0.897 | 0.541 | 0.84 |
Boundary adherence and small lesion recovery are particularly improved, as evidenced by normalized surface Dice and visualizations (Jaheen et al., 31 Jul 2025).
For breast tumor segmentation, UpKern-initialized large-kernel MedNeXt achieved a Dice score of 0.67 and a normalized Hausdorff distance of 0.24, outperforming baseline models (Dice 0.64, normalized Hausdorff 0.30) (Musah, 3 Aug 2025).
7. Comparison to Related Architectures and Future Directions
Compared to transformer-based architectures (SwinUNETR, UNETR), MedNeXt achieves superior performance in data-limited contexts due to higher convolutional inductive bias and stable optimization. MedNeXt-v2’s introduction of 3D GRN further addresses feature collapse during width scaling, supporting larger, more expressive models (Roy et al., 19 Dec 2025).
Ablation studies demonstrate diminishing returns for width scaling when pretraining data is fixed, and a more substantial advantage from input patch/context scaling on boundary-sensitive tasks. The UpKern mechanism enables stable training of large receptive fields on small datasets where naive training would fail (Musah, 3 Aug 2025).
Limitations include increased computational and memory costs for large kernels, and the difficulty of training enlarged receptive fields from scratch rather than via UpKern initialization. Proposed future work includes dynamic kernel sizing within and across blocks, integration of clinical variables via cross-modal attention, and advanced post-hoc calibration for downstream clinical predictions.
MedNeXt’s open-source implementations are available, notably in the nnU-Net pipeline, facilitating further comparative study and application (Roy et al., 2023, Jaheen et al., 31 Jul 2025, Roy et al., 19 Dec 2025, Musah, 3 Aug 2025).