nnSAM Model: Hybrid Medical Segmentation
- nnSAM is a hybrid architecture that combines a frozen SAM encoder with a customizable nnUNet encoder and decoder for robust few-shot medical image segmentation.
- It employs dual-head outputs with composite losses, including Dice, cross-entropy, level-set, and curvature supervision, to enforce anatomical boundary regularity.
- nnSAM2 extends this framework to multi-modality segmentation with minimal annotation through iterative refinement and pseudo-labeling, achieving expert-level performance.
nnSAM refers to a family of hybrid neural architectures that integrate the Segment Anything Model (SAM) with the nnUNet framework to advance medical image segmentation under low-data or few-shot regimes. nnSAM architectures leverage domain-agnostic features from powerful vision foundation models while retaining the automatic, data-driven adaptivity of nnUNet. In subsequent developments, nnSAM2 further refines this paradigm, facilitating expert-level multi-modality and multi-center segmentation with minimal annotation effort. The following provides a comprehensive technical overview of nnSAM and nnSAM2, encompassing architectural principles, mathematical loss formulations, training regimes, benchmarking, and translational implications.
1. Architectural Components of nnSAM
nnSAM is constructed as a parallel, plug-and-play fusion of two encoders feeding into an nnUNet-based decoder:
- Pretrained, frozen SAM encoder (specifically, MobileSAM distilled ViT), delivering highly generalizable, domain-agnostic image embeddings.
- Trainable nnUNet encoder, automatically configured per dataset with respect to number of layers, kernel sizes, normalization strategy, data augmentations, and optimizer selection.
Embeddings from both encoders are spatially aligned and concatenated at each resolution. The hybrid representation is then processed by the nnUNet decoder with two output heads:
- Segmentation head—producing pixel-wise semantic segmentation, trained with a composite Dice plus cross-entropy loss.
- Level-set regression head—delivering a signed distance map (level set function), optimized using MSE loss on the regression output and a boundary curvature loss to enforce anatomical shape priors.
Only the nnUNet encoder and decoder weights are updated during training; the SAM encoder remains frozen. This design enables nnSAM to benefit from generic visual abstraction and dataset-specific adaptability without retraining the large foundational encoder (Li et al., 2023).
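The dual-encoder fusion above can be sketched as a small PyTorch module. This is a minimal illustration, not the authors' implementation: `sam_encoder`, `nnunet_encoder`, and `nnunet_decoder` are placeholders for the real (MobileSAM ViT and auto-configured nnUNet) components, fusion is simplified to a single scale, and the 32-channel decoder output is an assumption.

```python
import torch
import torch.nn as nn

class NnSAMFusion(nn.Module):
    """Sketch of nnSAM's parallel-encoder design (single fusion scale).

    `sam_encoder` stands in for the frozen MobileSAM ViT; `nnunet_encoder`
    and `nnunet_decoder` stand in for the auto-configured nnUNet parts.
    All three are hypothetical placeholders, not the real implementations.
    """
    def __init__(self, sam_encoder, nnunet_encoder, nnunet_decoder, n_classes):
        super().__init__()
        self.sam_encoder = sam_encoder
        for p in self.sam_encoder.parameters():
            p.requires_grad = False           # SAM encoder stays frozen
        self.nnunet_encoder = nnunet_encoder  # trainable
        self.nnunet_decoder = nnunet_decoder  # trainable, consumes fused features
        # Two output heads on shared decoder features (32 channels assumed):
        self.seg_head = nn.Conv2d(32, n_classes, kernel_size=1)       # class logits
        self.levelset_head = nn.Conv2d(32, n_classes, kernel_size=1)  # signed distances

    def forward(self, x):
        with torch.no_grad():
            sam_feat = self.sam_encoder(x)    # generic, domain-agnostic embedding
        nn_feat = self.nnunet_encoder(x)      # dataset-specific embedding
        # Spatially align SAM features to the nnUNet feature map, then concatenate.
        sam_feat = nn.functional.interpolate(
            sam_feat, size=nn_feat.shape[-2:], mode="bilinear", align_corners=False)
        fused = torch.cat([sam_feat, nn_feat], dim=1)
        dec = self.nnunet_decoder(fused)
        return self.seg_head(dec), self.levelset_head(dec)
```

During training, only the parameters of `nnunet_encoder`, `nnunet_decoder`, and the two heads would be passed to the optimizer, mirroring the frozen-SAM design.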
2. Boundary-Shape Supervision Loss
nnSAM introduces a boundary-shape supervision loss to learn anatomical regularity from limited ground-truth data. Let $p_{i,c}$ be the predicted probability at pixel $i$ for class $c$, and $g_{i,c}$ be the one-hot ground truth over $N$ pixels.
- Segmentation Loss: a composite of Dice and cross-entropy,

$$\mathcal{L}_{\text{seg}} = \mathcal{L}_{\text{Dice}} + \mathcal{L}_{\text{CE}}, \qquad \mathcal{L}_{\text{Dice}} = 1 - \frac{2\sum_{i,c} p_{i,c}\, g_{i,c}}{\sum_{i,c} p_{i,c} + \sum_{i,c} g_{i,c}}, \qquad \mathcal{L}_{\text{CE}} = -\frac{1}{N}\sum_{i,c} g_{i,c} \log p_{i,c}$$

- Level-Set Regression Loss:
Define the signed distance function for each class $c$ with region $\Omega_c$ and boundary $\partial\Omega_c$:

$$\phi_c(i) = \begin{cases} -\,d(i, \partial\Omega_c) & i \in \Omega_c \\ 0 & i \in \partial\Omega_c \\ +\,d(i, \partial\Omega_c) & i \notin \Omega_c \end{cases}$$

where $d(i, \partial\Omega_c)$ is the Euclidean distance from pixel $i$ to the boundary. The regression head predicts $\hat\phi_c$ and is trained with MSE:

$$\mathcal{L}_{\text{LS}} = \frac{1}{N}\sum_{i,c} \left(\hat\phi_c(i) - \phi_c(i)\right)^2$$

- Curvature Supervision:
To focus on boundary regularity, $\hat\phi$ is mapped via $\tilde\phi = \tanh(\hat\phi)$, which concentrates gradients near the zero level set, and local curvature is computed through first and second spatial derivatives:

$$\kappa = \nabla \cdot \left(\frac{\nabla \tilde\phi}{\lvert \nabla \tilde\phi \rvert}\right) = \frac{\tilde\phi_{xx}\,\tilde\phi_y^2 - 2\,\tilde\phi_x\,\tilde\phi_y\,\tilde\phi_{xy} + \tilde\phi_{yy}\,\tilde\phi_x^2}{\left(\tilde\phi_x^2 + \tilde\phi_y^2\right)^{3/2}}$$

The curvature loss penalizes the discrepancy between predicted and ground-truth boundary curvature:

$$\mathcal{L}_{\text{curv}} = \frac{1}{N}\sum_{i} \left\lvert \kappa\!\left(\hat\phi\right)\!(i) - \kappa\!\left(\phi\right)\!(i) \right\rvert$$

- Total Loss:

$$\mathcal{L}_{\text{total}} = \lambda_1 \mathcal{L}_{\text{seg}} + \lambda_2 \mathcal{L}_{\text{LS}} + \lambda_3 \mathcal{L}_{\text{curv}}$$

The weights $\lambda_1, \lambda_2, \lambda_3$ are set empirically (Li et al., 2023).
The combined supervision enables the model to capture region overlap, boundary alignment, and geometric plausibility, acting as a learnt shape prior in few-shot scenarios.
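The loss ingredients above can be made concrete with a short NumPy/SciPy sketch. This is an illustrative approximation, not the paper's code: the signed distance map uses the standard Euclidean-distance-transform construction, curvature uses finite differences via `np.gradient`, and the `eps` stabilizers are assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance(mask):
    """Discrete signed distance map for a binary mask: negative inside,
    positive outside; all zeros when the mask has no boundary."""
    mask = mask.astype(bool)
    if not mask.any() or mask.all():
        return np.zeros(mask.shape, dtype=np.float64)
    inside = distance_transform_edt(mask)    # distance to nearest background pixel
    outside = distance_transform_edt(~mask)  # distance to nearest foreground pixel
    return outside - inside

def dice_ce_loss(p, g, eps=1e-7):
    """Dice + cross-entropy on predicted probabilities p and one-hot targets g."""
    dice = 1.0 - (2.0 * (p * g).sum()) / (p.sum() + g.sum() + eps)
    ce = -(g * np.log(p + eps)).mean()
    return dice + ce

def curvature(phi, eps=1e-8):
    """Level-set curvature from first and second finite differences,
    matching kappa = div(grad(phi) / |grad(phi)|) in 2D."""
    py, px = np.gradient(phi)        # first derivatives (rows, cols)
    pyy, pyx = np.gradient(py)       # second derivatives
    pxy, pxx = np.gradient(px)
    denom = (px**2 + py**2) ** 1.5 + eps
    return (pxx * py**2 - 2.0 * px * py * pxy + pyy * px**2) / denom
```

A perfect prediction drives the Dice and cross-entropy terms toward zero, while the curvature term penalizes jagged boundaries even when region overlap is already high.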
3. Training Regimes, Preprocessing, and Data Augmentation
nnSAM leverages nnUNet’s full auto-configuration pipeline:
- Intensity normalization per dataset
- Resizing: 2D slices to 256×256 for nnUNet encoder, then upsampled to 1024×1024 for SAM input
- Online augmentations including random rotations, scaling, and elastic deformations
- Minibatch sizes and learning rates are auto-configured
- Only the nnUNet encoder/decoder is fine-tuned; the SAM encoder remains fixed
Training experiments investigate small-sample regimes (5–20 labeled samples), charting the relationship between sample size and segmentation accuracy across modalities and organs. The regression and curvature heads are shown to reduce overfitting and increase anatomical regularity, particularly in low-data settings (Li et al., 2023).
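The two-resolution preprocessing above can be sketched in a few lines. This is a simplified stand-in: nnUNet auto-configures the normalization scheme per dataset, so the fixed percentile clipping and bilinear resampling here are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_slice(img):
    """Normalize a 2D slice and produce the two resolutions nnSAM consumes:
    256x256 for the nnUNet encoder and a 1024x1024 upsample for the frozen
    SAM encoder. Percentile bounds and interpolation order are illustrative."""
    img = img.astype(np.float64)
    # Clip to robust percentiles, then z-score normalize (the real per-dataset
    # scheme is chosen automatically by nnUNet's configuration step).
    lo, hi = np.percentile(img, [0.5, 99.5])
    img = np.clip(img, lo, hi)
    img = (img - img.mean()) / (img.std() + 1e-8)
    # Resample to the nnUNet working resolution ...
    small = zoom(img, (256 / img.shape[0], 256 / img.shape[1]), order=1)
    # ... then upsample 4x to the 1024x1024 input expected by SAM.
    large = zoom(small, (4, 4), order=1)
    return small, large
```

Online augmentations (rotations, scaling, elastic deformations) would be applied before this resampling during training.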
4. Quantitative Benchmarking
nnSAM is evaluated on four segmentation benchmarks: brain white matter (MR), heart substructure (CT), liver (CT), and lung (chest X-ray). Metrics include the Dice similarity coefficient (DICE) and average surface distance (ASD).
| Task | Method | DICE (%) | ASD (mm) |
|---|---|---|---|
| MR white matter (N=20) | nnUNet | 79.25 ± 17.24 | 1.36 ± 1.63 |
| | nnSAM | 82.77 ± 10.12 | 1.14 ± 1.03 |
| CT heart substructure (N=20) | nnUNet | 93.76 ± 2.95 | 1.48 ± 0.65 |
| | nnSAM | 94.19 ± 1.51 | 1.36 ± 0.42 |
| CT liver (N=20) | nnUNet | 83.69 ± 26.32 | 6.70 ± 15.66 |
| | nnSAM | 85.24 ± 23.74 | 6.18 ± 16.02 |
| Chest X-ray lung (N=20) | nnUNet | 93.01 ± 2.41 | 1.63 ± 0.57 |
| | nnSAM | 93.63 ± 1.49 | 1.47 ± 0.42 |
In every scenario, nnSAM surpasses nnUNet and other baselines (UNet, Attention UNet, SwinUNet, TransUNet, AutoSAM), with improvements accentuated at lower N. This suggests that leveraging advanced pre-trained representations and shape regularization is especially advantageous in data-constrained settings (Li et al., 2023).
5. Extension: nnSAM2 for One-Shot Multi-Modality Segmentation
nnSAM2 generalizes the nnSAM paradigm to enable one-shot or few-shot segmentation in challenging, heterogeneous, multi-site cohorts (Zhang et al., 7 Oct 2025). The core pipeline is as follows:
Stage 1: SAM2 Pseudo-Label Generation
- One manually labeled axial slice per dataset is used as a prompt.
- SAM2 produces pseudo-labels for all slices, with IoU confidence scores assessed.
Stage 2: Iterative nnU-Net Refinement
- Pseudo-labels are filtered by IoU and anatomical constraints.
- Three sequential nnU-Net models are trained independently on the top confidence masks at each stage, with progressively refined outputs.
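The confidence-based filtering step in Stage 2 can be sketched as follows. The tuple layout, `keep_fraction`, and `min_iou` threshold are illustrative assumptions, not the values used in the paper.

```python
def select_pseudo_labels(pseudo_labels, keep_fraction=0.5, min_iou=0.8):
    """Rank SAM2 pseudo-labels by their predicted IoU confidence and keep the
    top fraction that also clears a confidence floor.

    `pseudo_labels` is assumed to be a list of (slice_id, mask, predicted_iou)
    tuples; anatomical-constraint checks would be applied in the same pass.
    """
    # Discard low-confidence masks outright.
    confident = [pl for pl in pseudo_labels if pl[2] >= min_iou]
    # Keep the highest-confidence fraction for the next nnU-Net training round.
    confident.sort(key=lambda pl: pl[2], reverse=True)
    n_keep = max(1, int(len(confident) * keep_fraction)) if confident else 0
    return confident[:n_keep]
```

Each refinement round would retrain an nnU-Net on the selected masks and regenerate pseudo-labels, progressively widening the pool of trusted annotations.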
Core Metrics and Evaluation
- Large-scale evaluation: 1,219 scans, 19,439 slices, and 762 subjects across 6 datasets.
- Mean DICE for lumbar paraspinal muscle segmentation: 0.94–0.96 (MR) and 0.92–0.93 (CT), exceeding previous methods under the same low-shot constraint.
- Equivalence of derived measurements (muscle volume, fat ratio, CT attenuation) to expert references (TOST) and high reliability (ICC 0.86–1.00).
Comparative Table
| Dataset | Method | Mean DICE (L/R) |
|---|---|---|
| AFL T2W MRI | SAM2 | 0.92/0.93 |
| | FAMNet | 0.69/0.67 |
| | TotalSegmentator | 0.80/0.79 |
| | nnSAM2 | 0.95/0.95 |
| Back-pain T1W | SAM2 | 0.94/0.94 |
| | FAMNet | 0.68/0.67 |
| | TotalSegmentator | 0.84/0.82 |
| | nnSAM2 | 0.96/0.96 |
The annotation burden is reduced by >99% relative to traditional workflows; only six slices are labeled for 19,433 test samples. A plausible implication is that this efficiency, combined with expert-level segmentation, positions nnSAM2 as a practical approach for multicenter clinical deployment and minimally supervised phenotyping (Zhang et al., 7 Oct 2025).
6. Clinical and Methodological Implications
The nnSAM architecture supports:
- Rapid and robust deployment to rare pathologies, novel modalities, or anatomies with minimal annotation.
- End-to-end learning of anatomical shape priors via curvature and distance-map supervision, promoting boundary regularity and plausibility.
- Implementation as fully automatic (prompt-free) segmentation solutions in operational settings, potentially enabling one-shot/zero-shot generalization.
Validated across MRI/CT/Dixon and multi-domain settings, nnSAM and nnSAM2 provide an empirical foundation for foundation-model-driven segmentation strategies in medical imaging, showing reliable performance gains as the sample size decreases (Li et al., 2023, Zhang et al., 7 Oct 2025).