
nnSAM Model: Hybrid Medical Segmentation

Updated 1 December 2025
  • nnSAM is a hybrid architecture that combines a frozen SAM encoder with a customizable nnUNet encoder and decoder for robust few-shot medical image segmentation.
  • It employs dual-head outputs with composite losses, including Dice, cross-entropy, level-set, and curvature supervision, to enforce anatomical boundary regularity.
  • nnSAM2 extends this framework to multi-modality segmentation with minimal annotation through iterative refinement and pseudo-labeling, achieving expert-level performance.

nnSAM refers to a family of hybrid neural architectures that integrate the Segment Anything Model (SAM) with the nnUNet framework to advance medical image segmentation under low-data or few-shot regimes. nnSAM architectures leverage domain-agnostic features from powerful vision foundation models while retaining the automatic and data-driven adaptivity of nnUNet. In subsequent developments, nnSAM2 further refines this paradigm, facilitating expert-level multi-modality and multi-center segmentation with minimal annotation effort. The following provides a comprehensive technical overview of nnSAM and nnSAM2, encompassing architectural principles, mathematical loss formulations, training regimes, benchmarking, and translational implications.

1. Architectural Components of nnSAM

nnSAM is constructed as a parallel, plug-and-play fusion of two encoders feeding into an nnUNet-based decoder:

  • Pretrained, frozen SAM encoder (specifically, a MobileSAM-distilled ViT), delivering highly generalizable, domain-agnostic image embeddings.
  • Trainable nnUNet encoder, automatically configured per dataset with respect to number of layers, kernel sizes, normalization strategy, data augmentations, and optimizer selection.

Embeddings from both encoders are spatially aligned and concatenated at each resolution. The hybrid representation is then processed by the nnUNet decoder with two output heads:

  • Segmentation head—producing pixel-wise semantic segmentation, trained with a composite Dice plus cross-entropy loss.
  • Level-set regression head—delivering a signed distance map (level set function), optimized using MSE loss on the regression output and a boundary curvature loss to enforce anatomical shape priors.

Only the nnUNet encoder and decoder weights are updated during training; the SAM encoder remains frozen. This design enables nnSAM to benefit from generic visual abstraction and dataset-specific adaptability without retraining the large foundational encoder (Li et al., 2023).
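
The fusion can be pictured with a short PyTorch sketch. This is an illustrative assumption of how the pieces connect, not the authors' released code: sam_encoder, cnn_encoder, and decoder stand in for the frozen MobileSAM ViT, the auto-configured nnUNet encoder, and the nnUNet decoder, and the decoder is assumed to accept the widened (concatenated) feature maps.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridEncoderFusion(nn.Module):
    """Illustrative fusion of a frozen SAM encoder with a trainable nnUNet-style encoder.

    All submodules are placeholders; shapes and the fusion points are assumptions
    made for exposition, not the published implementation.
    """
    def __init__(self, sam_encoder, cnn_encoder, decoder, num_classes):
        super().__init__()
        self.sam_encoder = sam_encoder
        for p in self.sam_encoder.parameters():   # SAM weights stay frozen
            p.requires_grad = False
        self.cnn_encoder = cnn_encoder            # trainable, dataset-specific
        self.decoder = decoder                    # nnUNet-style decoder (assumed to accept the widened features)
        # Dual output heads: semantic segmentation and level-set regression
        self.seg_head = nn.Conv2d(decoder.out_channels, num_classes, kernel_size=1)
        self.lvl_head = nn.Conv2d(decoder.out_channels, num_classes, kernel_size=1)

    def forward(self, x):
        # Frozen, domain-agnostic embedding (SAM expects a 1024x1024 input)
        with torch.no_grad():
            sam_feat = self.sam_encoder(
                F.interpolate(x, size=(1024, 1024), mode="bilinear", align_corners=False))
        cnn_feats = list(self.cnn_encoder(x))     # multi-scale trainable features
        # Spatially align the SAM embedding to each resolution and concatenate
        for i, f in enumerate(cnn_feats):
            s = F.interpolate(sam_feat, size=f.shape[-2:], mode="bilinear", align_corners=False)
            cnn_feats[i] = torch.cat([f, s], dim=1)
        dec = self.decoder(cnn_feats)
        return self.seg_head(dec), self.lvl_head(dec)
```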

2. Boundary-Shape Supervision Loss

nnSAM introduces a boundary-shape supervision loss to learn anatomical regularity from limited ground-truth data. Let $p_j(a,b)$ be the predicted probability at pixel $(a,b)$ for class $j$, and let $y_j(a,b) \in \{0,1\}$ be the one-hot ground truth.

  • Segmentation Loss:

$$L_\mathrm{DICE} = 1 - \frac{2\sum_{a,b,j} p_j(a,b)\,y_j(a,b)}{\sum_{a,b,j}\big[p_j(a,b) + y_j(a,b)\big]}$$

$$L_\mathrm{CE} = -\frac{1}{HWC}\sum_{a,b,j} y_j(a,b)\,\ln p_j(a,b)$$

Total segmentation loss: $L_s = L_\mathrm{DICE} + L_\mathrm{CE}$

  • Level-Set Regression Loss:

Define the signed distance function $\varphi(a,b)$ for each class:

$$\varphi(a,b) = \begin{cases} -d(a,b), & (a,b)\ \text{inside the object} \\ 0, & (a,b)\ \text{on the boundary} \\ +d(a,b), & (a,b)\ \text{outside the object} \end{cases}$$

Loss, where $\varphi'_j$ denotes the level-set map predicted by the regression head:

$$L_l = \frac{1}{HWC}\sum_{a,b,j}\big[\varphi_j(a,b) - \varphi'_j(a,b)\big]^2$$

  • Curvature Supervision:

To focus on boundary regularity, $\varphi$ is mapped via $\hat{\varphi}(a,b) = \mathrm{Sigmoid}(-1000\,\varphi(a,b))$, and the local curvature $K_{\hat\varphi}$ is computed from first and second derivatives:

$$K_{\hat\varphi} = \frac{\big|(1 + \hat\varphi_a^2)\,\hat\varphi_{bb} + (1 + \hat\varphi_b^2)\,\hat\varphi_{aa} - 2\,\hat\varphi_a \hat\varphi_b \hat\varphi_{ab}\big|}{2\big[1 + \hat\varphi_a^2 + \hat\varphi_b^2\big]^{3/2}}$$

The curvature loss:

$$L_c = \frac{1}{HWC}\sum_{a,b,j}\big|K_{\hat\varphi_j}(a,b) - K_{\hat\varphi'_j}(a,b)\big|$$

  • Total Loss:

$$L = \lambda_1 L_s + \lambda_2 L_l + \lambda_3 L_c$$

Empirically, $\lambda_1 = 1.0$, $\lambda_2 = 0.1$, and $\lambda_3 = 10^{-4}$ (Li et al., 2023).

The combined supervision enables the model to capture region overlap, boundary alignment, and geometric plausibility, acting as a learnt shape prior in few-shot scenarios.
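
As a minimal sketch of this composite loss, the snippet below assumes the ground-truth signed distance map is precomputed per class from the label mask (e.g. with scipy.ndimage.distance_transform_edt) and approximates the curvature term with central finite differences; shapes, helper names, and numerical details are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

def signed_distance_map(mask: np.ndarray) -> np.ndarray:
    """Signed distance: negative inside the object, positive outside, zero on the boundary."""
    inside = distance_transform_edt(mask)
    outside = distance_transform_edt(1 - mask)
    return outside - inside

def _curvature(phi: torch.Tensor) -> torch.Tensor:
    """Curvature map of phi (B, C, H, W) via central finite differences."""
    phi_a = (torch.roll(phi, -1, dims=-2) - torch.roll(phi, 1, dims=-2)) / 2
    phi_b = (torch.roll(phi, -1, dims=-1) - torch.roll(phi, 1, dims=-1)) / 2
    phi_aa = torch.roll(phi, -1, dims=-2) - 2 * phi + torch.roll(phi, 1, dims=-2)
    phi_bb = torch.roll(phi, -1, dims=-1) - 2 * phi + torch.roll(phi, 1, dims=-1)
    phi_ab = (torch.roll(phi_a, -1, dims=-1) - torch.roll(phi_a, 1, dims=-1)) / 2
    num = torch.abs((1 + phi_a**2) * phi_bb + (1 + phi_b**2) * phi_aa - 2 * phi_a * phi_b * phi_ab)
    den = 2 * (1 + phi_a**2 + phi_b**2) ** 1.5
    return num / den

def nnsam_loss(seg_logits, lvl_pred, target, phi_gt, w=(1.0, 0.1, 1e-4)):
    """Composite loss: Dice + CE (segmentation head) + MSE + curvature (level-set head)."""
    probs = torch.softmax(seg_logits, dim=1)
    onehot = F.one_hot(target, probs.shape[1]).permute(0, 3, 1, 2).float()
    dice = 1 - 2 * (probs * onehot).sum() / (probs + onehot).sum().clamp_min(1e-8)
    ce = F.cross_entropy(seg_logits, target)
    mse = F.mse_loss(lvl_pred, phi_gt)
    # Steep sigmoid turns the distance maps into soft masks before curvature comparison
    curv = F.l1_loss(_curvature(torch.sigmoid(-1000 * lvl_pred)),
                     _curvature(torch.sigmoid(-1000 * phi_gt)))
    return w[0] * (dice + ce) + w[1] * mse + w[2] * curv
```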

3. Training Regimes, Preprocessing, and Data Augmentation

nnSAM leverages nnUNet’s full auto-configuration pipeline:

  • Intensity normalization per dataset
  • Resizing: 2D slices to 256×256 for nnUNet encoder, then upsampled to 1024×1024 for SAM input
  • Online augmentations including random rotations, scaling, and elastic deformations
  • Minibatch sizes and learning rates are auto-configured
  • Only the nnUNet encoder/decoder is fine-tuned; the SAM encoder remains fixed

Training experiments investigate small-sample regimes (5–20 labeled samples), charting the relationship between sample size and segmentation accuracy across modalities and organs. The regression and curvature heads are shown to reduce overfitting and increase anatomical regularity, particularly in low-data settings (Li et al., 2023).
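
A minimal per-slice preprocessing sketch under these conventions is shown below; the interpolation modes, the z-score normalization, and the 3-channel replication for the ViT input are assumptions chosen for illustration (nnUNet selects the normalization scheme per dataset).

```python
import torch
import torch.nn.functional as F

def prepare_slice(slice_2d: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Illustrative preprocessing: z-score normalization, a 256x256 input for the
    nnUNet encoder, and a 1024x1024 upsampled copy for the SAM encoder."""
    x = slice_2d.float()
    x = (x - x.mean()) / x.std().clamp_min(1e-8)            # intensity normalization
    x = x[None, None]                                        # (1, 1, H, W)
    x_unet = F.interpolate(x, size=(256, 256), mode="bilinear", align_corners=False)
    x_sam = F.interpolate(x_unet, size=(1024, 1024), mode="bilinear", align_corners=False)
    x_sam = x_sam.repeat(1, 3, 1, 1)                         # SAM's ViT expects 3 channels
    return x_unet, x_sam
```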

4. Quantitative Benchmarking

nnSAM is evaluated on four segmentation benchmarks: brain white matter (MR), heart substructure (CT), liver (CT), and lung (chest X-ray). Metrics include the Dice similarity coefficient (DICE) and the average surface distance (ASD).

Task | Method | DICE (%) | ASD (mm)
MR White Matter (N=20) | nnUNet | 79.25 ± 17.24 | 1.36 ± 1.63
MR White Matter (N=20) | nnSAM | 82.77 ± 10.12 | 1.14 ± 1.03
CT Heart Substructure (N=20) | nnUNet | 93.76 ± 2.95 | 1.48 ± 0.65
CT Heart Substructure (N=20) | nnSAM | 94.19 ± 1.51 | 1.36 ± 0.42
CT Liver (N=20) | nnUNet | 83.69 ± 26.32 | 6.70 ± 15.66
CT Liver (N=20) | nnSAM | 85.24 ± 23.74 | 6.18 ± 16.02
Chest X-ray Lung (N=20) | nnUNet | 93.01 ± 2.41 | 1.63 ± 0.57
Chest X-ray Lung (N=20) | nnSAM | 93.63 ± 1.49 | 1.47 ± 0.42

In every scenario, nnSAM surpasses nnUNet and other baselines (UNet, Attention UNet, SwinUNet, TransUNet, AutoSAM), with improvements accentuated at lower N. This suggests that leveraging advanced pre-trained representations and shape regularization is especially advantageous in data-constrained settings (Li et al., 2023).

5. Extension: nnSAM2 for One-Shot Multi-Modality Segmentation

nnSAM2 generalizes the nnSAM paradigm to enable one-shot or few-shot segmentation in challenging, heterogeneous, multi-site cohorts (Zhang et al., 7 Oct 2025). The core pipeline is as follows:

Stage 1: SAM2 Pseudo-Label Generation

  • One manually labeled axial slice per dataset is used as a prompt.
  • SAM2 propagates the prompt to produce pseudo-labels for all slices, each accompanied by an IoU confidence score.

Stage 2: Iterative nnU-Net Refinement

  • Pseudo-labels are filtered by IoU and anatomical constraints.
  • Three nnU-Net models are trained sequentially, each on the top-confidence masks available at its stage, yielding progressively refined outputs.
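
The two stages can be summarized with the schematic Python sketch below; propagate_with_sam2, passes_anatomical_checks, train_nnunet, and predict_with_confidence are hypothetical placeholders for SAM2 prompt propagation, rule-based quality checks, nnU-Net training, and re-inference, and the IoU threshold is an assumed value.

```python
def nnsam2_pipeline(volumes, prompt_slice, prompt_mask, iou_threshold=0.8, rounds=3):
    """Schematic two-stage nnSAM2 pipeline (placeholder functions, illustrative only)."""
    # Stage 1: propagate the single labeled slice through every volume with SAM2,
    # keeping each predicted mask together with its IoU confidence score.
    pseudo_labels = []
    for vol in volumes:
        for sl, mask, iou in propagate_with_sam2(vol, prompt_slice, prompt_mask):
            pseudo_labels.append({"slice": sl, "mask": mask, "conf": iou})

    # Stage 2: iterative nnU-Net refinement on filtered pseudo-labels.
    model = None
    for _ in range(rounds):
        trusted = [p for p in pseudo_labels
                   if p["conf"] >= iou_threshold and passes_anatomical_checks(p["mask"])]
        model = train_nnunet(trusted)                 # a fresh nnU-Net per round
        # Re-label every slice with the newly trained model and update confidences
        pseudo_labels = [{"slice": p["slice"], **predict_with_confidence(model, p["slice"])}
                         for p in pseudo_labels]
    return model
```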

Core Metrics and Evaluation

  • Large-scale evaluation: 1,219 scans (19,439 slices) from 762 subjects across 6 datasets.
  • Mean DICE for lumbar paraspinal muscle segmentation: 0.94–0.96 (MR) and 0.92–0.93 (CT), exceeding previous methods under the same low-shot constraint.
  • Measurement equivalence (muscle volume, fat ratio, CT attenuation) with expert references (TOST, $p < 0.05$) and high reliability (ICC 0.86–1.00).

Comparative Table

Dataset | Method | Mean DICE (L/R)
AFL T2W MRI | SAM2 | 0.92/0.93
AFL T2W MRI | FAMNet | 0.69/0.67
AFL T2W MRI | TotalSegmentor | 0.80/0.79
AFL T2W MRI | nnSAM2 | 0.95/0.95
Back-pain T1W | SAM2 | 0.94/0.94
Back-pain T1W | FAMNet | 0.68/0.67
Back-pain T1W | TotalSegmentor | 0.84/0.82
Back-pain T1W | nnSAM2 | 0.96/0.96

The annotation burden is reduced by >99% relative to traditional workflows; only six slices are labeled for 19,433 test samples. A plausible implication is that this efficiency, combined with expert-level segmentation, positions nnSAM2 as a practical approach for multicenter clinical deployment and minimally supervised phenotyping (Zhang et al., 7 Oct 2025).

6. Clinical and Methodological Implications

The nnSAM architecture supports:

  • Rapid and robust deployment to rare pathologies, novel modalities, or anatomies with minimal annotation.
  • End-to-end learning of anatomical shape priors via curvature and distance-map supervision, promoting boundary regularity and plausibility.
  • Implementation as fully automatic (prompt-free) segmentation solutions in operational settings, potentially enabling one-shot/zero-shot generalization.

Validated across MRI/CT/Dixon and multi-domain settings, nnSAM and nnSAM2 provide an empirical foundation for foundation-model-driven segmentation strategies in medical imaging, showing reliable performance gains as the sample size decreases (Li et al., 2023, Zhang et al., 7 Oct 2025).
