nnSAM Model: Hybrid Medical Segmentation
- nnSAM is a hybrid architecture that combines a frozen SAM encoder with a customizable nnUNet encoder and decoder for robust few-shot medical image segmentation.
- It employs dual-head outputs with composite losses, including Dice, cross-entropy, level-set, and curvature supervision, to enforce anatomical boundary regularity.
- nnSAM2 extends this framework to multi-modality segmentation with minimal annotation through iterative refinement and pseudo-labeling, achieving expert-level performance.
nnSAM refers to a family of hybrid neural architectures that integrate the Segment Anything Model (SAM) with the nnUNet framework to advance medical image segmentation under low-data or few-shot regimes. nnSAM architectures leverage domain-agnostic features from powerful vision foundation models while retaining the automatic, data-driven adaptivity of nnUNet. In subsequent developments, nnSAM2 further refines this paradigm, facilitating expert-level multi-modality and multi-center segmentation with minimal annotation effort. The following provides a comprehensive technical overview of nnSAM and nnSAM2, encompassing architectural principles, mathematical loss formulations, training regimes, benchmarking, and translational implications.
1. Architectural Components of nnSAM
nnSAM is constructed as a parallel, plug-and-play fusion of two encoders feeding into an nnUNet-based decoder:
- Pretrained, frozen SAM encoder (specifically, MobileSAM distilled ViT), delivering highly generalizable, domain-agnostic image embeddings.
- Trainable nnUNet encoder, automatically configured per dataset with respect to number of layers, kernel sizes, normalization strategy, data augmentations, and optimizer selection.
Embeddings from both encoders are spatially aligned and concatenated at each resolution. The hybrid representation is then processed by the nnUNet decoder with two output heads:
- Segmentation head—producing pixel-wise semantic segmentation, trained with a composite Dice plus cross-entropy loss.
- Level-set regression head—delivering a signed distance map (level set function), optimized using MSE loss on the regression output and a boundary curvature loss to enforce anatomical shape priors.
Only the nnUNet encoder and decoder weights are updated during training; the SAM encoder remains frozen. This design enables nnSAM to benefit from generic visual abstraction and dataset-specific adaptability without retraining the large foundational encoder (Li et al., 2023).
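The dual-encoder fusion above can be sketched as a small PyTorch module. This is a minimal illustration, not the authors' implementation: `sam_encoder`, `nnunet_encoder`, and `nnunet_decoder` are placeholders for the real (MobileSAM ViT and auto-configured nnUNet) components, fusion is simplified to a single scale, and the 32-channel decoder output is an assumption.

```python
import torch
import torch.nn as nn

class NnSAMFusion(nn.Module):
    """Sketch of nnSAM's parallel-encoder design (single fusion scale).

    `sam_encoder` stands in for the frozen MobileSAM ViT; `nnunet_encoder`
    and `nnunet_decoder` stand in for the auto-configured nnUNet parts.
    All three are hypothetical placeholders, not the real implementations.
    """
    def __init__(self, sam_encoder, nnunet_encoder, nnunet_decoder, n_classes):
        super().__init__()
        self.sam_encoder = sam_encoder
        for p in self.sam_encoder.parameters():
            p.requires_grad = False           # SAM encoder stays frozen
        self.nnunet_encoder = nnunet_encoder  # trainable
        self.nnunet_decoder = nnunet_decoder  # trainable, consumes fused features
        # Two output heads on shared decoder features (32 channels assumed):
        self.seg_head = nn.Conv2d(32, n_classes, kernel_size=1)       # class logits
        self.levelset_head = nn.Conv2d(32, n_classes, kernel_size=1)  # signed distances

    def forward(self, x):
        with torch.no_grad():
            sam_feat = self.sam_encoder(x)    # generic, domain-agnostic embedding
        nn_feat = self.nnunet_encoder(x)      # dataset-specific embedding
        # Spatially align SAM features to the nnUNet feature map, then concatenate.
        sam_feat = nn.functional.interpolate(
            sam_feat, size=nn_feat.shape[-2:], mode="bilinear", align_corners=False)
        fused = torch.cat([sam_feat, nn_feat], dim=1)
        dec = self.nnunet_decoder(fused)
        return self.seg_head(dec), self.levelset_head(dec)
```

During training, only the parameters of `nnunet_encoder`, `nnunet_decoder`, and the two heads would be passed to the optimizer, mirroring the frozen-SAM design.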
2. Boundary-Shape Supervision Loss
nnSAM introduces a boundary-shape supervision loss to learn anatomical regularity from limited ground-truth data. Let $p_{i,c}$ be the predicted probability at pixel $i$ for class $c$, and $g_{i,c}$ be the one-hot ground truth over $N$ pixels.
- Segmentation Loss: a composite of Dice and cross-entropy,

$$\mathcal{L}_{\text{seg}} = \mathcal{L}_{\text{Dice}} + \mathcal{L}_{\text{CE}}, \qquad \mathcal{L}_{\text{Dice}} = 1 - \frac{2\sum_{i,c} p_{i,c}\, g_{i,c}}{\sum_{i,c} p_{i,c} + \sum_{i,c} g_{i,c}}, \qquad \mathcal{L}_{\text{CE}} = -\frac{1}{N}\sum_{i,c} g_{i,c} \log p_{i,c}$$

- Level-Set Regression Loss:
Define the signed distance function for each class $c$ with region $\Omega_c$ and boundary $\partial\Omega_c$:

$$\phi_c(i) = \begin{cases} -\,d(i, \partial\Omega_c) & i \in \Omega_c \\ 0 & i \in \partial\Omega_c \\ +\,d(i, \partial\Omega_c) & i \notin \Omega_c \end{cases}$$

where $d(i, \partial\Omega_c)$ is the Euclidean distance from pixel $i$ to the boundary. The regression head predicts $\hat\phi_c$ and is trained with MSE:

$$\mathcal{L}_{\text{LS}} = \frac{1}{N}\sum_{i,c} \left(\hat\phi_c(i) - \phi_c(i)\right)^2$$

- Curvature Supervision:
To focus on boundary regularity, $\hat\phi$ is mapped via $\tilde\phi = \tanh(\hat\phi)$, which concentrates gradients near the zero level set, and local curvature is computed through first and second spatial derivatives:

$$\kappa = \nabla \cdot \left(\frac{\nabla \tilde\phi}{\lvert \nabla \tilde\phi \rvert}\right) = \frac{\tilde\phi_{xx}\,\tilde\phi_y^2 - 2\,\tilde\phi_x\,\tilde\phi_y\,\tilde\phi_{xy} + \tilde\phi_{yy}\,\tilde\phi_x^2}{\left(\tilde\phi_x^2 + \tilde\phi_y^2\right)^{3/2}}$$

The curvature loss penalizes the discrepancy between predicted and ground-truth boundary curvature:

$$\mathcal{L}_{\text{curv}} = \frac{1}{N}\sum_{i} \left\lvert \kappa\!\left(\hat\phi\right)\!(i) - \kappa\!\left(\phi\right)\!(i) \right\rvert$$

- Total Loss:

$$\mathcal{L}_{\text{total}} = \lambda_1 \mathcal{L}_{\text{seg}} + \lambda_2 \mathcal{L}_{\text{LS}} + \lambda_3 \mathcal{L}_{\text{curv}}$$

The weights $\lambda_1, \lambda_2, \lambda_3$ are set empirically (Li et al., 2023).
The combined supervision enables the model to capture region overlap, boundary alignment, and geometric plausibility, acting as a learnt shape prior in few-shot scenarios.
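The loss ingredients above can be made concrete with a short NumPy/SciPy sketch. This is an illustrative approximation, not the paper's code: the signed distance map uses the standard Euclidean-distance-transform construction, curvature uses finite differences via `np.gradient`, and the `eps` stabilizers are assumptions.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance(mask):
    """Discrete signed distance map for a binary mask: negative inside,
    positive outside; all zeros when the mask has no boundary."""
    mask = mask.astype(bool)
    if not mask.any() or mask.all():
        return np.zeros(mask.shape, dtype=np.float64)
    inside = distance_transform_edt(mask)    # distance to nearest background pixel
    outside = distance_transform_edt(~mask)  # distance to nearest foreground pixel
    return outside - inside

def dice_ce_loss(p, g, eps=1e-7):
    """Dice + cross-entropy on predicted probabilities p and one-hot targets g."""
    dice = 1.0 - (2.0 * (p * g).sum()) / (p.sum() + g.sum() + eps)
    ce = -(g * np.log(p + eps)).mean()
    return dice + ce

def curvature(phi, eps=1e-8):
    """Level-set curvature from first and second finite differences,
    matching kappa = div(grad(phi) / |grad(phi)|) in 2D."""
    py, px = np.gradient(phi)        # first derivatives (rows, cols)
    pyy, pyx = np.gradient(py)       # second derivatives
    pxy, pxx = np.gradient(px)
    denom = (px**2 + py**2) ** 1.5 + eps
    return (pxx * py**2 - 2.0 * px * py * pxy + pyy * px**2) / denom
```

A perfect prediction drives the Dice and cross-entropy terms toward zero, while the curvature term penalizes jagged boundaries even when region overlap is already high.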
3. Training Regimes, Preprocessing, and Data Augmentation
nnSAM leverages nnUNet’s full auto-configuration pipeline:
- Intensity normalization per dataset
- Resizing: 2D slices to 256×256 for nnUNet encoder, then upsampled to 1024×1024 for SAM input
- Online augmentations including random rotations, scaling, and elastic deformations
- Minibatch sizes and learning rates are auto-configured
- Only the nnUNet encoder/decoder is fine-tuned; the SAM encoder remains fixed
Training experiments investigate small-sample regimes (5–20 labeled samples), charting the relationship between sample size and segmentation accuracy across modalities and organs. The regression and curvature heads are shown to reduce overfitting and increase anatomical regularity, particularly in low-data settings (Li et al., 2023).
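The two-resolution preprocessing above can be sketched in a few lines. This is a simplified stand-in: nnUNet auto-configures the normalization scheme per dataset, so the fixed percentile clipping and bilinear resampling here are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import zoom

def preprocess_slice(img):
    """Normalize a 2D slice and produce the two resolutions nnSAM consumes:
    256x256 for the nnUNet encoder and a 1024x1024 upsample for the frozen
    SAM encoder. Percentile bounds and interpolation order are illustrative."""
    img = img.astype(np.float64)
    # Clip to robust percentiles, then z-score normalize (the real per-dataset
    # scheme is chosen automatically by nnUNet's configuration step).
    lo, hi = np.percentile(img, [0.5, 99.5])
    img = np.clip(img, lo, hi)
    img = (img - img.mean()) / (img.std() + 1e-8)
    # Resample to the nnUNet working resolution ...
    small = zoom(img, (256 / img.shape[0], 256 / img.shape[1]), order=1)
    # ... then upsample 4x to the 1024x1024 input expected by SAM.
    large = zoom(small, (4, 4), order=1)
    return small, large
```

Online augmentations (rotations, scaling, elastic deformations) would be applied before this resampling during training.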
4. Quantitative Benchmarking
nnSAM is evaluated on four segmentation benchmarks: brain white matter (MR), heart substructure (CT), liver (CT), and lung (chest X-ray). Metrics include the Dice similarity coefficient (DICE) and average surface distance (ASD).
| Task | Method | DICE (%) | ASD (mm) |
|---|---|---|---|
| MR white matter (N=20) | nnUNet | 79.25 ± 17.24 | 1.36 ± 1.63 |
| | nnSAM | 82.77 ± 10.12 | 1.14 ± 1.03 |
| CT heart substructure (N=20) | nnUNet | 93.76 ± 2.95 | 1.48 ± 0.65 |
| | nnSAM | 94.19 ± 1.51 | 1.36 ± 0.42 |
| CT liver (N=20) | nnUNet | 83.69 ± 26.32 | 6.70 ± 15.66 |
| | nnSAM | 85.24 ± 23.74 | 6.18 ± 16.02 |
| Chest X-ray lung (N=20) | nnUNet | 93.01 ± 2.41 | 1.63 ± 0.57 |
| | nnSAM | 93.63 ± 1.49 | 1.47 ± 0.42 |
In every scenario, nnSAM surpasses nnUNet and other baselines (UNet, Attention UNet, SwinUNet, TransUNet, AutoSAM), with improvements accentuated at lower N. This suggests that leveraging advanced pre-trained representations and shape regularization is especially advantageous in data-constrained settings (Li et al., 2023).
5. Extension: nnSAM2 for One-Shot Multi-Modality Segmentation
nnSAM2 generalizes the nnSAM paradigm to enable one-shot or few-shot segmentation in challenging, heterogeneous, multi-site cohorts (Zhang et al., 7 Oct 2025). The core pipeline is as follows:
Stage 1: SAM2 Pseudo-Label Generation
- One manually labeled axial slice per dataset is used as a prompt.
- SAM2 produces pseudo-labels for all slices, with IoU confidence scores assessed.
Stage 2: Iterative nnU-Net Refinement
- Pseudo-labels are filtered by IoU and anatomical constraints.
- Three sequential nnU-Net models are trained independently on the top confidence masks at each stage, with progressively refined outputs.
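The confidence-based filtering step in Stage 2 can be sketched as follows. The tuple layout, `keep_fraction`, and `min_iou` threshold are illustrative assumptions, not the values used in the paper.

```python
def select_pseudo_labels(pseudo_labels, keep_fraction=0.5, min_iou=0.8):
    """Rank SAM2 pseudo-labels by their predicted IoU confidence and keep the
    top fraction that also clears a confidence floor.

    `pseudo_labels` is assumed to be a list of (slice_id, mask, predicted_iou)
    tuples; anatomical-constraint checks would be applied in the same pass.
    """
    # Discard low-confidence masks outright.
    confident = [pl for pl in pseudo_labels if pl[2] >= min_iou]
    # Keep the highest-confidence fraction for the next nnU-Net training round.
    confident.sort(key=lambda pl: pl[2], reverse=True)
    n_keep = max(1, int(len(confident) * keep_fraction)) if confident else 0
    return confident[:n_keep]
```

Each refinement round would retrain an nnU-Net on the selected masks and regenerate pseudo-labels, progressively widening the pool of trusted annotations.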
Core Metrics and Evaluation
- Large-scale evaluation: 1,219 scans, 19,439 slices, and 762 subjects across 6 datasets.
- Mean DICE for lumbar paraspinal muscle segmentation: 0.94–0.96 (MR) and 0.92–0.93 (CT), exceeding previous methods under the same low-shot constraint.
- Equivalence of derived measurements (muscle volume, fat ratio, CT attenuation) to expert references (TOST) and high reliability (ICC 0.86–1.00).
Comparative Table
| Dataset | Method | Mean DICE (L/R) |
|---|---|---|
| AFL T2W MRI | SAM2 | 0.92/0.93 |
| | FAMNet | 0.69/0.67 |
| | TotalSegmentator | 0.80/0.79 |
| | nnSAM2 | 0.95/0.95 |
| Back-pain T1W | SAM2 | 0.94/0.94 |
| | FAMNet | 0.68/0.67 |
| | TotalSegmentator | 0.84/0.82 |
| | nnSAM2 | 0.96/0.96 |
The annotation burden is reduced by >99% relative to traditional workflows; only six slices are labeled for 19,433 test samples. A plausible implication is that this efficiency, combined with expert-level segmentation, positions nnSAM2 as a practical approach for multicenter clinical deployment and minimally supervised phenotyping (Zhang et al., 7 Oct 2025).
6. Clinical and Methodological Implications
The nnSAM architecture supports:
- Rapid and robust deployment to rare pathologies, novel modalities, or anatomies with minimal annotation.
- End-to-end learning of anatomical shape priors via curvature and distance-map supervision, promoting boundary regularity and plausibility.
- Implementation as fully automatic (prompt-free) segmentation solutions in operational settings, potentially enabling one-shot/zero-shot generalization.
Validated across MRI/CT/Dixon and multi-domain settings, nnSAM and nnSAM2 provide an empirical foundation for foundation-model-driven segmentation strategies in medical imaging, showing reliable performance gains as the sample size decreases (Li et al., 2023, Zhang et al., 7 Oct 2025).