nnSAM2: Few-Shot Medical Segmentation

Updated 16 June 2026

nnSAM2 is a few-shot medical image segmentation framework that uses a single manually labeled slice per dataset to achieve high annotation efficiency and robust generalizability.
It employs a two-stage pipeline by first generating pseudo-labels with a frozen SAM2 and then refining them through a sequential three-step nnU-Net process to ensure 3D contextual accuracy.
Empirical results demonstrate significant improvements in Dice Similarity Coefficient and statistically equivalent clinical measurements across diverse MRI and CT protocols.

nnSAM2 is a few-shot medical image segmentation framework designed for multi-modality analysis of lumbar paraspinal muscles on MRI and CT, leveraging the promptability of SAM2 and the 3D contextual awareness of nnU-Net. Its primary innovation is achieving high segmentation accuracy and quantitative agreement with clinical expert measurements using only a single manually labeled axial slice per dataset, thereby establishing new standards in annotation efficiency, generalizability, and statistical reproducibility across multi-center and multi-protocol imaging (Zhang et al., 7 Oct 2025).

1. Model Architecture and Computational Pipeline

nnSAM2 is a two-stage composition, integrating a frozen SAM2 foundation model for pseudo-label generation and a triplet of sequential nnU-Nets for robust 3D refinement.

Stage 1 – One-Prompt Pseudo-Label Generation with SAM2:

For each dataset $d$, a single “reference” volume is automatically selected by identifying the scan whose DINOv2 feature centroid over key L3/L4–L5/S1 slices is closest to the population mean. Only the top axial slice of this reference is manually annotated, making this the sole source of ground truth in $d$. An interleaved sequence of slices (alternating between reference and inference volume slices) is fed into frozen SAM2, which, using the single annotated prompt slice, produces a mask $M_f$ and an IoU confidence score $s_f$ for every target slice $f$. The resulting collection $\{M_f, s_f\}$ across all volumes forms the pseudo-label pool.
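The reference selection and slice interleaving described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Euclidean distance between feature centroids and a strict one-to-one alternation of slices, and the function names and toy 2-D "features" are hypothetical stand-ins for real DINOv2 embeddings.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def select_reference(centroids):
    """Return the scan id whose centroid is closest to the population mean.

    Euclidean distance is an assumption; the source only says "closest".
    """
    population_mean = mean_vector(list(centroids.values()))
    return min(centroids, key=lambda sid: math.dist(centroids[sid], population_mean))

def interleave(reference_slices, target_slices):
    """Alternate reference and target slices: r0, t0, r1, t1, ...

    zip() stops at the shorter input, so unequal lengths are truncated.
    """
    sequence = []
    for ref, tgt in zip(reference_slices, target_slices):
        sequence.extend([ref, tgt])
    return sequence

# Toy per-scan feature centroids (hypothetical values).
centroids = {"scan_a": [0.0, 0.0], "scan_b": [1.0, 1.0], "scan_c": [4.0, 4.0]}
reference_id = select_reference(centroids)
sequence = interleave(["r0", "r1"], ["t0", "t1"])
```

In this toy case the population mean lies nearest to `scan_b`, so that scan would supply the single annotated prompt slice.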

Stage 2 – Three-Step Sequential nnU-Net Refinement:

Three separate nnU-Net models are trained in sequence, each on pseudo-labels filtered by progressively stricter reliability (confidence) and anatomical-smoothness criteria.

Step 1: The first nnU-Net ($\textrm{nnU-Net}_1$) is trained on the union of the top 10% of masks (by $s_f$) per dataset and the top 2% by $s_f$ per slice, for 1,000 epochs using a hybrid Dice+cross-entropy loss.
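Step 2: $\textrm{nnU-Net}_1$ predictions are retained only if they pass both the reliability (confidence) filter and the anatomical-smoothness filter. Of these, the top 10% by DSC are selected to train $\textrm{nnU-Net}_2$.
Step 3: $\textrm{nnU-Net}_2$ predictions are subject to stricter versions of the same two filters, with the top 20% by DSC used to train $\textrm{nnU-Net}_3$, the model deployed for inference in practice.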
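The Step 1 selection rule (top 10% by confidence per dataset, united with the top 2% per slice) might look like the sketch below. Several details are assumptions rather than facts from the source: "per slice" is read here as "per (dataset, slice index) group", fractions are rounded to the nearest integer with a floor of one mask per group, and the record fields (`id`, `dataset`, `slice_index`, `score`) are hypothetical.

```python
from collections import defaultdict

def top_fraction(records, fraction):
    """Keep the highest-scoring records; group size is rounded, minimum one."""
    k = max(1, round(fraction * len(records)))
    return sorted(records, key=lambda r: r["score"], reverse=True)[:k]

def select_step1(records, dataset_fraction=0.10, slice_fraction=0.02):
    """Union of the per-dataset and per-slice top fractions by confidence."""
    by_dataset = defaultdict(list)
    by_slice = defaultdict(list)
    for r in records:
        by_dataset[r["dataset"]].append(r)
        # Assumed grouping: same dataset and same slice index.
        by_slice[(r["dataset"], r["slice_index"])].append(r)

    selected = set()
    for group in by_dataset.values():
        selected.update(r["id"] for r in top_fraction(group, dataset_fraction))
    for group in by_slice.values():
        selected.update(r["id"] for r in top_fraction(group, slice_fraction))
    return selected

# Ten toy pseudo-labels: two per slice index, with increasing confidence.
records = [
    {"id": i, "dataset": "mri", "slice_index": i % 5, "score": 0.5 + 0.04 * i}
    for i in range(10)
]
selected_ids = select_step1(records)
```

With such a small pool the "at least one per group" floor dominates, so the toy example keeps the better of the two masks at each slice index. On a realistically sized pool the percentages would govern instead.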

Zhang et al. (7 Oct 2025) provide explicit pseudocode for this pipeline, which supports reproducibility.

2. Loss Functions, Metrics, and Statistical Equivalence Analysis

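The nnU-Net refiners employ a hybrid loss: $\mathcal{L} = \mathcal{L}_{\mathrm{Dice}} + \lambda\,\mathcal{L}_{\mathrm{CE}}$ where, with $p_i$ the predicted foreground probability and $g_i$ the pseudo-label at voxel $i$ of $N$,

$$\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2\sum_i p_i g_i}{\sum_i p_i + \sum_i g_i}, \qquad \mathcal{L}_{\mathrm{CE}} = -\frac{1}{N}\sum_i \left[ g_i \log p_i + (1 - g_i)\log(1 - p_i) \right]$$

and $\lambda$ controls the tradeoff.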
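A hybrid Dice plus cross-entropy loss can be illustrated in a few lines of plain Python. This is a binary, per-voxel sketch under assumptions: the additive weighting with a single weight `lam` is assumed, the smoothing constants are arbitrary, and nnU-Net's own loss is multi-class and computed on batches, so this is not its implementation.

```python
import math

def dice_loss(probs, labels, eps=1e-6):
    """Soft Dice loss: 1 - 2*sum(p*g) / (sum(p) + sum(g))."""
    intersection = sum(p * g for p, g in zip(probs, labels))
    return 1.0 - (2.0 * intersection + eps) / (sum(probs) + sum(labels) + eps)

def cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clamped away from 0 and 1."""
    total = 0.0
    for p, g in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= g * math.log(p) + (1 - g) * math.log(1.0 - p)
    return total / len(probs)

def hybrid_loss(probs, labels, lam=1.0):
    """Dice loss plus lam times cross-entropy (weighting form is an assumption)."""
    return dice_loss(probs, labels) + lam * cross_entropy(probs, labels)

labels = [1, 1, 0, 0]
good = hybrid_loss([0.9, 0.8, 0.1, 0.2], labels)  # confident and correct
bad = hybrid_loss([0.4, 0.3, 0.6, 0.7], labels)   # mostly wrong
```

The Dice term rewards overlap regardless of foreground size, while the cross-entropy term gives a per-voxel gradient signal, which is the usual motivation for combining the two.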

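Segmentation quality is quantified via the Dice Similarity Coefficient between a predicted mask $A$ and a reference mask $B$: $\mathrm{DSC}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$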
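For concreteness, the Dice Similarity Coefficient for two binary masks can be computed as below. The masks are flattened to lists of 0/1 values, and returning 1.0 when both masks are empty is a common convention assumed here, not something stated in the source.

```python
def dice_coefficient(mask_a, mask_b):
    """DSC = 2*|A and B| / (|A| + |B|) for flat binary masks of equal length."""
    size_a = sum(mask_a)
    size_b = sum(mask_b)
    if size_a + size_b == 0:
        return 1.0  # assumed convention: two empty masks agree perfectly
    overlap = sum(a * b for a, b in zip(mask_a, mask_b))
    return 2.0 * overlap / (size_a + size_b)

pred = [1, 1, 1, 0, 0, 0]
truth = [0, 1, 1, 1, 0, 0]
score = dice_coefficient(pred, truth)  # 2 shared voxels, 3 + 3 total
```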

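Statistical equivalence between automated and manual composition metrics (muscle volume, fat ratio, CT attenuation) is assessed with Two One-Sided Tests (TOST), using equivalence margins $\pm\delta$ and associated tests on mean measurement differences $\Delta$: $H_{0}^{(1)}: \Delta \le -\delta$ and $H_{0}^{(2)}: \Delta \ge +\delta$. Equivalence is concluded only when both null hypotheses are rejected.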
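The TOST logic can be sketched with the standard library alone. Note the simplification, which is an assumption of this example: the two one-sided tests below use a normal (z) approximation to the sampling distribution, whereas a paired-sample analysis on small cohorts would normally use the t distribution. The measurement values are made up for illustration.

```python
import math
from statistics import NormalDist, mean, stdev

def tost_paired(auto, manual, margin):
    """Return the TOST p-value (the larger of the two one-sided p-values).

    Uses a z approximation; does not handle zero variance in the differences.
    """
    diffs = [a - m for a, m in zip(auto, manual)]
    se = stdev(diffs) / math.sqrt(len(diffs))
    d_bar = mean(diffs)
    z_lower = (d_bar + margin) / se  # tests H0: Delta <= -margin
    z_upper = (d_bar - margin) / se  # tests H0: Delta >= +margin
    p_lower = 1.0 - NormalDist().cdf(z_lower)
    p_upper = NormalDist().cdf(z_upper)
    return max(p_lower, p_upper)

# Hypothetical muscle volumes in mL (manual vs. automated).
manual = [100, 120, 95, 110, 130, 105, 115, 98]
auto = [101, 119, 96, 109, 131, 104, 116, 99]

p_wide = tost_paired(auto, manual, margin=5.0)    # generous margin
p_narrow = tost_paired(auto, manual, margin=0.1)  # margin tighter than the noise
```

Equivalence is claimed only when the returned p-value falls below the chosen significance level, so a wide margin is easy to satisfy and a very tight one is not.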

Reproducibility is further quantified by the intraclass correlation coefficient: $\mathrm{ICC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2}$ where $\sigma_b^2$ and $\sigma_w^2$ denote between- and within-subject variances, respectively.
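One way to estimate the two variance components is a one-way random-effects ANOVA, sketched below. Which ICC form the study used is not stated in this text, so the one-way model (often written ICC(1,1)) is an assumption, as are the toy measurements.

```python
from statistics import mean

def icc_oneway(ratings):
    """One-way random-effects ICC from per-subject lists of k measurements."""
    n = len(ratings)
    k = len(ratings[0])
    grand = mean([x for row in ratings for x in row])
    subject_means = [mean(row) for row in ratings]

    # Mean squares between and within subjects.
    ms_between = k * sum((m - grand) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, subject_means) for x in row
    ) / (n * (k - 1))

    # Variance components implied by the expected mean squares.
    var_between = (ms_between - ms_within) / k
    var_within = ms_within
    return var_between / (var_between + var_within)

# Each row: [manual, automated] measurement for one subject (hypothetical mL).
ratings = [[100, 101], [120, 119], [95, 96], [110, 109]]
icc = icc_oneway(ratings)
```

Here the spread between subjects dwarfs the disagreement within each pair, so the estimate comes out close to 1.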

3. Dataset Composition, Few-Shot Protocol, and Preprocessing

Six datasets (four MR, two CT) encompass 1,219 scans and 19,439 slices, incorporating multiple modalities, contrasts, and protocols:

MRI: AFL T2W, Back-pain T2W, Back-pain T1W, AGBRESA Dixon
CT: TotalSegmentator multi-protocol, WORD contrast-enhanced

Only one manually labeled axial slice per dataset is used; all other slices serve for testing. Preprocessing is modality-specific: MRI data are intensity-clipped and rescaled, whereas CT data are clipped in Hounsfield units and downsampled to ~5 mm through-plane resolution before resizing to a fixed in-plane size. Interleaved slicing strategies inject implicit 3D context into the sequential transformer of SAM2 during pseudo-labeling.
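The clip-and-rescale step common to both modalities can be illustrated as follows. The window bounds in the example are hypothetical placeholders, since the actual clipping ranges are not given in this text, and the function name is illustrative.

```python
def clip_and_rescale(values, low, high):
    """Clip intensities to [low, high], then map linearly onto [0, 1]."""
    span = high - low
    return [(min(max(v, low), high) - low) / span for v in values]

# Hypothetical CT window of [-200, 300] HU; not taken from the source.
ct_values = [-1000, -200, 50, 300, 1200]
ct_scaled = clip_and_rescale(ct_values, low=-200, high=300)
```

Values outside the window saturate at 0 or 1, which keeps air and dense bone from dominating the intensity range seen by the model.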

4. Empirical Results: Segmentation and Measurement Concordance

Segmentation Accuracy (DSC):

MRI (DSC, mean ± SD): AFL T2W, 0.95 ± 0.02; AGBRESA Dixon, 0.94 ± 0.02; Back-pain T1W, 0.96 ± 0.01; Back-pain T2W, 0.96 ± 0.01.
CT: TotalSegmentator, 0.92–0.93 ± 0.02; WORD, 0.92 ± 0.02.

Relative to strong baselines, nnSAM2 offers absolute gains of 0.02–0.03 over vanilla SAM2, 0.17–0.29 over FAMNet, and 0.03–0.15 over TotalSegmentator.

Composition Measurement Equivalence and Reliability:

Muscle volume (MRI): MAE = 4.35–15.47 mL, min $\delta$% = 0.34–8.60%, ICC = 0.86–1.00.
Fat ratio (Dixon): MAE = 0.0089–0.0108, $\delta$% ≤ 4.86%, ICC = 0.96–0.97.
Muscle volume (CT): MAE = 7.79–12.66 mL, $\delta$% = 7.26–13.25%, ICC = 0.90–0.93.
CT attenuation: MAE = 1.97–4.33 HU, $\delta$% = 9.94–13.07%, ICC = 0.92–0.99.

All TOST $p$-values are $< 0.05$, confirming statistical equivalence with expert-derived metrics (Zhang et al., 7 Oct 2025).

5. Implementation, Efficiency, and Reproducibility

All code, pre-trained weights, and manual labels are openly available (https://github.com/johnnydfci/nnSAM2). The pipeline was developed on Ubuntu 18.04 with a single NVIDIA RTX 3060 GPU, using Python 3.10, nnU-Net v1.7.1, and a PyTorch backend. Each nnU-Net is trained for 1,000 epochs with mirroring augmentation disabled to preserve anatomical orientation.

Annotation efficiency is high: six expert-segmented slices cover six geographically and biophysically diverse datasets. Generalizability is empirically validated on multi-modal, multi-center, and multi-population data, encompassing both MRI and CT acquisitions and multiple clinical protocols.

6. Limitations and Prospective Development

Main limitations include reliance on single-expert labels (albeit reviewed by a clinical professor), harmonization of LES/MF muscle classes across modalities (precluding fine-grained distinctions), and potential underperformance with extremely sparse 3D coverage—where direct SAM2 inference may serve as a fallback.

Proposed future directions involve extension to other anatomic sites, integration with emerging 3D promptable foundation models (e.g., video-based SAM2), exploration of semi-supervised variants with annotated full 3D volumes, and automated optimization of prompt reference selection in feature space.

7. Conceptual Significance and Broader Impact

nnSAM2 demonstrates that state-of-the-art, cross-modality segmentation and automated clinical index measurement are achievable with a radically constrained manual annotation budget by leveraging promptable foundation models and three-step pseudo-label refinement. The results indicate that single-prompt, foundation model–based pipelines can deliver high annotation efficiency, robust generalizability, and quantitative reliability in few-shot medical image analysis (Zhang et al., 7 Oct 2025).


References (1)

nnSAM2: nnUNet-Enhanced One-Prompt SAM2 for Few-shot Multi-Modality Segmentation and Composition Analysis of Lumbar Paraspinal Muscles (2025)
