nnSAM2: Few-Shot Medical Segmentation
- nnSAM2 is a few-shot medical image segmentation framework that uses a single manually labeled slice per dataset to achieve high annotation efficiency and robust generalizability.
- It employs a two-stage pipeline by first generating pseudo-labels with a frozen SAM2 and then refining them through a sequential three-step nnU-Net process to ensure 3D contextual accuracy.
- Empirical results demonstrate significant improvements in Dice Similarity Coefficient and statistically equivalent clinical measurements across diverse MRI and CT protocols.
nnSAM2 is a few-shot medical image segmentation framework designed for multi-modality analysis of lumbar paraspinal muscles on MRI and CT, leveraging the promptability of SAM2 and the 3D contextual awareness of nnU-Net. Its primary innovation is achieving high segmentation accuracy and quantitative agreement with clinical expert measurements using only a single manually labeled axial slice per dataset, thereby establishing new standards in annotation efficiency, generalizability, and statistical reproducibility across multi-center and multi-protocol imaging (Zhang et al., 7 Oct 2025).
1. Model Architecture and Computational Pipeline
nnSAM2 is a two-stage composition, integrating a frozen SAM2 foundation model for pseudo-label generation and a triplet of sequential nnU-Nets for robust 3D refinement.
Stage 1 – One-Prompt Pseudo-Label Generation with SAM2:
For each dataset, a single “reference” volume is selected automatically as the scan whose DINOv2 feature centroid over key L3/L4–L5/S1 slices is closest to the population mean. Only the top axial slice of this reference is manually annotated, making it the sole source of ground truth for that dataset. An interleaved sequence of slices, alternating between reference and inference volume slices, is fed into the frozen SAM2, which uses the single annotated prompt slice to produce a mask and an IoU confidence score for every target slice. The resulting masks and scores across all volumes form the pseudo-label pool.
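The reference selection and slice interleaving described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes each scan has already been reduced to one feature centroid (a plain list of floats standing in for DINOv2 embeddings), uses Euclidean distance, and interleaves slices strictly one-for-one. The function names `select_reference` and `interleave` are hypothetical.

```python
import math


def select_reference(centroids):
    """Return the index of the scan whose centroid is closest to the mean centroid."""
    n = len(centroids)
    dim = len(centroids[0])
    # Population mean of the per-scan feature centroids.
    mean = [sum(c[d] for c in centroids) / n for d in range(dim)]
    # Euclidean distance is an assumption; the source does not name the metric.
    distances = [math.dist(c, mean) for c in centroids]
    return min(range(n), key=distances.__getitem__)


def interleave(reference_slices, target_slices):
    """Alternate reference and target slices: r0, t0, r1, t1, ..."""
    sequence = []
    for ref, tgt in zip(reference_slices, target_slices):
        sequence.extend([ref, tgt])
    return sequence


# Toy 2-D features standing in for DINOv2 embeddings (hypothetical values).
toy_centroids = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
reference_index = select_reference(toy_centroids)
toy_sequence = interleave(["r0", "r1"], ["t0", "t1"])
```

In the toy data the middle scan is chosen because it lies nearest the mean of the three centroids.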
Stage 2 – Three-Step Sequential nnU-Net Refinement:
Three independent nnU-Net models are trained in series, each using filtered pseudo-labels selected by progressively stricter reliability (confidence) and anatomical smoothness metrics.
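- Step 1: The first nnU-Net is trained on the union of the top 10% of masks per dataset and the top 2% per slice, ranked by SAM2 confidence score, for 1,000 epochs using a hybrid Dice+cross-entropy loss.
- Step 2: Predictions from the first nnU-Net are retained only if they pass two filtering thresholds (values given in Zhang et al., 7 Oct 2025). The top 10% of the retained predictions by DSC are selected to train the second nnU-Net.
- Step 3: Predictions from the second nnU-Net are filtered by two stricter thresholds, and the top 20% by DSC are used to train the third nnU-Net, which is the model deployed for inference in practice.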
Pseudocode for this pipeline is provided in Zhang et al. (7 Oct 2025), which supports reproducibility.
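As a rough illustration of the Step 1 selection (and not the pseudocode from the paper), the union of a per-dataset top fraction and a per-slice top fraction could be built as below. The record layout (`id`, `dataset`, `slice_index`, `score`), the grouping of "per slice" by dataset and slice position, and the rule that every group contributes at least one mask are all assumptions made for this sketch.

```python
import math
from collections import defaultdict


def top_fraction(items, fraction):
    """Return the highest-scoring fraction of items (at least one item)."""
    k = max(1, math.ceil(fraction * len(items)))
    return sorted(items, key=lambda p: p["score"], reverse=True)[:k]


def step1_pool(pseudo_labels, dataset_frac=0.10, slice_frac=0.02):
    """Ids in the union of the per-dataset and per-slice top fractions."""
    per_dataset = defaultdict(list)
    per_slice = defaultdict(list)
    for p in pseudo_labels:
        per_dataset[p["dataset"]].append(p)
        # Assumed grouping: same dataset and same slice position.
        per_slice[(p["dataset"], p["slice_index"])].append(p)

    selected_ids = set()
    for group in per_dataset.values():
        selected_ids.update(p["id"] for p in top_fraction(group, dataset_frac))
    for group in per_slice.values():
        selected_ids.update(p["id"] for p in top_fraction(group, slice_frac))
    return selected_ids


# Toy pool: 20 masks from one dataset, two slice positions, rising scores.
toy_pool = [
    {"id": i, "dataset": "A", "slice_index": i % 2, "score": i / 20}
    for i in range(20)
]
toy_selection = step1_pool(toy_pool)
```

Taking a union rather than an intersection keeps masks that are strong either globally or for their own slice position, so weaker slice positions are not dropped from the first training set.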
2. Loss Functions, Metrics, and Statistical Equivalence Analysis
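The nnU-Net refiners employ a hybrid loss: $\mathcal{L} = \mathcal{L}_{\mathrm{Dice}} + \lambda\,\mathcal{L}_{\mathrm{CE}}$, where
$$\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2\sum_i p_i g_i}{\sum_i p_i + \sum_i g_i}, \qquad \mathcal{L}_{\mathrm{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\left[g_i \log p_i + (1 - g_i)\log(1 - p_i)\right],$$
with $p_i$ the predicted foreground probability and $g_i$ the label at voxel $i$, and $\lambda$ controls the tradeoff.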
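A hybrid Dice plus cross-entropy loss of this kind can be sketched in plain Python for the binary case. This is an illustrative sketch only: nnU-Net's own implementation is multi-class and tensor-based, and the placement of the weight on the cross-entropy term (`ce_weight`), the smoothing constants, and the function names are assumptions.

```python
import math


def dice_loss(probs, target, eps=1e-6):
    """Soft Dice loss for flat sequences of probabilities and 0/1 labels."""
    intersection = sum(p * g for p, g in zip(probs, target))
    denominator = sum(probs) + sum(target)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def cross_entropy_loss(probs, target, eps=1e-12):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for p, g in zip(probs, target):
        p = min(max(p, eps), 1.0 - eps)
        total -= g * math.log(p) + (1 - g) * math.log(1.0 - p)
    return total / len(probs)


def hybrid_loss(probs, target, ce_weight=1.0):
    """Dice loss plus weighted cross-entropy (weight placement is assumed)."""
    return dice_loss(probs, target) + ce_weight * cross_entropy_loss(probs, target)


labels = [1, 0, 1, 0]
perfect = hybrid_loss([1.0, 0.0, 1.0, 0.0], labels)
uncertain = hybrid_loss([0.6, 0.4, 0.6, 0.4], labels)
```

The Dice term rewards region overlap and is less sensitive to class imbalance, while the cross-entropy term gives a per-voxel gradient; the weight trades one against the other.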
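Segmentation quality is quantified via the Dice Similarity Coefficient: $\mathrm{DSC}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$, where $A$ and $B$ are the predicted and reference masks.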
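For concreteness, the Dice Similarity Coefficient of two binary masks can be computed as in the sketch below. Masks are assumed to be flattened into equal-length sequences of 0/1 values, and returning 1.0 when both masks are empty is a convention chosen here, not something stated in the source.

```python
def dice_coefficient(pred, truth):
    """DSC = 2 * |A and B| / (|A| + |B|) for flat binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Convention (assumed): two empty masks agree perfectly.
    return 1.0 if size_sum == 0 else 2.0 * intersection / size_sum


# One shared foreground voxel, mask sizes 2 and 1: DSC = 2 * 1 / (2 + 1).
toy_dsc = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
```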
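Statistical equivalence between automated and manual composition metrics (muscle volume, fat ratio, CT attenuation) is assessed with Two One-Sided Tests (TOST), using equivalence margins $\pm\delta$ and associated tests on mean measurement differences $\Delta$: $H_0\colon \Delta \le -\delta \ \text{or}\ \Delta \ge \delta$ versus $H_1\colon -\delta < \Delta < \delta$.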
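The TOST logic can be illustrated with a small standard-library sketch on paired differences (automated minus manual). Note the simplification: it uses a normal approximation rather than the t distribution with n - 1 degrees of freedom that a paired TOST would normally use, so it is only reasonable for larger samples. The function name `tost_paired` and the toy data are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def tost_paired(differences, margin):
    """Return the TOST p-value: the larger of the two one-sided p-values.

    Normal approximation (an assumption made to stay within the standard
    library); equivalence is claimed when the result is below alpha.
    """
    n = len(differences)
    standard_error = stdev(differences) / math.sqrt(n)
    d_bar = mean(differences)
    z_lower = (d_bar + margin) / standard_error  # tests H0: mean <= -margin
    z_upper = (d_bar - margin) / standard_error  # tests H0: mean >= +margin
    p_lower = 1.0 - NormalDist().cdf(z_lower)
    p_upper = NormalDist().cdf(z_upper)
    return max(p_lower, p_upper)


# Hypothetical paired differences between automated and manual measurements.
toy_differences = [0.1, -0.2, 0.05, -0.1, 0.15, 0.0, -0.05, 0.1]
p_wide = tost_paired(toy_differences, margin=1.0)
p_narrow = tost_paired(toy_differences, margin=0.01)
```

With the wide margin both one-sided nulls are rejected and equivalence is supported; with the narrow margin the same data cannot establish equivalence, which shows how strongly the conclusion depends on the chosen margin.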
Reproducibility is further quantified by the intraclass correlation coefficient: $\mathrm{ICC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2}$, where $\sigma_b^2$ and $\sigma_w^2$ denote between- and within-subject variances, respectively.
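One way to estimate this ratio from data is a one-way random-effects ANOVA decomposition, sketched below. The source does not say which ICC form was used, so the one-way estimator, the clamping of a negative between-subject estimate to zero, and the name `icc_oneway` are assumptions.

```python
def icc_oneway(ratings):
    """ICC from per-subject measurement lists, each holding k >= 2 values."""
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(r) for r in ratings) / (n * k)
    subject_means = [sum(r) / k for r in ratings]
    # Mean squares between and within subjects (one-way ANOVA).
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for r, m in zip(ratings, subject_means) for x in r
    ) / (n * (k - 1))
    var_between = max((ms_between - ms_within) / k, 0.0)
    var_within = ms_within
    return var_between / (var_between + var_within)


# Each inner list: [manual, automated] measurement for one subject (toy values).
identical = icc_oneway([[10, 10], [20, 20], [30, 30]])
close = icc_oneway([[10, 12], [20, 18], [30, 31]])
```

When the two measurements agree exactly for every subject the within-subject variance is zero and the ICC is 1; small disagreements relative to the spread between subjects keep it close to 1.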
3. Dataset Composition, Few-Shot Protocol, and Preprocessing
Six datasets (four MRI, two CT) encompass 1,219 scans and 19,439 slices, incorporating multiple modalities, contrasts, and protocols:
- MRI: AFL T2W, Back-pain T2W, Back-pain T1W, AGBRESA Dixon
- CT: TotalSegmentator multi-protocol, WORD contrast-enhanced
Only one manually labeled axial slice per dataset is used; all other slices serve for testing. Preprocessing is modality-specific: MRI data are intensity-clipped and rescaled, while CT data are clipped in Hounsfield units and downsampled to ~5 mm through-plane resolution before being resized to a fixed in-plane size. Interleaved slicing strategies inject implicit 3D context into the sequential transformer of SAM2 during pseudo-labeling.
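The clip-and-rescale step shared by both modalities can be sketched as below. The clipping bounds are left as required arguments because the source does not state them; the window used in the usage lines is purely hypothetical, and `clip_and_rescale` is an illustrative name.

```python
def clip_and_rescale(values, lower, upper):
    """Clip intensities to [lower, upper], then map them linearly to [0, 1]."""
    if upper <= lower:
        raise ValueError("upper bound must exceed lower bound")
    span = upper - lower
    return [(min(max(v, lower), upper) - lower) / span for v in values]


# Hypothetical CT window in Hounsfield units, chosen only for illustration.
toy_ct = clip_and_rescale([-500, 0, 100, 300], lower=-100, upper=300)
```

For MRI, where intensities have no absolute scale, the bounds would typically come from per-volume statistics such as percentiles rather than fixed values; that choice is likewise an assumption here.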
4. Empirical Results: Segmentation and Measurement Concordance
Segmentation Accuracy (DSC):
- MRI (DSC, mean ± SD): AFL T2W, 0.95 ± 0.02; AGBRESA Dixon, 0.94 ± 0.02; Back-pain T1W, 0.96 ± 0.01; Back-pain T2W, 0.96 ± 0.01.
- CT: TotalSegmentator, 0.92–0.93 ± 0.02; WORD, 0.92 ± 0.02.
Relative to strong baselines, nnSAM2 offers absolute gains of 0.02–0.03 over vanilla SAM2, 0.17–0.29 over FAMNet, and 0.03–0.15 over TotalSegmentator.
Composition Measurement Equivalence and Reliability:
- Muscle volume (MRI): MAE = 4.35–15.47 mL, min $\delta$% = 0.34–8.60%, ICC = 0.86–1.00.
- Fat ratio (Dixon): MAE = 0.0089–0.0108, $\delta$% ≤ 4.86%, ICC = 0.96–0.97.
- Muscle volume (CT): MAE = 7.79–12.66 mL, $\delta$% = 7.26–13.25%, ICC = 0.90–0.93.
- CT attenuation: MAE = 1.97–4.33 HU, $\delta$% = 9.94–13.07%, ICC = 0.92–0.99.
All TOST p-values are < 0.05, confirming statistical equivalence with expert-derived metrics (Zhang et al., 7 Oct 2025).
5. Implementation, Efficiency, and Reproducibility
All code, pre-trained weights, and manual labels are openly disseminated (https://github.com/johnnydfci/nnSAM2). The pipeline is built on Ubuntu 18.04, with a single NVIDIA RTX 3060 GPU, using Python 3.10, nnU-Net v1.7.1, and a PyTorch backend. Each nnU-Net is trained for 1,000 epochs, disabling mirroring augmentation to preserve anatomical orientation.
Annotation efficiency is high: six expert-segmented slices cover six geographically and biophysically diverse datasets. Generalizability is empirically validated on multi-modal, multi-center, and multi-population data, encompassing both MRI and CT acquisitions and multiple clinical protocols.
6. Limitations and Prospective Development
Main limitations include reliance on single-expert labels (albeit reviewed by a clinical professor), harmonization of LES/MF muscle classes across modalities (precluding fine-grained distinctions), and potential underperformance with extremely sparse 3D coverage—where direct SAM2 inference may serve as a fallback.
Proposed future directions involve extension to other anatomic sites, integration with emerging 3D promptable foundation models (e.g., video-based SAM2), exploration of semi-supervised variants with annotated full 3D volumes, and automated optimization of prompt reference selection in feature space.
7. Conceptual Significance and Broader Impact
nnSAM2 demonstrates that state-of-the-art, cross-modality segmentation and automated clinical index measurement are achievable with a radically constrained manual annotation budget by combining a promptable foundation model with three-step pseudo-label refinement. The results indicate that single-prompt, foundation model–based pipelines can deliver high annotation efficiency, robust generalizability, and quantitative reliability for few-shot medical image analysis (Zhang et al., 7 Oct 2025).