nnSAM2: Few-Shot Medical Segmentation

Updated 16 June 2026
  • nnSAM2 is a few-shot medical image segmentation framework that uses a single manually labeled slice per dataset to achieve high annotation efficiency and robust generalizability.
  • It employs a two-stage pipeline: a frozen SAM2 first generates pseudo-labels, which a sequential three-step nnU-Net process then refines to capture 3D context.
  • Empirical results demonstrate significant improvements in Dice Similarity Coefficient and statistically equivalent clinical measurements across diverse MRI and CT protocols.

nnSAM2 is a few-shot medical image segmentation framework designed for multi-modality analysis of lumbar paraspinal muscles on MRI and CT, leveraging the promptability of SAM2 and the 3D contextual awareness of nnU-Net. Its primary innovation is achieving high segmentation accuracy and quantitative agreement with clinical expert measurements using only a single manually labeled axial slice per dataset, thereby establishing new standards in annotation efficiency, generalizability, and statistical reproducibility across multi-center and multi-protocol imaging (Zhang et al., 7 Oct 2025).

1. Model Architecture and Computational Pipeline

nnSAM2 is a two-stage composition, integrating a frozen SAM2 foundation model for pseudo-label generation and a triplet of sequential nnU-Nets for robust 3D refinement.

Stage 1 – One-Prompt Pseudo-Label Generation with SAM2:

For each dataset $d$, a single “reference” volume is automatically selected by identifying the scan whose DINOv2 feature centroid over key L3/L4–L5/S1 slices is closest to the population mean. Only the top axial slice of this reference is manually annotated, making this the sole source of ground truth in $d$. An interleaved sequence of slices, alternating between reference and inference volume slices, is fed into frozen SAM2, which, using the single annotated prompt slice, produces a mask $M_f$ and an IoU confidence score $s_f$ for every target slice $f$. The resulting collection $\{M_f, s_f\}$ across all volumes forms the pseudo-label pool.
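The reference-selection rule can be illustrated with a minimal Python sketch. It assumes each volume is already summarized by per-slice feature vectors (plain lists standing in for DINOv2 embeddings), that the population mean is the mean of the per-volume centroids, and that closeness is measured by Euclidean distance. The function names are hypothetical and are not taken from the nnSAM2 code.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_reference(volume_features):
    """Return the id of the volume whose feature centroid is nearest the population mean.

    volume_features maps a volume id to a list of per-slice feature vectors
    (assumed stand-ins for DINOv2 embeddings of the key slices).
    """
    centroids = {vid: mean_vector(feats) for vid, feats in volume_features.items()}
    # Assumption: the population mean is the mean of the per-volume centroids.
    population_mean = mean_vector(list(centroids.values()))
    return min(centroids, key=lambda vid: math.dist(centroids[vid], population_mean))
```

Called on a dictionary of volumes, `select_reference` returns the id of the scan that would be sent for single-slice annotation under these assumptions.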

Stage 2 – Three-Step Sequential nnU-Net Refinement:

Three independent nnU-Net models are trained in series, each using filtered pseudo-labels selected by progressively stricter reliability (confidence) and anatomical smoothness metrics.

  • Step 1: The first nnU-Net ($\textrm{nnU-Net}_1$) is trained on the union of the top 10% of masks (by $s_f$) per dataset and the top 2% by $s_f$ per slice, for 1,000 epochs using a hybrid Dice+cross-entropy loss.
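  • Step 2: $\textrm{nnU-Net}_1$ predictions are retained only if they pass two filtering criteria [expressions missing in the source]. The top 10% by DSC are selected to train $\textrm{nnU-Net}_2$.
  • Step 3: $\textrm{nnU-Net}_2$ predictions are subject to two further criteria [expressions missing in the source], with the top 20% by DSC used to train $\textrm{nnU-Net}_3$, the model deployed for inference in practice.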
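The Step 1 selection (top 10% per dataset, united with top 2% per slice) might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: pseudo-labels are represented as dictionaries with hypothetical keys `id`, `dataset`, `slice`, and `score`; "per slice" is interpreted as per slice position within a dataset; and each group always contributes at least one mask.

```python
import math
from collections import defaultdict


def top_fraction(records, fraction):
    """Keep the highest-scoring ceil(fraction * n) records, and at least one."""
    k = max(1, math.ceil(fraction * len(records)))
    return sorted(records, key=lambda r: r["score"], reverse=True)[:k]


def select_step1_pool(records, dataset_frac=0.10, slice_frac=0.02):
    """Union of the top masks per dataset and the top masks per slice position."""
    by_dataset = defaultdict(list)
    by_slice = defaultdict(list)
    for r in records:
        by_dataset[r["dataset"]].append(r)
        # Assumption: "per slice" groups masks by slice position within a dataset.
        by_slice[(r["dataset"], r["slice"])].append(r)

    selected = {}  # keyed by id so the union contains no duplicates
    for group in by_dataset.values():
        for r in top_fraction(group, dataset_frac):
            selected[r["id"]] = r
    for group in by_slice.values():
        for r in top_fraction(group, slice_frac):
            selected[r["id"]] = r
    return list(selected.values())
```

Under this reading, the per-slice term keeps low-confidence slice positions represented in the training pool even when the per-dataset ranking is dominated by other positions.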

Zhang et al. (7 Oct 2025) provide explicit pseudocode for this pipeline, which supports reproducibility.

2. Loss Functions, Metrics, and Statistical Equivalence Analysis

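The nnU-Net refiners employ a hybrid loss: $\mathcal{L} = \mathcal{L}_{\mathrm{Dice}} + \lambda\,\mathcal{L}_{\mathrm{CE}}$ where

$$\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{2\sum_{i,c} p_{i,c}\, g_{i,c}}{\sum_{i,c} p_{i,c} + \sum_{i,c} g_{i,c}}, \qquad \mathcal{L}_{\mathrm{CE}} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c} g_{i,c}\log p_{i,c}$$

with $p_{i,c}$ the predicted probability and $g_{i,c}$ the label of voxel $i$ for class $c$, and $\lambda$ controls the tradeoff.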
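A hybrid Dice plus cross-entropy loss of this kind can be written out in a few lines. The sketch below is a binary, pure-Python simplification, assuming a weighted sum with a hypothetical weight `lam` and small smoothing constants; nnU-Net's own multi-class implementation differs in detail.

```python
import math


def soft_dice_loss(probs, targets, eps=1e-6):
    """1 - soft Dice between predicted foreground probabilities and 0/1 targets."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denominator = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + eps) / (denominator + eps)


def cross_entropy_loss(probs, targets, eps=1e-12):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def hybrid_loss(probs, targets, lam=1.0):
    """Dice term plus lam times the cross-entropy term (lam is an assumed weight)."""
    return soft_dice_loss(probs, targets) + lam * cross_entropy_loss(probs, targets)
```

The Dice term rewards overlap regardless of class imbalance, while the cross-entropy term supplies a per-voxel gradient, which is the usual motivation for combining the two.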

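Segmentation quality is quantified via the Dice Similarity Coefficient: $\mathrm{DSC}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$, where $A$ and $B$ are the predicted and reference masks.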
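As a concrete illustration, the Dice Similarity Coefficient for two binary masks can be computed as below. Masks are given as flat sequences of 0/1 values; returning 1.0 when both masks are empty is a convention assumed here, not something stated in the source.

```python
def dice_coefficient(pred, truth):
    """Dice Similarity Coefficient between two equally sized binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of elements")
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * overlap / total
```

For example, masks `[1, 1, 0, 0]` and `[1, 0, 1, 0]` share one foreground element out of four foreground elements in total, giving a DSC of 0.5.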

Statistical equivalence between automated and manual composition metrics (muscle volume, fat ratio, CT attenuation) is assessed with Two One-Sided Tests (TOST), using equivalence margins $\pm\delta$ and associated tests on mean measurement differences $\Delta$: $H_{01}: \Delta \le -\delta$ and $H_{02}: \Delta \ge \delta$, with equivalence concluded when both null hypotheses are rejected.
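The TOST procedure can be sketched for paired measurements as follows. To stay within the Python standard library, this version swaps the usual t-distribution for a large-sample normal approximation, so its p-values are only approximate for small samples. The function name and the `margin` argument are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def tost_paired(auto, manual, margin):
    """Two One-Sided Tests on paired differences, using a normal approximation.

    Returns the mean difference and the TOST p-value (the larger of the two
    one-sided p-values). Equivalence is claimed when this p-value is below alpha.
    """
    diffs = [a - m for a, m in zip(auto, manual)]
    d_bar = mean(diffs)
    se = stdev(diffs) / math.sqrt(len(diffs))
    if se == 0:
        raise ValueError("differences have zero variance; the test is undefined")
    z = NormalDist()
    p_lower = 1.0 - z.cdf((d_bar + margin) / se)  # H0: mean difference <= -margin
    p_upper = z.cdf((d_bar - margin) / se)        # H0: mean difference >= +margin
    return d_bar, max(p_lower, p_upper)
```

Taking the larger of the two one-sided p-values means equivalence is declared only when the mean difference is shown to lie above the lower margin and below the upper margin.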

Reproducibility is further quantified by the intraclass correlation coefficient: $\mathrm{ICC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2}$ where $\sigma_b^2$ and $\sigma_w^2$ denote between- and within-subject variances, respectively.
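The ratio of between-subject variance to total variance can be estimated from one-way ANOVA mean squares. The sketch below assumes the one-way random-effects form of the ICC with the same number of measurements per subject (for instance, one automated and one manual value); the source does not state which ICC variant was used.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC from a list of per-subject measurement lists."""
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(r) for r in ratings) / (n * k)
    subject_means = [sum(r) / k for r in ratings]

    # Between- and within-subject mean squares from one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for r, m in zip(ratings, subject_means) for x in r
    ) / (n * (k - 1))

    # Variance components; a negative between-subject estimate is truncated at zero.
    var_between = max((ms_between - ms_within) / k, 0.0)
    var_within = ms_within
    if var_between + var_within == 0:
        raise ValueError("all measurements are identical; ICC is undefined")
    return var_between / (var_between + var_within)
```

When the paired measurements of every subject agree exactly, the within-subject variance vanishes and the estimate is 1.0.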

3. Dataset Composition, Few-Shot Protocol, and Preprocessing

Six datasets (four MR, two CT) encompass 1,219 scans and 19,439 slices, incorporating multiple modalities, contrasts, and protocols:

  • MRI: AFL T2W, Back-pain T2W, Back-pain T1W, AGBRESA Dixon
  • CT: TotalSegmentator multi-protocol, WORD contrast-enhanced

Only one manually labeled axial slice per dataset is used; all other slices serve for testing. Preprocessing is modality-specific: MRI data are intensity-clipped and rescaled, while CT data are clipped in Hounsfield units and downsampled to ~5 mm through-plane resolution before resizing to a fixed in-plane size [dimensions missing in the source]. Interleaved slicing strategies inject implicit 3D context into the sequential transformer of SAM2 during pseudo-labeling.
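The clipping, rescaling, and through-plane downsampling steps could look roughly like this. The intensity window, the [0, 1] output range, and the use of simple slice striding (rather than interpolation) are assumptions made for illustration; the source does not give the actual values or method.

```python
def clip_and_rescale(values, low, high):
    """Clip intensities to [low, high] and rescale linearly to [0, 1]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    span = high - low
    return [(min(max(v, low), high) - low) / span for v in values]


def downsample_through_plane(slices, spacing_mm, target_mm=5.0):
    """Keep every n-th slice so that slice spacing is roughly target_mm.

    Assumption: striding stands in for whatever resampling the authors used.
    """
    stride = max(1, round(target_mm / spacing_mm))
    return slices[::stride]
```

For a CT volume with 1.25 mm slice spacing, `downsample_through_plane` keeps every fourth slice, giving an effective spacing of 5 mm.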

4. Empirical Results: Segmentation and Measurement Concordance

Segmentation Accuracy (DSC):

  • MRI (DSC, mean ± SD): AFL T2W, 0.95 ± 0.02; AGBRESA Dixon, 0.94 ± 0.02; Back-pain T1W, 0.96 ± 0.01; Back-pain T2W, 0.96 ± 0.01.
  • CT: TotalSegmentator, 0.92–0.93 ± 0.02; WORD, 0.92 ± 0.02.

Relative to strong baselines, nnSAM2 offers absolute gains of 0.02–0.03 over vanilla SAM2, 0.17–0.29 over FAMNet, and 0.03–0.15 over TotalSegmentator.

Composition Measurement Equivalence and Reliability:

  • Muscle volume (MRI): MAE = 4.35–15.47 mL, min $\delta$% = 0.34–8.60%, ICC = 0.86–1.00.
  • Fat ratio (Dixon): MAE = 0.0089–0.0108, $\delta$% ≤ 4.86%, ICC = 0.96–0.97.
  • Muscle volume (CT): MAE = 7.79–12.66 mL, $\delta$% = 7.26–13.25%, ICC = 0.90–0.93.
  • CT attenuation: MAE = 1.97–4.33 HU, $\delta$% = 9.94–13.07%, ICC = 0.92–0.99.

All TOST $p$-values are $<$ 0.05, confirming statistical equivalence with expert-derived metrics (Zhang et al., 7 Oct 2025).

5. Implementation, Efficiency, and Reproducibility

All code, pre-trained weights, and manual labels are openly disseminated (https://github.com/johnnydfci/nnSAM2). The pipeline is built on Ubuntu 18.04, with a single NVIDIA RTX 3060 GPU, using Python 3.10, nnU-Net v1.7.1, and a PyTorch backend. Each nnU-Net is trained for 1,000 epochs, disabling mirroring augmentation to preserve anatomical orientation.

Annotation efficiency is high: six expert-segmented slices cover six geographically and biophysically diverse datasets. Generalizability is empirically validated on multi-modal, multi-center, and multi-population data, encompassing both MRI and CT acquisitions and multiple clinical protocols.

6. Limitations and Prospective Development

Main limitations include reliance on single-expert labels (albeit reviewed by a clinical professor), harmonization of LES/MF muscle classes across modalities (precluding fine-grained distinctions), and potential underperformance with extremely sparse 3D coverage—where direct SAM2 inference may serve as a fallback.

Proposed future directions involve extension to other anatomic sites, integration with emerging 3D promptable foundation models (e.g., video-based SAM2), exploration of semi-supervised variants with annotated full 3D volumes, and automated optimization of prompt reference selection in feature space.

7. Conceptual Significance and Broader Impact

nnSAM2 demonstrates that state-of-the-art, cross-modality segmentation and automated clinical index measurement are achievable with a radically constrained manual annotation budget by leveraging promptable foundation models and three-step pseudo-label refinement. It shows that single-prompt, foundation model–based pipelines can deliver high annotation efficiency, robust generalizability, and quantitative reliability, representing a new paradigm for few-shot medical image analysis (Zhang et al., 7 Oct 2025).

References (1)
