MedSAM+UNet: Hybrid Medical Segmentation
- The paper demonstrates a hybrid architecture that fuses a frozen, prompt-driven transformer encoder with an adaptive U-Net encoder to leverage both domain-agnostic features and dataset-specific refinements.
- It employs innovative techniques such as level set and curvature-based loss functions for anatomical regularization, ensuring smooth and accurate segmentation boundaries.
- The model shows superior performance in low-data regimes with improved Dice scores and boundary consistency compared to traditional U-Net and other segmentation architectures.
MedSAM+UNet denotes hybrid models or frameworks that integrate adaptations of the Segment Anything Model (SAM) for medical imaging (“MedSAM”) with the U-Net segmentation architecture or employ SAM-derived representations within U-Net-like decoders. This paradigm aims to combine the zero-shot or generalist capabilities of SAM/MedSAM’s transformer encoders with U-Net’s proven local refinement, automatic configuration, and anatomical prior learning for medical image segmentation. The following sections detail technical principles, integration strategies, loss formulations, benchmarking, limitations, and future research directions based on recent literature.
1. Hybrid Architecture: Parallel Encoders and Multi-head Decoding
MedSAM+UNet implementations fundamentally fuse two architectural streams: a frozen, prompt-driven transformer encoder (MedSAM) alongside an adaptive, data-driven U-Net encoder, with outputs concatenated for joint decoding (Li et al., 2023). Notable instantiations include nnSAM, which leverages a pretrained ViT-based MedSAM feature extractor (frozen during training) and an auto-configured nnUNet trained end-to-end. The workflow typically proceeds as follows:
- Image preprocessing and resizing to SAM’s input requirements (e.g., 1024×1024).
- SAM/MedSAM encoder: Produces robust global feature embeddings, independent of medical image modality.
- nnUNet encoder: Adapts layer depth, kernel size, normalization, and data augmentation based on the training set, ensuring dataset-specific features.
- Fusion via concatenation: Both embeddings are merged and fed into a multi-head decoder, which includes:
  - A segmentation head (Dice + cross-entropy losses)
  - A regression head for boundary shape supervision
This design allows low-sample learning by retaining SAM’s domain-agnostic features while capitalizing on nnUNet’s empirical optimization and skip connections.
| Component | Source Module | Role |
|---|---|---|
| ViT Encoder (frozen) | MedSAM/SAM | Domain-agnostic global feature maps |
| Adaptive Encoder | nnUNet | Dataset-specific, local pattern mining |
| Decoder (multi-head) | Custom | Segmentation + shape regression |
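A minimal PyTorch sketch of this fusion pattern follows. It assumes, for simplicity, that both encoders return a single (B, C, H, W) feature map; the actual nnSAM routes fused features through a full nnUNet decoder with skip connections, and the class and argument names here (`HybridSegmenter`, `sam_channels`, etc.) are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridSegmenter(nn.Module):
    """Dual-stream fusion: frozen SAM/MedSAM ViT features are concatenated
    with trainable CNN (nnUNet-style) features and decoded by two heads."""

    def __init__(self, sam_encoder: nn.Module, cnn_encoder: nn.Module,
                 sam_channels: int, cnn_channels: int, num_classes: int):
        super().__init__()
        self.sam_encoder = sam_encoder
        for p in self.sam_encoder.parameters():
            p.requires_grad = False          # generalist stream stays frozen
        self.cnn_encoder = cnn_encoder       # trained end-to-end on the dataset
        fused = sam_channels + cnn_channels
        self.seg_head = nn.Conv2d(fused, num_classes, kernel_size=1)
        self.shape_head = nn.Conv2d(fused, 1, kernel_size=1)  # level-set regression

    def forward(self, x: torch.Tensor):
        with torch.no_grad():
            f_sam = self.sam_encoder(x)      # domain-agnostic embeddings
        f_cnn = self.cnn_encoder(x)          # dataset-specific features
        # Match spatial sizes before concatenating along the channel axis.
        f_sam = F.interpolate(f_sam, size=f_cnn.shape[-2:],
                              mode="bilinear", align_corners=False)
        fused = torch.cat([f_sam, f_cnn], dim=1)
        return self.seg_head(fused), self.shape_head(fused)
```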
2. Boundary Shape Supervision: Level Set and Curvature Losses
Advanced implementations introduce anatomical regularization and shape priors via level set and curvature-based loss terms. Ground truth masks are converted to signed distance maps (level sets), and predicted segmentations are required to align with both the mask and the level set boundary (Li et al., 2023). Let $y$ be the binary mask and $\phi(y)$ the corresponding level set (signed distance map). The predicted level set $\hat{\phi}$ is matched using a mean squared error:

$$\mathcal{L}_{\text{level}} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{\phi}_i - \phi(y)_i\right)^2$$

Boundary regularization is achieved by sharpening $\hat{\phi}$ with a sigmoid $\sigma(\cdot)$ and matching curvature:

$$\mathcal{L}_{\text{curv}} = \frac{1}{N}\sum_{i=1}^{N}\left|\kappa\!\left(\sigma(\hat{\phi})\right)_i - \kappa\!\left(\sigma(\phi(y))\right)_i\right|$$

where the curvature $\kappa = \nabla \cdot \left(\nabla\phi / \lvert\nabla\phi\rvert\right)$ is computed from the first and second spatial derivatives,

$$\kappa(\phi) = \frac{\phi_{xx}\,\phi_y^2 - 2\,\phi_x\,\phi_y\,\phi_{xy} + \phi_{yy}\,\phi_x^2}{\left(\phi_x^2 + \phi_y^2\right)^{3/2}}.$$

The final loss is a weighted sum:

$$\mathcal{L} = \mathcal{L}_{\text{Dice}} + \mathcal{L}_{\text{CE}} + \lambda_1\,\mathcal{L}_{\text{level}} + \lambda_2\,\mathcal{L}_{\text{curv}}$$
This joint supervision yields anatomically plausible, smooth boundaries, which is especially beneficial for small-sample training.
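A sketch of these loss terms under the formulation above is given below. The sign convention of the distance map, the sharpening factor `k`, and the L1 curvature matching are assumptions of this sketch rather than details fixed by the paper.

```python
import numpy as np
import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

def signed_distance_map(mask: np.ndarray) -> np.ndarray:
    """Binary (H, W) mask -> signed distance map: negative inside the
    object, positive outside, ~0 on the boundary (assumed convention)."""
    return distance_transform_edt(1 - mask) - distance_transform_edt(mask)

def curvature(phi: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """kappa = div(grad(phi)/|grad(phi)|) on a (B, 1, H, W) level set,
    using central and second-order finite differences."""
    phi_x = (torch.roll(phi, -1, dims=3) - torch.roll(phi, 1, dims=3)) / 2
    phi_y = (torch.roll(phi, -1, dims=2) - torch.roll(phi, 1, dims=2)) / 2
    phi_xx = torch.roll(phi, -1, dims=3) - 2 * phi + torch.roll(phi, 1, dims=3)
    phi_yy = torch.roll(phi, -1, dims=2) - 2 * phi + torch.roll(phi, 1, dims=2)
    phi_xy = (torch.roll(phi_x, -1, dims=2) - torch.roll(phi_x, 1, dims=2)) / 2
    num = phi_xx * phi_y ** 2 - 2 * phi_x * phi_y * phi_xy + phi_yy * phi_x ** 2
    den = (phi_x ** 2 + phi_y ** 2).clamp_min(eps) ** 1.5
    return num / den

def shape_losses(phi_pred: torch.Tensor, phi_gt: torch.Tensor, k: float = 1.0):
    """Level-set MSE plus curvature matching on sigmoid-sharpened maps."""
    l_level = F.mse_loss(phi_pred, phi_gt)
    l_curv = F.l1_loss(curvature(torch.sigmoid(k * phi_pred)),
                       curvature(torch.sigmoid(k * phi_gt)))
    return l_level, l_curv
```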
3. Performance and Small-sample Learning
MedSAM+UNet architectures demonstrate superior results in low-data regimes, with pronounced improvements in Dice scores and boundary consistency relative to vanilla nnUNet, AutoSAM, and standard U-Net/Attention U-Net. Key findings include (Li et al., 2023):
- Brain white matter (MR) segmentation (20 training samples): Dice 82.77% (nnSAM) vs. 79.25% (nnUNet); average surface distance (ASD) 1.14 mm vs. 1.36 mm
- CT Heart: nnSAM Dice 94.19%, outperforming both nnUNet and classic architectures.
- Sample size studies: As training set shrinks (e.g., 5 samples), performance gap widens; nnSAM maintains robustness, confirming the benefit of SAM’s generalist features plus strong anatomical priors.
This few-shot learning capacity makes hybrid models highly applicable in limited-data clinical environments.
4. Prompt Optimization and Integration Strategies
SAM/MedSAM are prompt-driven; accuracy on medical images fluctuates with prompt type and placement. Box and centroid point prompts systematically outperform arbitrary or semantic prompts, but selecting them manually does not scale (He et al., 2023). Hybrid approaches therefore use U-Net or nnUNet to:
- Generate initial masks and bounding boxes automatically
- Provide spatial context for MedSAM’s prompt encoder
- Enable iterative refinement via dual-stage processing (see RFMedSAM 2 (Xie et al., 4 Feb 2025))
Further, dynamic support sets or pseudo-label mining can reduce prompt dependency, either via self-prompting modules or support-set guided mechanisms (Xing et al., 24 Jun 2025).
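The first strategy, deriving box prompts from a coarse U-Net mask, is easy to sketch. The helper below is hypothetical (not from the cited works): it extracts one padded bounding box per connected component of the coarse mask.

```python
import numpy as np
from scipy.ndimage import label

def boxes_from_coarse_mask(mask: np.ndarray, pad: int = 5) -> list:
    """Turn a coarse binary U-Net mask (H, W) into per-object box prompts
    ([x_min, y_min, x_max, y_max]), padded by a small safety margin."""
    labeled, n_components = label(mask > 0)
    h, w = mask.shape
    boxes = []
    for i in range(1, n_components + 1):
        ys, xs = np.nonzero(labeled == i)
        boxes.append([max(int(xs.min()) - pad, 0),
                      max(int(ys.min()) - pad, 0),
                      min(int(xs.max()) + pad, w - 1),
                      min(int(ys.max()) + pad, h - 1)])
    return boxes
```

With the official SAM codebase (or MedSAM weights loaded into it), each box can then be passed to the predictor's box-prompt interface, e.g. `SamPredictor.predict(box=..., multimask_output=False)`, to obtain a refined mask.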
5. Weakly-supervised and Semi-supervised Adaptations
Recent studies leverage MedSAM to create pseudo labels from unlabeled data, which are then used to train U-Net or UNet++ models in weakly or semi-supervised fashion (Häkkinen et al., 30 Sep 2024; Mao et al., 10 Mar 2025). The performance drop versus fully supervised settings is <0.04 Dice on most organs, while annotation effort is drastically reduced.
- Pseudo labels generated with box prompts are consistently high-quality.
- Weakly-supervised UNet models trained with these pseudo labels approach the performance of fully-supervised counterparts.
This knowledge distillation from SAM/MedSAM to U-Net models remains robust even as the volume of labeled data decreases.
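A condensed sketch of this pipeline is shown below, assuming a SAM-style `predictor` (e.g., `segment_anything.SamPredictor` loaded with MedSAM weights) and a trainable `unet`; the function names and the single-class binary setup are illustrative.

```python
import numpy as np
import torch
import torch.nn.functional as F

def generate_pseudo_labels(predictor, images, boxes_per_image):
    """Run box-prompted MedSAM over unlabeled images to build pseudo masks."""
    labels = []
    for img, boxes in zip(images, boxes_per_image):
        predictor.set_image(img)                       # HWC uint8 image
        mask = np.zeros(img.shape[:2], dtype=np.float32)
        for box in boxes:                              # one box prompt per object
            m, _, _ = predictor.predict(box=np.asarray(box),
                                        multimask_output=False)
            mask = np.maximum(mask, m[0].astype(np.float32))
        labels.append(mask)
    return labels

def distill_step(unet, optimizer, batch_images, batch_pseudo):
    """One weakly-supervised training step on MedSAM pseudo labels."""
    optimizer.zero_grad()
    logits = unet(batch_images)                        # (B, 1, H, W) logits
    loss = F.binary_cross_entropy_with_logits(logits, batch_pseudo)
    loss.backward()
    optimizer.step()
    return loss.item()
```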
6. Evaluation, Limitations, and Statistical Instability
Benchmark studies show mixed results for MedSAM+UNet in extremely low-data regimes (Konrad et al., 7 Sep 2025). Macro-average Dice and IoU scores for MedSAM+UNet (Dice ~0.735±0.06 with leave-one-out cross-validation; ~0.620±0.01 with 3-fold CV) lag behind SegFormer and standard U-Net, particularly at small sample sizes. Variance in scores is attributed mainly to data-split instability: subtle boundary mismatches can produce large metric swings. The implication is that architectural improvements may be masked by limited or noisy data; clinical deployment should therefore weigh both mean scores and variance, possibly favoring ensembles or more stable models under uncertainty.
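To make the reporting convention concrete, a small helper (illustrative, not taken from the cited study) that computes per-fold macro Dice and summarizes it as mean ± standard deviation across folds:

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-7) -> float:
    """Dice coefficient for binary masks."""
    inter = np.logical_and(pred, gt).sum()
    return (2.0 * inter + eps) / (pred.sum() + gt.sum() + eps)

def cv_report(folds):
    """folds: list of folds, each a list of (pred, gt) mask pairs.
    Returns (mean, std) of the per-fold macro Dice, so the spread
    across data splits is reported alongside the mean."""
    fold_scores = [np.mean([dice(p, g) for p, g in fold]) for fold in folds]
    return float(np.mean(fold_scores)), float(np.std(fold_scores))
```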
7. Future Directions and Technical Innovations
Future research is directed at:
- Incorporating test-time adaptation modules (e.g., SAM-TTA’s self-adaptive Bezier curve-based transformation and dual-scale uncertainty-driven mean teacher adaptation (Wu et al., 5 Jun 2025)).
- Automatic prompt generation and refinement, as seen in Self-Prompt-SAM (Xie et al., 2 Feb 2025) and RFMedSAM 2 (Xie et al., 4 Feb 2025), which outperforms nnUNet by 2–12% Dice on leading benchmarks.
- Efficient LoRA-based fine-tuning and support-set driven attention (Xing et al., 24 Jun 2025), addressing domain shift and computational barriers for medical image adaptation.
- Implicit representation models (e.g., I-MedSAM (Wei et al., 2023)) merging MedSAM and high-frequency adapters, overcoming discretization artifacts and improving boundary delineation with fewer trainable parameters.
These innovations demonstrate that MedSAM+UNet hybrids are evolving into adaptive, prompt-free, shape-regularized, and uncertainty-aware architectures, capable of effective few-shot and cross-domain segmentation with scalable clinical deployment.
In summary, MedSAM+UNet models exploit the complementary strengths of medical-adapted transformer encoders and local refinement decoders, augmented with advanced prompt generation, anatomical regularization, and statistical robustness modules. While performance varies with data availability and evaluation protocol, recent advances position these hybrids at the forefront of efficient, accurate, and clinically viable medical image segmentation.