
SAM2-UNet: Efficient Segmentation Architectures

Updated 5 January 2026
  • The paper introduces SAM2-UNet by integrating a frozen SAM2-Hiera encoder with lightweight adapters and a U-Net decoder to enable efficient adaptation for segmentation tasks.
  • It utilizes parameter-efficient adapters and multi-scale fusion techniques that enhance segmentation performance across natural and medical imaging benchmarks.
  • Extensive benchmarking and ablation studies demonstrate that SAM2-UNet variants outperform specialized state-of-the-art models while minimizing fine-tuning costs.

SAM2-UNet refers to a family of image segmentation models built by freezing the Segment Anything Model 2 (SAM2) “Hiera” transformer encoder and pairing it with a classic U-shaped (U-Net-style) decoder, augmented by lightweight, parameter-efficient adapters. The category encompasses not only the original “SAM2-UNet” instantiation but also a spectrum of domain- and modality-adapted architectures that share the principle of decoupling large foundation-model pretraining from efficient, plug-and-play downstream adaptation. Architectures in this family have demonstrated strong performance across natural and medical image segmentation tasks, routinely outperforming specialized state-of-the-art models at minimal fine-tuning cost (Xiong et al., 2024; Huo et al., 6 Feb 2025; Xu et al., 27 Mar 2025; Huo et al., 22 May 2025; Zhang et al., 7 Oct 2025).

1. Foundation: SAM2-UNet Core Architecture

At its core, SAM2-UNet uses the frozen Hiera backbone from SAM2 (defaulting to Hiera-L with 214M parameters) as the encoder and attaches a classical three-stage U-Net decoder. The prompt- and memory-specific components of SAM2 are omitted, retaining only the transformer-based encoder. The encoder produces four multi-scale feature maps $X_i \in \mathbb{R}^{C_i \times \frac{H}{2^{i+1}} \times \frac{W}{2^{i+1}}}$, $i = 1, \dots, 4$, with $C_i \in \{144, 288, 576, 1152\}$.

Adapters are inserted before each Hiera block for parameter-efficient fine-tuning. Each adapter consists of a down-projection ($W_d \in \mathbb{R}^{r \times C_i}$ with $r = C_i/16$), a GeLU activation, an up-projection ($W_u \in \mathbb{R}^{C_i \times r}$), and a second GeLU. For an input $Z \in \mathbb{R}^{C_i \times N}$,

$$A(Z) = \mathrm{GeLU}\left(W_u\, \mathrm{GeLU}(W_d\, Z)\right)$$

These modules add a negligible number of parameters (≈1.5M total) compared to the frozen encoder (214M), and only adapter weights and the decoder (≈5M) are updated during fine-tuning.
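
A minimal PyTorch sketch of such an adapter is given below, assuming token features with channels in the last dimension and a reduction ratio of 16; the class and attribute names are illustrative and not taken from the released implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Parameter-efficient adapter: down-project, GeLU, up-project, GeLU."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(channels // reduction, 1)   # r = C_i / 16
        self.down = nn.Linear(channels, hidden)  # W_d
        self.up = nn.Linear(hidden, channels)    # W_u
        self.act = nn.GELU()

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (..., C_i) token features entering a Hiera block
        return self.act(self.up(self.act(self.down(z))))
```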

Each decoder stage upsamples via bilinear interpolation, concatenates with the corresponding encoder feature (after channel-reduction and context aggregation via Receptive Field Blocks, RFBs), applies two 3×3 Conv–BN–ReLU layers, and generates intermediate side outputs SjS_j for multi-level deep supervision.
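
The decoder stage can be sketched as follows; the RFB is abbreviated here to a simple 1×1 channel-reduction block, and all names are illustrative rather than drawn from the reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(cin, cout, k=3):
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class DecoderStage(nn.Module):
    """Upsample, concatenate with the (reduced) skip feature, refine, emit a side output."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int = 64):
        super().__init__()
        self.reduce = conv_bn_relu(skip_ch, out_ch, k=1)  # stand-in for the RFB
        self.fuse = nn.Sequential(
            conv_bn_relu(in_ch + out_ch, out_ch),
            conv_bn_relu(out_ch, out_ch),
        )
        self.side = nn.Conv2d(out_ch, 1, 1)               # side output S_j for deep supervision

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        x = self.fuse(torch.cat([x, self.reduce(skip)], dim=1))
        return x, self.side(x)
```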

Training employs AdamW with deep supervision: for each side output $S_j$, the total loss sums weighted IoU and weighted BCE terms,

$$\mathcal{L}_{\text{total}} = \sum_{j=1}^{3} \left( \mathcal{L}_{IoU}^{w}(G, S_j) + \mathcal{L}_{BCE}^{w}(G, S_j) \right)$$

with pixel weights $w_i$ and $p_i, g_i$ the predicted probabilities and ground-truth labels, respectively (Xiong et al., 2024).
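
The weighted IoU and weighted BCE terms are commonly implemented as the F3Net/PraNet-style "structure loss", with pixel weights derived from local average pooling of the ground truth; the sketch below assumes that formulation and logit-valued side outputs.

```python
import torch
import torch.nn.functional as F

def structure_loss(logits, mask):
    """Weighted IoU + weighted BCE with boundary-emphasizing pixel weights (assumed formulation)."""
    weit = 1 + 5 * torch.abs(F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)
    wbce = F.binary_cross_entropy_with_logits(logits, mask, reduction="none")
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    pred = torch.sigmoid(logits)
    inter = (pred * mask * weit).sum(dim=(2, 3))
    union = ((pred + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()

def total_loss(side_outputs, gt):
    """Deep supervision: sum the structure loss over all side outputs S_j."""
    return sum(
        structure_loss(F.interpolate(s, size=gt.shape[-2:], mode="bilinear", align_corners=False), gt)
        for s in side_outputs
    )
```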

2. Task Domains and Dataset Coverage

SAM2-UNet and its descendants have been validated across an exceptionally broad set of segmentation tasks, including salient object detection (SOD), camouflaged object detection (COD), polyp and other medical image segmentation, marine animal segmentation, cardiac MRI, and few-shot muscle segmentation on MRI and CT.

Uniform image resizing (typically to 352 × 352 or 256 × 256) and binary label masks are standard preprocessing steps.
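
A minimal preprocessing sketch along these lines, assuming PIL inputs, ImageNet normalization statistics, and a 352 × 352 target size; exact sizes and statistics vary by task.

```python
import numpy as np
import torch
from PIL import Image
from torchvision import transforms

IMG_SIZE = 352  # 256 × 256 is used for some medical tasks

image_tf = transforms.Compose([
    transforms.Resize((IMG_SIZE, IMG_SIZE)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics (assumed)
                         std=[0.229, 0.224, 0.225]),
])

def load_pair(img_path: str, mask_path: str):
    """Load one image/mask pair, resize both, and binarize the mask to {0, 1}."""
    image = image_tf(Image.open(img_path).convert("RGB"))
    mask = Image.open(mask_path).convert("L").resize((IMG_SIZE, IMG_SIZE), Image.NEAREST)
    mask = torch.from_numpy((np.array(mask) > 127).astype(np.float32)).unsqueeze(0)
    return image, mask
```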

3. Key Variants and Advances

The SAM2-UNet framework underpins several advanced segmentation frameworks, each targeting domain- or task-specific limitations through modular innovations.

Comparison of SAM2-UNet Variants

Model | Encoder(s) | Key Modules | Notable Application(s)
--- | --- | --- | ---
SAM2-UNet | SAM2-Hiera (frozen) | Adapters, RFBs | General segmentation, SOD, COD
FE-UNet | SAM2-Hiera (frozen) | WSPM, FE-RFB | Frequency-balanced polyp/MAS segmentation
DSU-Net | SAM2-Hiera + DINOv2-ViT (frozen) | Multi-scale cross-modal fusion | SOD, COD
SAMba-UNet | SAM2-Hiera + VMamba (frozen) | DFFR, HOACM | Cardiac MRI
nnSAM2 | SAM2-Hiera (frozen) + nnU-Net | Iterative pseudo-labeling with nnU-Net | Ultra-few-shot muscle segmentation

FE-UNet augments SAM2-UNet with a Wavelet-Guided Spectral Pooling Module (WSPM) and a Frequency-Domain Enhanced RFB (FE-RFB) to correct transformers’ deficiency in mid/high-frequency content. It employs Haar wavelet-based DWTConv and spectral pooling filters to explicitly control frequency content at each stage, boosting mDice/mIoU on marine and medical datasets relative to both transformer and CNN competitors.
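
As an illustration of the wavelet side of this design, the sketch below implements a single-level Haar decomposition as a fixed depthwise convolution, separating a feature map into low- and high-frequency sub-bands; it is an assumption-laden simplification, not the released FE-UNet modules.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HaarDWT(nn.Module):
    """Single-level 2D Haar transform as a fixed, depthwise stride-2 convolution."""
    def __init__(self, channels: int):
        super().__init__()
        ll = torch.tensor([[0.5, 0.5], [0.5, 0.5]])
        lh = torch.tensor([[-0.5, -0.5], [0.5, 0.5]])
        hl = torch.tensor([[-0.5, 0.5], [-0.5, 0.5]])
        hh = torch.tensor([[0.5, -0.5], [-0.5, 0.5]])
        kernels = torch.stack([ll, lh, hl, hh]).unsqueeze(1)  # (4, 1, 2, 2)
        kernels = kernels.repeat(channels, 1, 1, 1)           # 4 sub-bands per input channel
        self.register_buffer("weight", kernels)
        self.channels = channels

    def forward(self, x):
        # Returns (low, high): the LL band and the concatenated LH/HL/HH detail bands.
        bands = F.conv2d(x, self.weight, stride=2, groups=self.channels)
        bands = bands.view(x.size(0), self.channels, 4, x.size(2) // 2, x.size(3) // 2)
        low = bands[:, :, 0]
        high = bands[:, :, 1:].flatten(1, 2)
        return low, high
```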

DSU-Net fuses SAM2’s and DINOv2’s feature hierarchies via channel modulation and content-guided attention (CGA), retaining the U-shaped decoder while leveraging DINOv2 for high-level semantic injection. Cross-modal attention provides significant boosts in F-measure for camouflage and saliency detection (e.g., +3% F-measure on COD10K). Only the adapters and channel modulation blocks are trainable; both encoders remain frozen.
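
A hedged sketch of channel-modulated cross-modal fusion between a SAM2 feature map and a spatially aligned DINOv2 feature map; the actual CGA block is more elaborate, so the module structure and names below are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelModulatedFusion(nn.Module):
    """Inject DINOv2 semantics into SAM2 features via channel re-weighting plus a spatial gate."""
    def __init__(self, sam_ch: int, dino_ch: int):
        super().__init__()
        self.proj = nn.Conv2d(dino_ch, sam_ch, kernel_size=1)  # align channel widths
        self.channel_gate = nn.Sequential(                     # channel modulation
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(sam_ch, sam_ch, 1),
            nn.Sigmoid(),
        )
        self.spatial_gate = nn.Sequential(                     # content-guided spatial attention
            nn.Conv2d(2 * sam_ch, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, f_sam, f_dino):
        f_dino = self.proj(F.interpolate(f_dino, size=f_sam.shape[-2:],
                                         mode="bilinear", align_corners=False))
        fused = f_sam * self.channel_gate(f_dino)              # re-weight SAM2 channels
        gate = self.spatial_gate(torch.cat([fused, f_dino], dim=1))
        return fused + gate * f_dino                           # residual semantic injection
```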

SAMba-UNet implements a dual-encoder (SAM2 + VMamba) architecture, introducing a Dynamic Feature Fusion Refiner (DFFR) and a Heterogeneous Omni-Attention Convergence Module (HOACM) for medical-domain adaptation and heterogeneous fusion. DFFR applies multi-scale pooling/channel calibration and spatially adaptive refinement, whereas HOACM merges local (SAM2) and global (Mamba) cues. It achieves state-of-the-art mDice (0.9103) and HD95 (1.0859 mm) on ACDC cardiac MRI.
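
The fusion idea can be illustrated with a heavily simplified sketch: multi-scale pooled context calibrates the local (SAM2) channels, and a learned spatial gate blends the two encoder streams; all internals below are assumptions and do not reproduce the published DFFR/HOACM designs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchFusion(nn.Module):
    """Calibrate channels from multi-scale pooled context, then blend the two encoder streams."""
    def __init__(self, channels: int, pool_sizes=(1, 2, 4)):
        super().__init__()
        self.pool_sizes = pool_sizes
        ctx_dim = channels * sum(s * s for s in pool_sizes)
        self.calibrate = nn.Sequential(nn.Linear(ctx_dim, channels), nn.Sigmoid())
        self.blend = nn.Conv2d(2 * channels, 1, kernel_size=3, padding=1)

    def forward(self, f_local, f_global):
        # f_local (SAM2) and f_global (VMamba) are assumed to share shape (B, C, H, W).
        b, c, _, _ = f_local.shape
        ctx = torch.cat([F.adaptive_avg_pool2d(f_local, s).flatten(1)
                         for s in self.pool_sizes], dim=1)        # multi-scale pooled context
        f_local = f_local * self.calibrate(ctx).view(b, c, 1, 1)  # channel calibration
        alpha = torch.sigmoid(self.blend(torch.cat([f_local, f_global], dim=1)))
        return alpha * f_local + (1 - alpha) * f_global           # spatially adaptive merge
```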

nnSAM2 adopts frozen SAM2 as a pseudo-labeler with one annotated slice per dataset; three rounds of nnU-Net refinement under strict confidence and anatomical constraints yield few-shot segmentation results verifiably equivalent to expert measurements for muscle volume, fat ratio, and HU attenuation across MRI and CT.
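
The loop can be summarized as plain Python pseudocode; `sam2_predict`, `train_nnunet`, and `passes_checks` are injected placeholders for the corresponding components, and the confidence/anatomy checks are only indicated, not implemented.

```python
def iterative_pseudo_labeling(volumes, annotated_slice, sam2_predict, train_nnunet,
                              passes_checks, rounds=3, conf_thresh=0.9):
    """Ultra-few-shot pipeline: SAM2 seeds pseudo-labels, nnU-Net refines them over several rounds."""
    # Round 0: propagate the single annotated slice through each volume with frozen SAM2.
    pseudo_labels = {vid: sam2_predict(vol, prompt=annotated_slice)
                     for vid, vol in volumes.items()}

    model = None
    for _ in range(rounds):
        # Keep only pseudo-labels passing confidence and anatomical-plausibility checks.
        # (`.confidence` is an assumed attribute of the placeholder label objects.)
        trusted = {vid: lab for vid, lab in pseudo_labels.items()
                   if lab.confidence >= conf_thresh and passes_checks(lab)}
        # Retrain nnU-Net on the trusted subset, then re-label all volumes with the new model.
        model = train_nnunet(volumes, trusted)
        pseudo_labels = {vid: model.predict(vol) for vid, vol in volumes.items()}
    return model
```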

4. Quantitative Benchmarking and Ablation Insights

SAM2-UNet and its variants show consistent, often pronounced, performance improvements over prior art; the table below lists representative benchmark results.

Model (example) | COD10K $S_\alpha$ | NC4K $S_\alpha$ | MAS3K mIoU | MSD IoU | Kvasir mDice
--- | --- | --- | --- | --- | ---
ZoomNet | 0.838 | 0.853 | 0.736 | 0.798 | 0.904
FEDER | 0.844 | 0.862 | – | – | –
SAM2-UNet | 0.880 | 0.901 | 0.799 | 0.918 | 0.928

Ablation results indicate:

  • Larger Hiera backbones systematically improve performance: e.g., COD10K $S_\alpha$ rises from 0.822 (Tiny) to 0.880 (Large).
  • Parameter-efficient adapter tuning alone allows even small backbones to outperform previous SOTA.
  • Qualitative improvements are pronounced in challenging domains (camouflaged, polyp, or myocardium segmentation), reducing both false positives and negatives.
  • In FE-UNet, removing FE-RFB drops Kvasir mDice from 0.929 to 0.912, underscoring the necessity of frequency-domain integration (Huo et al., 6 Feb 2025).

5. Implementation and Training Protocols

Standard configuration involves:

  • Encoder: SAM2-Hiera, weights frozen, public pretraining weights.
  • Adapters/Decoders: Only lightweight modules and the U-shaped decoder are updated.
  • Optimization: AdamW; initial learning rate 1e-3 with cosine decay; batch sizes and epochs tailored by task (a minimal setup is sketched after this list).
  • Hardware: Single RTX 4090 (24GB) for baseline SAM2-UNet; FE-UNet uses 8 × NVIDIA A800s for larger-scale training.
  • Augmentations: Random flips, scaling; data-dependent multi-scale variations.
  • Inference: Fully convolutional with multi-level, deeply supervised outputs (Xiong et al., 2024, Huo et al., 6 Feb 2025, Zhang et al., 7 Oct 2025).
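
A minimal sketch of this setup, reusing the `total_loss` function from the loss sketch in Section 1; the weight decay, epoch count, and data-loader interface are assumptions rather than reported settings.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def train(model, train_loader, epochs=50, device="cuda"):
    model.to(device)
    # Only parameters with requires_grad=True (adapters + decoder) are optimized; the encoder stays frozen.
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = AdamW(params, lr=1e-3, weight_decay=1e-4)  # weight decay is an assumed value
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)

    for _ in range(epochs):
        for images, masks in train_loader:
            images, masks = images.to(device), masks.to(device)
            side_outputs = model(images)            # assumed to return a list of side-output logits S_j
            loss = total_loss(side_outputs, masks)  # deep-supervised structure loss (see Section 1)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()
    return model
```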

6. Impact and Significance in Image Segmentation

The SAM2-UNet paradigm demonstrates that freezing large vision foundation model encoders, coupled with small adapters and classic U-shaped decoders, provides a scalable, resource-efficient pathway to universal segmentation. This family consistently establishes new baselines, excelling in both natural and medical imaging, including under low data and few-shot regimes.

A plausible implication is that decoupling heavy foundation training from domain adaptation will be a persistent trend. Parameter-efficient adaptation—by adapters, cross-modal injectors, or iterative pseudo-labelers—has rapidly become the default, rather than the exception, in foundation-model segmentation research. The empirical effectiveness across highly heterogeneous segmentation tasks suggests strong generalization and modular extensibility (Xiong et al., 2024; Huo et al., 6 Feb 2025; Xu et al., 27 Mar 2025; Huo et al., 22 May 2025; Zhang et al., 7 Oct 2025).
