
SAM3-UNet: Efficient Fine-Tuning for Dense Tasks

Updated 8 December 2025
  • The paper presents SAM3-UNet, a model that fine-tunes a pre-trained SAM3 encoder using minimal adapters and a U-Net–style decoder for tasks such as mirror and salient object detection.
  • It leverages a ViT-style backbone with 446M parameters combined with 24 lightweight adapter modules, optimizing performance while maintaining a low compute and memory footprint.
  • Empirical results show notable improvements in IoU and other metrics over SAM2-UNet, demonstrating its efficiency and practical applicability for dense prediction tasks on commodity hardware.

SAM3-UNet is a simplified adaptation of Segment Anything Model 3 (SAM3), designed to enable parameter-efficient fine-tuning of SAM3 for downstream dense prediction tasks such as mirror detection and salient object detection. The architecture combines the frozen perception capabilities of SAM3’s substantial ViT-style image encoder with lightweight trainable adapters and an efficient U-Net–style decoder, yielding strong performance while maintaining a low compute and memory footprint (Xiong et al., 1 Dec 2025).

1. Architecture Structure

The model pipeline begins with an input image $I \in \mathbb{R}^{H \times W \times 3}$, processed by a frozen SAM3 image encoder. This encoder is a ViT-style backbone with $446$ million parameters. To allow task adaptation, a small trainable adapter is inserted before each of the $L = 24$ transformer blocks in the encoder, but all ViT weights remain frozen. The output tokens, with spatial shape $(H/14) \times (W/14) \times 1024$, are projected through $1 \times 1$ convolutions into four parallel feature maps, each with $128$ channels. Bilinear up- and downsampling establish a four-level hierarchical feature representation:

  • $F_1 \in \mathbb{R}^{H/4 \times W/4 \times 128}$
  • $F_2 \in \mathbb{R}^{H/8 \times W/8 \times 128}$
  • $F_3 \in \mathbb{R}^{H/16 \times W/16 \times 128}$
  • $F_4 \in \mathbb{R}^{H/32 \times W/32 \times 128}$

These feature maps are consumed by a four-stage, U-Net–style decoder, where each upsampling stage integrates skip-connections from the encoder via concatenation and further feature mixing.
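
A minimal PyTorch sketch of this projection step is given below, assuming four independent $1 \times 1$ convolutions followed by bilinear resizing to strides 4/8/16/32; the module names and interpolation settings are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeaturePyramid(nn.Module):
    """Project frozen SAM3 encoder tokens into the four-level hierarchy F1..F4.

    Hypothetical sketch of Section 1: four 1x1 projections to 128 channels and
    bilinear resizing to strides 4/8/16/32. Names and interpolation settings
    are assumptions, not the official implementation.
    """

    def __init__(self, in_dim: int = 1024, out_dim: int = 128):
        super().__init__()
        self.proj = nn.ModuleList(
            [nn.Conv2d(in_dim, out_dim, kernel_size=1) for _ in range(4)]
        )
        self.strides = (4, 8, 16, 32)

    def forward(self, tokens: torch.Tensor, height: int, width: int):
        # tokens: (B, 1024, H/14, W/14) grid from the frozen ViT-style encoder
        feats = []
        for proj, s in zip(self.proj, self.strides):
            f = proj(tokens)  # (B, 128, H/14, W/14)
            f = F.interpolate(f, size=(height // s, width // s),
                              mode="bilinear", align_corners=False)
            feats.append(f)
        return feats  # [F1, F2, F3, F4] at strides 4, 8, 16, 32


# Example: a 336x336 input gives a 24x24 token grid; F1 is then 84x84.
```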

2. Adapter and Fine-Tuning Paradigm

Adapters are bottleneck structures inserted before each transformer block for parameter-efficient adaptation. For block $\ell$ with input $x^{(\ell)} \in \mathbb{R}^{N \times d}$ ($d = 1024$), the adapter augments the residual pathway: $z^{(\ell)} = x^{(\ell)} + A\bigl(x^{(\ell)}\bigr)$, where

$$A(x) = W_{\text{up}}\,\sigma\left(W_{\text{down}}\,x\right)$$

Here $W_{\text{down}} \in \mathbb{R}^{r \times d}$, $W_{\text{up}} \in \mathbb{R}^{d \times r}$, and $\sigma$ is the GELU nonlinearity. The bottleneck dimension is $r = 32$, yielding $65{,}536$ parameters per adapter and $24 \times 65{,}536 \approx 1.57$ million adapter parameters in total. Decoder and projection heads add another $\approx 0.5$ million parameters, resulting in $\approx 2.1$ million fine-tuned parameters (about $0.5\%$ of the full model footprint).
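
A minimal PyTorch sketch of such an adapter, assuming bias-free linear projections with the stated shapes ($d = 1024$, $r = 32$); the class name and initialization are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter A(x) = W_up * GELU(W_down * x), added residually.

    Minimal sketch of Section 2; bias-free projections are an assumption.
    """

    def __init__(self, d: int = 1024, r: int = 32):
        super().__init__()
        self.down = nn.Linear(d, r, bias=False)  # W_down: r x d
        self.up = nn.Linear(r, d, bias=False)    # W_up:   d x r
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, d) tokens entering a frozen transformer block
        return x + self.up(self.act(self.down(x)))


# Parameter check: 2 * d * r = 2 * 1024 * 32 = 65,536 per adapter,
# so 24 adapters contribute roughly 1.57M trainable parameters.
print(sum(p.numel() for p in Adapter().parameters()))  # 65536
```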

3. Lightweight U-Net-Style Decoder

The decoder employs a “bottleneck + depthwise split” block at each upsampling stage. For an input feature $D \in \mathbb{R}^{H' \times W' \times C}$:

  • Apply a $1 \times 1$ Conv, BN, GELU to reduce to $C/4$ channels.
  • Split along the channel dimension into $[D_a, D_b]$.
  • $D_b$ is processed by two sequential $3 \times 3$ DWConv, BN, GELU layers, yielding $D_c$ and $D_d$.
  • The features $[D_a, D_b, D_c, D_d]$ are concatenated and mapped back to the full output channel count via a $1 \times 1$ Conv, BN, GELU.

Each decoder stage $i$ upsamples $D_{i+1}$, concatenates it with the encoder feature $F_i$, and processes the result via the above block, preserving the multi-scale semantic structure established by the backbone.
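
A hedged PyTorch sketch of one such decoder block follows; the even channel split, the assumption that $D_d$ is computed from $D_c$, and the exact Conv/BN/GELU ordering fill in details the description leaves open.

```python
import torch
import torch.nn as nn


def conv_bn_gelu(c_in: int, c_out: int, k: int, groups: int = 1) -> nn.Sequential:
    """k x k convolution followed by BatchNorm and GELU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, groups=groups, bias=False),
        nn.BatchNorm2d(c_out),
        nn.GELU(),
    )


class BottleneckSplitBlock(nn.Module):
    """'Bottleneck + depthwise split' decoder block (sketch of Section 3).

    Assumes c_in is divisible by 8 and an even channel split; these details
    are assumptions where the text is silent.
    """

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        c_mid = c_in // 4
        c_half = c_mid // 2
        self.reduce = conv_bn_gelu(c_in, c_mid, 1)                 # 1x1 Conv, BN, GELU
        self.dw1 = conv_bn_gelu(c_half, c_half, 3, groups=c_half)  # 3x3 DWConv, BN, GELU
        self.dw2 = conv_bn_gelu(c_half, c_half, 3, groups=c_half)  # 3x3 DWConv, BN, GELU
        self.fuse = conv_bn_gelu(4 * c_half, c_out, 1)             # back to full width

    def forward(self, d: torch.Tensor) -> torch.Tensor:
        d = self.reduce(d)                     # C -> C/4 channels
        d_a, d_b = torch.chunk(d, 2, dim=1)    # channel split
        d_c = self.dw1(d_b)
        d_d = self.dw2(d_c)
        return self.fuse(torch.cat([d_a, d_b, d_c, d_d], dim=1))
```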

4. Training Protocol and Resource Profiling

Experiments span two primary tasks: mirror detection (datasets: MSD, PMD) and salient object detection (trained on DUTS-TR; evaluated on DUTS-TE, DUT-OMRON, HKU-IS, PASCAL-S, ECSSD). The loss is a weighted sum of binary cross-entropy and IoU losses, $\mathcal{L} = \mathcal{L}^{\omega}_{\mathrm{BCE}} + \mathcal{L}^{\omega}_{\mathrm{IoU}}$, with weights $\omega$ following F³Net (AAAI 2020). Training employs AdamW (learning rate $2 \times 10^{-4}$, weight decay $1 \times 10^{-2}$, cosine schedule), batch size $12$, $20$ epochs, and an input resolution of $336 \times 336$. Augmentations include random horizontal and vertical flips. On a single NVIDIA RTX 4090 GPU (24 GB), total GPU memory consumption stays below 6 GB at batch size 12, a consequence of the frozen main encoder, compact adapters, and efficient decoder block design.
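
The objective can be sketched with the widely used F³Net-style weighted structure loss below; the boundary-aware weighting constants (31×31 average pooling, factor 5) are the standard F³Net choices and are an assumption with respect to SAM3-UNet's exact settings.

```python
import torch
import torch.nn.functional as F


def structure_loss(pred: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Weighted BCE + weighted IoU loss in the style of F3Net (AAAI 2020).

    pred: (B, 1, H, W) logits; mask: (B, 1, H, W) binary ground truth.
    The weighting constants follow the common F3Net recipe and are an
    assumption relative to SAM3-UNet's exact configuration.
    """
    # Pixels that differ from their local neighbourhood (boundaries) get larger weights.
    weit = 1 + 5 * torch.abs(
        F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)

    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction="none")
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    prob = torch.sigmoid(pred)
    inter = (prob * mask * weit).sum(dim=(2, 3))
    union = ((prob + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)

    return (wbce + wiou).mean()
```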

5. Empirical Results

SAM3-UNet surpasses both its immediate predecessor, SAM2-UNet, and other state-of-the-art methods across target tasks.

Mirror Detection (MSD and PMD):

| Method | MSD (IoU / F / MAE) | PMD (IoU / F / MAE) |
|---|---|---|
| SAM2-UNet | 0.918 / 0.957 / 0.022 | 0.728 / 0.826 / 0.027 |
| SAM3-UNet | 0.943 / 0.972 / 0.014 | 0.804 / 0.884 / 0.017 |

Salient Object Detection (selected datasets):

| Method ($S_\alpha$ / $E_\phi$ / MAE) | DUTS-TE | DUT-OMRON | HKU-IS | PASCAL-S | ECSSD |
|---|---|---|---|---|---|
| SAM2-UNet | 0.934 / 0.959 / 0.020 | 0.884 / 0.912 / 0.039 | 0.941 / 0.971 / 0.019 | 0.894 / 0.931 / 0.043 | 0.950 / 0.970 / 0.020 |
| SAM3-UNet | 0.936 / 0.964 / 0.019 | 0.895 / 0.921 / 0.034 | 0.939 / 0.968 / 0.020 | 0.904 / 0.939 / 0.038 | 0.950 / 0.970 / 0.019 |

Performance gains are observed on all mirror-detection metrics, with SAM3-UNet achieving a +2.5-point IoU improvement on MSD and +7.6 points on PMD. On salient object detection, it matches or outperforms prior methods on four of the five held-out datasets, including a +1.1-point $S_\alpha$ gain on DUT-OMRON.

6. Ablations and Design Analysis

Preliminary experiments substantiate several architectural decisions:

  • An adapter bottleneck size of $r = 32$ achieves the best trade-off between parameter efficiency and performance, compared with $r \in \{16, 64\}$.
  • Substituting the lightweight depthwise-split decoder block with a standard two-convolution block results in roughly 20% slower training convergence and 30% higher memory usage.
  • The four-level decoding hierarchy ($H/4$ through $H/32$) consistently outperforms reduced-depth variants by 0.5–1.0% IoU at negligible additional cost.

7. Implementation Details and Code Access

SAM3-UNet is implemented in PyTorch. The official repository is available at https://github.com/WZH0120/SAM3-UNet. Reproduction involves standard procedures: cloning the repository, installing the listed packages, configuring dataset paths, and executing train/evaluation scripts. The package includes evaluation and visualization tools.


SAM3-UNet achieves state-of-the-art accuracy for mirror and salient object detection by leveraging a frozen SAM3 backbone, minimal task adapters, and a modern U-Net–style decoder. The model operates efficiently within 6 GB GPU memory during training with a batch size of 12, allowing rapid fine-tuning on commodity hardware (Xiong et al., 1 Dec 2025).
