
SAM3-UNet: Efficient Fine-Tuning for Dense Tasks

Updated 8 December 2025
  • The paper presents SAM3-UNet, a model that fine-tunes a pre-trained SAM3 encoder using minimal adapters and a U-Net–style decoder for tasks such as mirror and salient object detection.
  • It leverages a ViT-style backbone with 446M parameters combined with 24 lightweight adapter modules, optimizing performance while maintaining a low compute and memory footprint.
  • Empirical results show notable improvements in IoU and other metrics over SAM2-UNet, demonstrating its efficiency and practical applicability for dense prediction tasks on commodity hardware.

SAM3-UNet is a simplified adaptation of Segment Anything Model 3 (SAM3), designed to enable parameter-efficient fine-tuning of SAM3 for downstream dense prediction tasks such as mirror detection and salient object detection. The architecture combines the frozen perception capabilities of SAM3’s substantial ViT-style image encoder with lightweight trainable adapters and an efficient U-Net–style decoder, yielding strong performance while maintaining a low compute and memory footprint (Xiong et al., 1 Dec 2025).

1. Architecture Structure

The model pipeline begins with an input image $I \in \mathbb{R}^{H \times W \times 3}$, processed by a frozen SAM3 image encoder. This encoder is a ViT-style backbone with $446$ million parameters. To allow task adaptation, a small trainable adapter is inserted before each of the $L = 24$ transformer blocks in the encoder, but all ViT weights remain frozen. The output tokens, with spatial shape $(H/14) \times (W/14) \times 1024$, are projected through $1 \times 1$ convolutions into four parallel feature maps, each with $128$ channels. Bilinear up- and downsampling establish a four-level hierarchical feature representation:

  • $F_1 \in \mathbb{R}^{H/4 \times W/4 \times 128}$
  • $F_2 \in \mathbb{R}^{H/8 \times W/8 \times 128}$
  • $F_3 \in \mathbb{R}^{H/16 \times W/16 \times 128}$
  • $F_4 \in \mathbb{R}^{H/32 \times W/32 \times 128}$

These feature maps are consumed by a four-stage, U-Net–style decoder, where each upsampling stage integrates skip-connections from the encoder via concatenation and further feature mixing.
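
A minimal PyTorch sketch of this projection step is given below, assuming four independent $1 \times 1$ convolutions followed by bilinear resizing to strides 4/8/16/32; the module names and interpolation settings are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeaturePyramid(nn.Module):
    """Project frozen SAM3 encoder tokens into the four-level hierarchy F1..F4.

    Hypothetical sketch of Section 1: four 1x1 projections to 128 channels and
    bilinear resizing to strides 4/8/16/32. Names and interpolation settings
    are assumptions, not the official implementation.
    """

    def __init__(self, in_dim: int = 1024, out_dim: int = 128):
        super().__init__()
        self.proj = nn.ModuleList(
            [nn.Conv2d(in_dim, out_dim, kernel_size=1) for _ in range(4)]
        )
        self.strides = (4, 8, 16, 32)

    def forward(self, tokens: torch.Tensor, height: int, width: int):
        # tokens: (B, 1024, H/14, W/14) grid from the frozen ViT-style encoder
        feats = []
        for proj, s in zip(self.proj, self.strides):
            f = proj(tokens)  # (B, 128, H/14, W/14)
            f = F.interpolate(f, size=(height // s, width // s),
                              mode="bilinear", align_corners=False)
            feats.append(f)
        return feats  # [F1, F2, F3, F4] at strides 4, 8, 16, 32


# Example: a 336x336 input gives a 24x24 token grid; F1 is then 84x84.
```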

2. Adapter and Fine-Tuning Paradigm

Adapters are bottleneck structures inserted before each transformer block for parameter-efficient adaptation. For block $\ell$ with input $x^{(\ell)} \in \mathbb{R}^{N \times d}$ ($d = 1024$), the adapter augments the residual pathway: $z^{(\ell)} = x^{(\ell)} + A\bigl(x^{(\ell)}\bigr)$, where

$$A(x) = W_{\text{up}}\,\sigma\left(W_{\text{down}}\,x\right)$$

Here $W_{\text{down}} \in \mathbb{R}^{r \times d}$, $W_{\text{up}} \in \mathbb{R}^{d \times r}$, and $\sigma$ is the GELU nonlinearity. The bottleneck dimension is $r = 32$, yielding $65{,}536$ parameters per adapter and $24 \times 65{,}536 \approx 1.57$ million adapter parameters in total. Decoder and projection heads add another $\approx 0.5$ million parameters, resulting in $\approx 2.1$ million fine-tuned parameters (about $0.5\%$ of the full model footprint).
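
A minimal PyTorch sketch of such an adapter, assuming bias-free linear projections with the stated shapes ($d = 1024$, $r = 32$); the class name and initialization are illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter A(x) = W_up * GELU(W_down * x), added residually.

    Minimal sketch of Section 2; bias-free projections are an assumption.
    """

    def __init__(self, d: int = 1024, r: int = 32):
        super().__init__()
        self.down = nn.Linear(d, r, bias=False)  # W_down: r x d
        self.up = nn.Linear(r, d, bias=False)    # W_up:   d x r
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, d) tokens entering a frozen transformer block
        return x + self.up(self.act(self.down(x)))


# Parameter check: 2 * d * r = 2 * 1024 * 32 = 65,536 per adapter,
# so 24 adapters contribute roughly 1.57M trainable parameters.
print(sum(p.numel() for p in Adapter().parameters()))  # 65536
```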

3. Lightweight U-Net-Style Decoder

The decoder employs a “bottleneck + depthwise split” block at each upsampling stage. For an input feature $D \in \mathbb{R}^{H' \times W' \times C}$:

  • Apply a $1 \times 1$ Conv, BN, GELU to reduce to $C/4$ channels.
  • Split along the channel dimension into $[D_a, D_b]$.
  • $D_b$ is processed by two sequential $3 \times 3$ DWConv, BN, GELU layers, yielding $D_c$ and $D_d$.
  • The features $[D_a, D_b, D_c, D_d]$ are concatenated and mapped back to the full output channel count via a $1 \times 1$ Conv, BN, GELU.

Each decoder stage $i$ upsamples $D_{i+1}$, concatenates it with the encoder feature $F_i$, and processes the result via the above block, preserving the multi-scale semantic structure established by the backbone.
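
A hedged PyTorch sketch of one such decoder block follows; the even channel split, the assumption that $D_d$ is computed from $D_c$, and the exact Conv/BN/GELU ordering fill in details the description leaves open.

```python
import torch
import torch.nn as nn


def conv_bn_gelu(c_in: int, c_out: int, k: int, groups: int = 1) -> nn.Sequential:
    """k x k convolution followed by BatchNorm and GELU."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, groups=groups, bias=False),
        nn.BatchNorm2d(c_out),
        nn.GELU(),
    )


class BottleneckSplitBlock(nn.Module):
    """'Bottleneck + depthwise split' decoder block (sketch of Section 3).

    Assumes c_in is divisible by 8 and an even channel split; these details
    are assumptions where the text is silent.
    """

    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        c_mid = c_in // 4
        c_half = c_mid // 2
        self.reduce = conv_bn_gelu(c_in, c_mid, 1)                 # 1x1 Conv, BN, GELU
        self.dw1 = conv_bn_gelu(c_half, c_half, 3, groups=c_half)  # 3x3 DWConv, BN, GELU
        self.dw2 = conv_bn_gelu(c_half, c_half, 3, groups=c_half)  # 3x3 DWConv, BN, GELU
        self.fuse = conv_bn_gelu(4 * c_half, c_out, 1)             # back to full width

    def forward(self, d: torch.Tensor) -> torch.Tensor:
        d = self.reduce(d)                     # C -> C/4 channels
        d_a, d_b = torch.chunk(d, 2, dim=1)    # channel split
        d_c = self.dw1(d_b)
        d_d = self.dw2(d_c)
        return self.fuse(torch.cat([d_a, d_b, d_c, d_d], dim=1))
```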

4. Training Protocol and Resource Profiling

Experiments span two primary tasks: mirror detection (datasets: MSD, PMD) and salient object detection (trained on DUTS-TR; evaluated on DUTS-TE, DUT-OMRON, HKU-IS, PASCAL-S, ECSSD). The loss is a weighted sum of binary cross-entropy and IoU losses, $\mathcal{L} = \mathcal{L}^{\omega}_{\mathrm{BCE}} + \mathcal{L}^{\omega}_{\mathrm{IoU}}$, with weights $\omega$ following F³Net (AAAI 2020). Training employs AdamW (learning rate $2 \times 10^{-4}$, weight decay $1 \times 10^{-2}$, cosine schedule), batch size $12$, $20$ epochs, and an input resolution of $336 \times 336$. Augmentations include random horizontal and vertical flips. On a single NVIDIA RTX 4090 GPU (24 GB), total GPU memory consumption stays below 6 GB at batch size 12, a consequence of the frozen main encoder, compact adapters, and efficient decoder block design.
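
The objective can be sketched with the widely used F³Net-style weighted structure loss below; the boundary-aware weighting constants (31×31 average pooling, factor 5) are the standard F³Net choices and are an assumption with respect to SAM3-UNet's exact settings.

```python
import torch
import torch.nn.functional as F


def structure_loss(pred: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Weighted BCE + weighted IoU loss in the style of F3Net (AAAI 2020).

    pred: (B, 1, H, W) logits; mask: (B, 1, H, W) binary ground truth.
    The weighting constants follow the common F3Net recipe and are an
    assumption relative to SAM3-UNet's exact configuration.
    """
    # Pixels that differ from their local neighbourhood (boundaries) get larger weights.
    weit = 1 + 5 * torch.abs(
        F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask)

    wbce = F.binary_cross_entropy_with_logits(pred, mask, reduction="none")
    wbce = (weit * wbce).sum(dim=(2, 3)) / weit.sum(dim=(2, 3))

    prob = torch.sigmoid(pred)
    inter = (prob * mask * weit).sum(dim=(2, 3))
    union = ((prob + mask) * weit).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)

    return (wbce + wiou).mean()
```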

5. Empirical Results

SAM3-UNet surpasses both its immediate predecessor, SAM2-UNet, and other state-of-the-art methods across target tasks.

Mirror Detection (MSD and PMD):

| Method | MSD (IoU / F / MAE) | PMD (IoU / F / MAE) |
|---|---|---|
| SAM2-UNet | 0.918 / 0.957 / 0.022 | 0.728 / 0.826 / 0.027 |
| SAM3-UNet | 0.943 / 0.972 / 0.014 | 0.804 / 0.884 / 0.017 |

Salient Object Detection (selected datasets):

| Method ($S_\alpha$ / $E_\phi$ / MAE) | DUTS-TE | DUT-OMRON | HKU-IS | PASCAL-S | ECSSD |
|---|---|---|---|---|---|
| SAM2-UNet | 0.934 / 0.959 / 0.020 | 0.884 / 0.912 / 0.039 | 0.941 / 0.971 / 0.019 | 0.894 / 0.931 / 0.043 | 0.950 / 0.970 / 0.020 |
| SAM3-UNet | 0.936 / 0.964 / 0.019 | 0.895 / 0.921 / 0.034 | 0.939 / 0.968 / 0.020 | 0.904 / 0.939 / 0.038 | 0.950 / 0.970 / 0.019 |

Performance gains are observed on all mirror-detection metrics, with SAM3-UNet achieving a +2.5-point IoU improvement on MSD and +7.6 points on PMD. On salient object detection, it matches or outperforms prior methods on four of the five held-out datasets, including a +1.1-point $S_\alpha$ gain on DUT-OMRON.

6. Ablations and Design Analysis

Preliminary experiments substantiate several architectural decisions:

  • An adapter bottleneck size of $r = 32$ achieves the best trade-off between parameter efficiency and performance, compared with $r \in \{16, 64\}$.
  • Substituting the lightweight depthwise-split decoder block with a standard two-convolution block results in roughly 20% slower training convergence and 30% higher memory usage.
  • The four-level decoding hierarchy ($H/4$ through $H/32$) consistently outperforms reduced-depth variants by 0.5–1.0% IoU at negligible additional cost.

7. Implementation Details and Code Access

SAM3-UNet is implemented in PyTorch. The official repository is available at https://github.com/WZH0120/SAM3-UNet. Reproduction involves standard procedures: cloning the repository, installing the listed packages, configuring dataset paths, and executing train/evaluation scripts. The package includes evaluation and visualization tools.


SAM3-UNet achieves state-of-the-art accuracy for mirror and salient object detection by leveraging a frozen SAM3 backbone, minimal task adapters, and a modern U-Net–style decoder. The model operates efficiently within 6 GB GPU memory during training with a batch size of 12, allowing rapid fine-tuning on commodity hardware (Xiong et al., 1 Dec 2025).
