
SAM3-UNet: Efficient Fine-Tuning for Dense Tasks

Updated 8 December 2025
  • The paper presents SAM3-UNet, a model that fine-tunes a pre-trained SAM3 encoder using minimal adapters and a U-Net–style decoder for tasks such as mirror and salient object detection.
  • It leverages a ViT-style backbone with 446M parameters combined with 24 lightweight adapter modules, optimizing performance while maintaining a low compute and memory footprint.
  • Empirical results show notable improvements in IoU and other metrics over SAM2-UNet, demonstrating its efficiency and practical applicability for dense prediction tasks on commodity hardware.

SAM3-UNet is a simplified adaptation of Segment Anything Model 3 (SAM3), designed to enable parameter-efficient fine-tuning of SAM3 for downstream dense prediction tasks such as mirror detection and salient object detection. The architecture combines the frozen perception capabilities of SAM3’s substantial ViT-style image encoder with lightweight trainable adapters and an efficient U-Net–style decoder, yielding strong performance while maintaining a low compute and memory footprint (Xiong et al., 1 Dec 2025).

1. Architecture Structure

The model pipeline begins with an input image $I \in \mathbb{R}^{H \times W \times 3}$, processed by a frozen SAM3 image encoder. This encoder is a ViT-style backbone with 446 million parameters. To allow task adaptation, a small trainable adapter is inserted before each of the $L=24$ transformer blocks in the encoder, while all ViT weights remain frozen. The output tokens, with spatial shape $(H/14) \times (W/14) \times 1024$, are projected through $1 \times 1$ convolutions into four parallel feature maps, each with 128 channels. Bilinear up- and downsampling then establish a four-level hierarchical feature representation:

  • $F_1 \in \mathbb{R}^{H/4 \times W/4 \times 128}$
  • $F_2 \in \mathbb{R}^{H/8 \times W/8 \times 128}$
  • $F_3 \in \mathbb{R}^{H/16 \times W/16 \times 128}$
  • $F_4 \in \mathbb{R}^{H/32 \times W/32 \times 128}$
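The projection and resizing steps above can be sketched in PyTorch as follows. This is a minimal illustration, not the authors' code: the class name `FeaturePyramid` and its arguments are hypothetical, and the encoder output is assumed to be already reshaped from tokens into a `(B, 1024, H/14, W/14)` tensor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeaturePyramid(nn.Module):
    """Hypothetical helper: four 1x1 projections, each resized to one pyramid level."""

    def __init__(self, in_ch=1024, out_ch=128, strides=(4, 8, 16, 32)):
        super().__init__()
        self.strides = strides
        # One independent 1x1 convolution per level (four parallel feature maps).
        self.proj = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, kernel_size=1) for _ in strides]
        )

    def forward(self, tokens, image_size):
        h, w = image_size
        feats = []
        for conv, s in zip(self.proj, self.strides):
            x = conv(tokens)
            # Bilinear resize from the stride-14 grid to stride s (up or down).
            x = F.interpolate(x, size=(h // s, w // s), mode="bilinear",
                              align_corners=False)
            feats.append(x)
        return feats  # [F1, F2, F3, F4]


# Usage with a 224x224 image, i.e. a 16x16 token grid.
pyramid = FeaturePyramid()
tokens = torch.randn(1, 1024, 224 // 14, 224 // 14)
features = pyramid(tokens, (224, 224))
print([tuple(f.shape) for f in features])
```

Levels $F_1$ and $F_2$ are upsampled from the stride-14 grid, while $F_3$ and $F_4$ are downsampled, which is why both directions of bilinear resizing are needed.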

These feature maps are consumed by a four-stage, U-Net–style decoder, where each upsampling stage integrates skip-connections from the encoder via concatenation and further feature mixing.

2. Adapter and Fine-Tuning Paradigm

Adapters are bottleneck structures inserted before each transformer block for parameter-efficient adaptation. For block $i$ with input tokens $x_i$, the adapter augments the residual pathway: a down-projection maps the tokens to a small bottleneck dimension, a GELU nonlinearity $\sigma$ is applied, an up-projection restores the token width, and the result is added back to $x_i$.

Only the adapters, the decoder, and the projection heads are trained. Because each adapter holds just two small projection matrices, the fine-tuned parameters account for a small fraction of the full model footprint.
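A bottleneck adapter of this kind, placed in front of a frozen block, might look like the sketch below. Several details are assumptions made for illustration rather than values taken from the paper: the bottleneck width of 32, the zero initialization of the up-projection, the class names `Adapter` and `AdaptedBlock`, and the use of a plain `nn.Linear` as a stand-in for a frozen transformer block.

```python
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Residual bottleneck: x + up(GELU(down(x)))."""

    def __init__(self, dim=1024, bottleneck=32):  # bottleneck=32 is an assumed value
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        # Assumed design choice: start as an identity mapping.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))


class AdaptedBlock(nn.Module):
    """Runs a trainable adapter before a frozen block."""

    def __init__(self, block, dim=1024, bottleneck=32):
        super().__init__()
        self.adapter = Adapter(dim, bottleneck)
        self.block = block
        for p in self.block.parameters():
            p.requires_grad = False  # backbone weights stay frozen

    def forward(self, x):
        return self.block(self.adapter(x))


frozen_block = nn.Linear(1024, 1024)  # stand-in for one transformer block
layer = AdaptedBlock(frozen_block)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 1024 * 32 + 32 + 1024 = 66592
```

With a token width $d$ and bottleneck width $r$, each such adapter has $2dr + r + d$ trainable parameters, so the total across $L=24$ blocks grows linearly in $r$.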

3. Lightweight U-Net-Style Decoder

The decoder employs a “bottleneck + depthwise split” block at each upsampling stage. For an input feature map:

  • Apply Conv, BN, GELU to reduce the channel count.
  • Split the result into two parts along the channel dimension.
  • Process one part with two sequential DWConv, BN, GELU layers, keeping the output of each layer.
  • Concatenate the resulting features and map them back to the full output channel count via Conv, BN, GELU.

Each decoder stage upsamples the output of the preceding, coarser stage, concatenates it with the encoder feature of matching resolution, and processes the result via the above block, preserving the multi-scale semantic structure established by the backbone.
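One plausible reading of the block and stage described above is sketched here. The kernel sizes (1×1 for the reduce and fuse convolutions, 3×3 for the depthwise ones), the reduction to half the output channels, the choice of which half is processed, and the names `SplitDWBlock` and `DecoderStage` are all assumptions for illustration. The official repository should be consulted for the actual layout.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_bn_gelu(in_ch, out_ch, kernel_size, groups=1):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2,
                  groups=groups, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.GELU(),
    )


class SplitDWBlock(nn.Module):
    """Assumed layout: reduce -> split -> two depthwise convs -> concat -> fuse."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        mid = out_ch // 2   # assumed reduction ratio
        half = mid // 2
        self.reduce = conv_bn_gelu(in_ch, mid, 1)
        self.dw1 = conv_bn_gelu(half, half, 3, groups=half)  # depthwise
        self.dw2 = conv_bn_gelu(half, half, 3, groups=half)  # depthwise
        self.fuse = conv_bn_gelu(3 * half, out_ch, 1)

    def forward(self, x):
        x = self.reduce(x)
        x1, x2 = torch.chunk(x, 2, dim=1)   # split along channels
        y1 = self.dw1(x2)
        y2 = self.dw2(y1)
        return self.fuse(torch.cat([x1, y1, y2], dim=1))


class DecoderStage(nn.Module):
    """Upsample the coarser feature, concatenate the skip feature, apply the block."""

    def __init__(self, ch=128):
        super().__init__()
        self.block = SplitDWBlock(2 * ch, ch)

    def forward(self, coarse, skip):
        up = F.interpolate(coarse, size=skip.shape[-2:], mode="bilinear",
                           align_corners=False)
        return self.block(torch.cat([up, skip], dim=1))


stage = DecoderStage(128).eval()
with torch.no_grad():
    out = stage(torch.randn(1, 128, 7, 7), torch.randn(1, 128, 14, 14))
print(tuple(out.shape))  # (1, 128, 14, 14)
```

Depthwise convolutions use one filter per channel (`groups` equal to the channel count), which is what keeps the parameter and memory cost of the block low compared with a standard two-convolution block.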

4. Training Protocol and Resource Profiling

Experiments span two primary tasks: mirror detection (datasets: MSD, PMD) and salient object detection (trained on DUTS-TR and evaluated on DUTS-TE, DUT-OMRON, HKU-IS, PASCAL-S, and ECSSD). The loss is a weighted sum of binary cross-entropy and IoU losses, $\mathcal{L} = \mathcal{L}^{w}_{\mathrm{BCE}} + \mathcal{L}^{w}_{\mathrm{IoU}}$, with class-balancing weights following F³Net (AAAI 2020). Training employs AdamW with weight decay and a cosine learning-rate schedule, batch size 12, and 20 epochs. Augmentations include random horizontal and vertical flips. On a single NVIDIA RTX 4090 GPU (24 GB), total GPU memory consumption stays below 6 GB at batch size 12, a consequence of the frozen main encoder, compact adapters, and efficient decoder block design.
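For reference, a weighted BCE plus weighted IoU loss in the style of the publicly available F³Net implementation can be written as below. The boundary-weighting constants (a factor of 5 and a 31×31 average-pooling window) come from that implementation. Whether SAM3-UNet uses exactly these values is an assumption, and the function name `structure_loss` is illustrative.

```python
import torch
import torch.nn.functional as F


def structure_loss(logits, mask):
    """Weighted BCE + weighted IoU on (B, 1, H, W) logits and binary masks."""
    # Pixels whose neighbourhood differs from them (near boundaries) get larger weights.
    weight = 1 + 5 * torch.abs(
        F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask
    )
    bce = F.binary_cross_entropy_with_logits(logits, mask, reduction="none")
    wbce = (weight * bce).sum(dim=(2, 3)) / weight.sum(dim=(2, 3))

    prob = torch.sigmoid(logits)
    inter = (prob * mask * weight).sum(dim=(2, 3))
    union = ((prob + mask) * weight).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()


# Usage: a confident correct prediction scores far lower than an inverted one.
mask = torch.zeros(1, 1, 64, 64)
mask[:, :, 16:48, 16:48] = 1.0
good = structure_loss((mask * 2 - 1) * 20, mask)
bad = structure_loss(-(mask * 2 - 1) * 20, mask)
print(good.item() < bad.item())  # True
```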

5. Empirical Results

SAM3-UNet surpasses both its immediate predecessor, SAM2-UNet, and other state-of-the-art methods across target tasks.

Mirror Detection (MSD and PMD):

Method      MSD IoU  MSD F  MSD MAE  PMD IoU  PMD F  PMD MAE
SAM2-UNet   0.918    0.957  0.022    0.728    0.826  0.027
SAM3-UNet   0.943    0.972  0.014    0.804    0.884  0.017

Salient Object Detection (selected datasets):

Each entry reports Sα/Eφ/MAE.

Method      DUTS-TE            DUT-OMRON          HKU-IS             PASCAL-S           ECSSD
SAM2-UNet   0.934/0.959/0.020  0.884/0.912/0.039  0.941/0.971/0.019  0.894/0.931/0.043  0.950/0.970/0.020
SAM3-UNet   0.936/0.964/0.019  0.895/0.921/0.034  0.939/0.968/0.020  0.904/0.939/0.038  0.950/0.970/0.019

On mirror detection, gains are observed on all metrics, with SAM3-UNet improving IoU by 2.5 percentage points on MSD and 7.6 percentage points on PMD. On salient object detection, it matches or outperforms prior methods on four of the five held-out datasets (on HKU-IS it trails SAM2-UNet slightly), with a gain of 1.1 points in $S_\alpha$ on DUT-OMRON.
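The quoted gains are absolute differences between the tabulated scores, expressed in percentage points. The short check below recomputes them from the values in the tables above. The helper name `gain_points` is illustrative.

```python
def gain_points(old, new):
    """Absolute improvement in percentage points, rounded to one decimal."""
    return round((new - old) * 100, 1)


mirror_iou = {
    "MSD": {"SAM2-UNet": 0.918, "SAM3-UNet": 0.943},
    "PMD": {"SAM2-UNet": 0.728, "SAM3-UNet": 0.804},
}
iou_gains = {
    name: gain_points(scores["SAM2-UNet"], scores["SAM3-UNet"])
    for name, scores in mirror_iou.items()
}
s_alpha_gain = gain_points(0.884, 0.895)  # DUT-OMRON S-alpha

print(iou_gains)     # {'MSD': 2.5, 'PMD': 7.6}
print(s_alpha_gain)  # 1.1
```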

6. Ablations and Design Analysis

Preliminary experiments substantiate several architectural decisions:

  • The chosen adapter bottleneck size gives the best trade-off between parameter efficiency and performance among the sizes compared.
  • Substituting the lightweight depthwise-split decoder block with a standard two-conv block slows training convergence by 20% and increases memory usage by 30%.
  • The four-level decoding hierarchy ($F_1$ through $F_4$) consistently outperforms reduced-depth variants by 0.5–1.0% IoU at negligible additional cost.

7. Implementation Details and Code Access

SAM3-UNet is implemented in PyTorch. The official repository is available at https://github.com/WZH0120/SAM3-UNet. Reproduction involves standard procedures: cloning the repository, installing the listed packages, configuring dataset paths, and executing train/evaluation scripts. The package includes evaluation and visualization tools.


SAM3-UNet achieves state-of-the-art accuracy for mirror and salient object detection by leveraging a frozen SAM3 backbone, minimal task adapters, and a modern U-Net–style decoder. The model operates efficiently within 6 GB GPU memory during training with a batch size of 12, allowing rapid fine-tuning on commodity hardware (Xiong et al., 1 Dec 2025).
