SAM3-UNet: Efficient Fine-Tuning for Dense Tasks
- The paper presents SAM3-UNet, a model that fine-tunes a pre-trained SAM3 encoder using minimal adapters and a U-Net–style decoder for tasks such as mirror and salient object detection.
- It pairs a frozen ViT-style backbone of 446M parameters with 24 lightweight adapter modules, enabling task adaptation while maintaining a low compute and memory footprint.
- Empirical results show notable improvements in IoU and other metrics over SAM2-UNet, demonstrating its efficiency and practical applicability for dense prediction tasks on commodity hardware.
SAM3-UNet is a simplified adaptation of Segment Anything Model 3 (SAM3), designed to enable parameter-efficient fine-tuning of SAM3 for downstream dense prediction tasks such as mirror detection and salient object detection. The architecture combines the frozen perception capabilities of SAM3’s substantial ViT-style image encoder with lightweight trainable adapters and an efficient U-Net–style decoder, yielding strong performance while maintaining a low compute and memory footprint (Xiong et al., 1 Dec 2025).
1. Architecture
The model pipeline begins with an input image processed by a frozen SAM3 image encoder, a ViT-style backbone with $446$ million parameters. To allow task adaptation, a small trainable adapter is inserted before each of the $24$ transformer blocks in the encoder, while all ViT weights remain frozen. The encoder's output token grid is projected through convolutions into four parallel feature maps, each with $128$ channels, and bilinear up- and downsampling establish a four-level hierarchical feature representation $F_1, \dots, F_4$ spanning four spatial scales.
These feature maps are consumed by a four-stage, U-Net–style decoder, where each upsampling stage integrates skip connections from the encoder via concatenation and further feature mixing; a hedged sketch of the pyramid construction is given below.
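The following PyTorch sketch illustrates how such a four-level pyramid can be built on top of the frozen encoder's token grid. The module name `SimpleFPN`, the $1\times1$ projections, the 1024-dim token width, and the exact resampling factors are illustrative assumptions, not details confirmed by the paper.

```python
# Hypothetical sketch: four 128-channel feature maps from one frozen token grid.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    def __init__(self, embed_dim: int, out_ch: int = 128):
        super().__init__()
        # One pointwise projection per pyramid level (assumed 1x1 convs).
        self.proj = nn.ModuleList([nn.Conv2d(embed_dim, out_ch, 1) for _ in range(4)])
        # Bilinear resampling factors relative to the token grid (assumed).
        self.scales = (4.0, 2.0, 1.0, 0.5)

    def forward(self, tokens: torch.Tensor):  # tokens: (B, embed_dim, h, w)
        return [
            F.interpolate(p(tokens), scale_factor=s, mode="bilinear", align_corners=False)
            for p, s in zip(self.proj, self.scales)
        ]

# Example with a hypothetical 1024-dim, 64x64 token grid.
feats = SimpleFPN(embed_dim=1024)(torch.randn(1, 1024, 64, 64))
print([tuple(f.shape) for f in feats])  # four 128-channel maps at four scales
```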
2. Adapter and Fine-Tuning Paradigm
Adapters are bottleneck structures inserted before each transformer block for parameter-efficient adaptation. For block $\ell$ with input tokens $z_\ell$ ($\ell = 1, \dots, 24$), the adapter augments the residual pathway as $z_\ell' = z_\ell + \mathrm{Adapter}(z_\ell)$, where $\mathrm{Adapter}(z) = W_{\mathrm{up}}\,\sigma(W_{\mathrm{down}}\,z)$, $W_{\mathrm{down}}$ projects the tokens down to a narrow bottleneck, $W_{\mathrm{up}}$ projects them back to the embedding width, and $\sigma$ is the GELU nonlinearity. Each adapter contains $65,536$ parameters, so the $24$ adapters together contribute roughly $1.6$ million trainable parameters. The decoder and projection heads add a few million more, leaving the total fine-tuned parameter count a small fraction of the $446$-million-parameter model footprint. A minimal sketch of one adapter follows.
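A minimal sketch of one such adapter, assuming bias-free projections and an embedding/bottleneck pair $d=1024$, $r=32$ chosen only so that $2dr$ matches the stated $65,536$ parameters per adapter:

```python
# Hedged sketch of a single bottleneck adapter on the residual pathway.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d: int, r: int):
        super().__init__()
        self.down = nn.Linear(d, r, bias=False)  # W_down: d -> r
        self.up = nn.Linear(r, d, bias=False)    # W_up:   r -> d
        self.act = nn.GELU()                     # sigma

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z' = z + W_up( GELU( W_down(z) ) )
        return z + self.up(self.act(self.down(z)))

adapter = Adapter(d=1024, r=32)  # d and r are assumptions; 2*1024*32 = 65,536
print(sum(p.numel() for p in adapter.parameters()))  # 65536; x24 blocks ~ 1.57M
```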
3. Lightweight U-Net-Style Decoder
The decoder employs a “bottleneck + depthwise split” block at each upsampling stage. For an input feature map $X$:
- Apply Conv, BN, GELU to reduce $X$ to a bottleneck width of $C$ channels.
- Split along the channel dimension: $X = [X_1, X_2]$.
- $X_1$ is processed by two sequential DWConv, BN, GELU layers, yielding intermediate features $X_1'$ and $X_1''$.
- The resulting features are concatenated and mapped back to the full output channel count via Conv, BN, GELU.
Each decoder stage upsamples its input, concatenates it with the corresponding encoder feature $F_i$, and processes the result via the above block, preserving the multi-scale semantic structure established by the backbone; a hedged sketch of this block follows.
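A hedged reconstruction of the block, assuming a pointwise bottleneck, an even channel split, $3\times3$ depthwise convolutions, and concatenation of the untouched half with both depthwise outputs; the summary does not pin down these exact choices.

```python
# Illustrative "bottleneck + depthwise split" decoder block.
import torch
import torch.nn as nn

def conv_bn_act(cin: int, cout: int, k: int = 1, groups: int = 1) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, padding=k // 2, groups=groups, bias=False),
        nn.BatchNorm2d(cout),
        nn.GELU(),
    )

class DWSplitBlock(nn.Module):
    def __init__(self, cin: int, cmid: int, cout: int):
        super().__init__()
        half = cmid // 2
        self.reduce = conv_bn_act(cin, cmid)                  # bottleneck reduction
        self.dw1 = conv_bn_act(half, half, k=3, groups=half)  # depthwise stage 1
        self.dw2 = conv_bn_act(half, half, k=3, groups=half)  # depthwise stage 2
        self.fuse = conv_bn_act(half * 3, cout)               # back to cout channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)
        x1, x2 = x.chunk(2, dim=1)  # channel split
        y1 = self.dw1(x1)
        y2 = self.dw2(y1)
        return self.fuse(torch.cat([x2, y1, y2], dim=1))

# One decoder stage input: upsampled features concatenated with a 128-ch skip.
block = DWSplitBlock(cin=256, cmid=128, cout=128)  # widths are assumptions
print(block(torch.randn(1, 256, 88, 88)).shape)    # torch.Size([1, 128, 88, 88])
```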
4. Training Protocol and Resource Profiling
Experiments span two primary tasks: mirror detection (datasets: MSD, PMD) and salient object detection (trained on DUTS-TR, evaluated on DUTS-TE, DUT-OMRON, HKU-IS, PASCAL-S, and ECSSD). The loss is the sum of weighted binary cross-entropy and weighted IoU terms with class-balancing pixel weights, following F³Net (AAAI 2020); a sketch is given below. Training employs AdamW with weight decay and a cosine learning-rate schedule, batch size $12$, 20 epochs, and a fixed input resolution; augmentations are random horizontal and vertical flips. On a single NVIDIA RTX 4090 GPU (24 GB), total GPU memory consumption stays below 6 GB at batch size 12, a consequence of the frozen main encoder, compact adapters, and efficient decoder block design.
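A compact sketch of this loss as popularized by F³Net's reference implementation, where each pixel is weighted by how strongly its local neighborhood deviates from the ground truth; the pooling kernel and weighting constant below are the values commonly used in that implementation, assumed rather than confirmed here.

```python
# F3Net-style structure loss: weighted BCE + weighted IoU.
import torch
import torch.nn.functional as F

def structure_loss(logits: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """logits: raw predictions (B,1,H,W); mask: binary ground truth (B,1,H,W)."""
    # Pixels whose neighborhood disagrees with the label get larger weights.
    weight = 1 + 5 * torch.abs(
        F.avg_pool2d(mask, kernel_size=31, stride=1, padding=15) - mask
    )
    bce = F.binary_cross_entropy_with_logits(logits, mask, reduction="none")
    wbce = (weight * bce).sum(dim=(2, 3)) / weight.sum(dim=(2, 3))

    prob = torch.sigmoid(logits)
    inter = (prob * mask * weight).sum(dim=(2, 3))
    union = ((prob + mask) * weight).sum(dim=(2, 3))
    wiou = 1 - (inter + 1) / (union - inter + 1)
    return (wbce + wiou).mean()
```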
5. Empirical Results
SAM3-UNet surpasses both its immediate predecessor, SAM2-UNet, and other state-of-the-art methods across target tasks.
Mirror Detection (MSD and PMD):
| Method | MSD IoU | MSD F | MSD MAE | PMD IoU | PMD F | PMD MAE |
|---|---|---|---|---|---|---|
| SAM2-UNet | 0.918 | 0.957 | 0.022 | 0.728 | 0.826 | 0.027 |
| SAM3-UNet | 0.943 | 0.972 | 0.014 | 0.804 | 0.884 | 0.017 |
Salient Object Detection (selected datasets; each cell reports Sα / Eφ / MAE):
| Method | DUTS-TE | DUT-OMRON | HKU-IS | PASCAL-S | ECSSD |
|---|---|---|---|---|---|
| SAM2-UNet | 0.934/0.959/0.020 | 0.884/0.912/0.039 | 0.941/0.971/0.019 | 0.894/0.931/0.043 | 0.950/0.970/0.020 |
| SAM3-UNet | 0.936/0.964/0.019 | 0.895/0.921/0.034 | 0.939/0.968/0.020 | 0.904/0.939/0.038 | 0.950/0.970/0.019 |
Performance gains are observed on all mirror-detection metrics, with SAM3-UNet improving IoU by +2.5 points on MSD and +7.6 points on PMD. On salient object detection, it matches or outperforms the prior model on four of the five held-out datasets, including a +1.1-point Sα gain on DUT-OMRON.
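For concreteness, minimal reference implementations of two of the reported metrics, IoU and MAE; the S-measure ($S_\alpha$) and E-measure ($E_\phi$) follow their standard definitions and are omitted for brevity.

```python
# Simple per-image IoU and MAE for mirror/saliency masks.
import torch

def iou(pred: torch.Tensor, gt: torch.Tensor, thresh: float = 0.5) -> float:
    """pred: soft map in [0,1], gt: binary mask; both (H, W)."""
    p, g = pred > thresh, gt > 0.5
    inter = (p & g).sum().item()
    union = (p | g).sum().item()
    return inter / union if union else 1.0

def mae(pred: torch.Tensor, gt: torch.Tensor) -> float:
    """Mean absolute error between the soft prediction and the binary mask."""
    return (pred - gt.float()).abs().mean().item()
```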
6. Ablations and Design Analysis
Preliminary experiments substantiate several architectural decisions:
- The chosen adapter bottleneck width achieves the best parameter-efficiency and performance trade-off among the alternatives evaluated.
- Substituting the lightweight depthwise-split decoder block with a standard two-conv block results in 20% slower training convergence and 30% increased memory usage.
- The four-level decoding hierarchy $F_1$ through $F_4$ consistently outperforms reduced-depth variants by 0.5–1.0% IoU at negligible additional cost.
7. Implementation Details and Code Access
SAM3-UNet is implemented in PyTorch. The official repository is available at https://github.com/WZH0120/SAM3-UNet. Reproduction follows standard procedure: cloning the repository, installing the listed packages, configuring dataset paths, and running the training and evaluation scripts. The package includes evaluation and visualization tools.
SAM3-UNet achieves state-of-the-art accuracy for mirror and salient object detection by leveraging a frozen SAM3 backbone, minimal task adapters, and a modern U-Net–style decoder. The model trains within 6 GB of GPU memory at batch size 12, enabling rapid fine-tuning on commodity hardware (Xiong et al., 1 Dec 2025).