RFMedSAM 2: Automated Medical Segmentation
- The paper presents RFMedSAM 2, which integrates a 3D UNet for automated prompt generation with adapterized SAM2 and a dual-stage refinement pipeline to boost Dice scores.
- It employs lightweight DWConvAdapter and CNN-Adapters that add less than 1% extra parameters, enhancing spatial inductive bias and multi-modal feature adaptation.
- The system achieves state-of-the-art performance on the BTCV and AMOS 2022 datasets, with the staged pipeline raising Dice from 0.856 to 0.867 on BTCV and from 0.895 to 0.907 on AMOS 2022 in multi-organ CT segmentation.
RFMedSAM 2 is an automated volumetric medical image segmentation system that integrates a prompt-generating 3D UNet with “adapterized” Segment Anything Model 2 (SAM 2) and a dual-stage refinement pipeline, achieving state-of-the-art segmentation accuracy on multi-organ CT datasets. It addresses prompt sensitivity and binary mask limitations of the original SAM/SAM 2 frameworks by coupling deep supervision, spatially inductive adapters, prompt learning, and temporal refinement.
1. Architecture and Fine-Tuning Adaptation Mechanisms
RFMedSAM 2 builds upon SAM 2, which comprises an image encoder (Hiera), prompt encoder, and mask decoder, supplemented with a memory encoder utilizing transformer-based attention for temporal information fusion. The adaptation in RFMedSAM 2 is implemented via two categories of lightweight, trainable modules: Depth-wise-Convolutional Adapters (DWConvAdapter) and CNN-Adapters.
DWConvAdapter modules are integrated after each image embedding attention block in:
- Hiera encoder transformer blocks
- Memory-attention transformer blocks
- Mask decoder transformer blocks
Each DWConvAdapter applies LayerNorm, a linear projection, a point-wise convolution, a depth-wise convolution, GeLU activation, a second LayerNorm, another point-wise convolution, and a residual skip that preserves the original embeddings. At the channel widths used in the paper, a typical DWConvAdapter introduces approximately 2,816 parameters.
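A minimal PyTorch sketch of such an adapter is given below; the bottleneck width, the 3×3 depth-wise kernel, and the token-to-spatial reshaping are assumptions for illustration rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DWConvAdapter(nn.Module):
    """Sketch of a DWConvAdapter following the description above (hidden width is an assumption)."""
    def __init__(self, dim: int, hidden: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.down = nn.Linear(dim, hidden)                                    # linear down-projection
        self.pw1 = nn.Conv2d(hidden, hidden, kernel_size=1)                   # point-wise convolution
        self.dw = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                            groups=hidden)                                    # depth-wise convolution
        self.act = nn.GELU()
        self.norm2 = nn.LayerNorm(hidden)
        self.pw2 = nn.Conv2d(hidden, dim, kernel_size=1)                      # point-wise conv back to dim

    def forward(self, x, H, W):
        # x: (B, H*W, dim) token embeddings from the preceding attention block
        B, N, C = x.shape
        h = self.down(self.norm1(x))                                          # (B, N, hidden)
        h = h.transpose(1, 2).reshape(B, -1, H, W)                            # restore spatial layout
        h = self.act(self.dw(self.pw1(h)))
        h = h.flatten(2).transpose(1, 2)                                      # back to tokens for LayerNorm
        h = self.norm2(h)
        h = h.transpose(1, 2).reshape(B, -1, H, W)
        h = self.pw2(h)
        h = h.flatten(2).transpose(1, 2)                                      # (B, N, dim)
        return x + h                                                          # residual skip
```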
CNN-Adapters are inserted after convolutional layers in:
- Encoder’s FPN (Feature Pyramid Network)
- Memory-encoder’s CNN blocks
- Mask-decoder’s final convolutional refinements
Each CNN-Adapter consists of a channel-reducing convolution, a depth-wise convolution, GeLU activation, LayerNorm, a channel-restoring convolution, and a residual skip, adding only a small number of parameters per module.
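A corresponding sketch of a CNN-Adapter, under the same caveats (the reduction ratio and the use of GroupNorm as a channel-wise LayerNorm substitute for convolutional features are assumptions):

```python
import torch.nn as nn

class CNNAdapter(nn.Module):
    """Sketch of a CNN-Adapter; the reduction ratio is an assumption."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.reduce = nn.Conv2d(channels, hidden, kernel_size=1)              # channel reduction
        self.dw = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                            groups=hidden)                                    # depth-wise convolution
        self.act = nn.GELU()
        self.norm = nn.GroupNorm(1, hidden)                                   # LayerNorm-style channel norm (assumption)
        self.restore = nn.Conv2d(hidden, channels, kernel_size=1)             # channel restoration

    def forward(self, x):
        # x: (B, channels, H, W) feature map from the preceding convolutional layer
        return x + self.restore(self.norm(self.act(self.dw(self.reduce(x)))))
```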
Together, the adapters contribute less than 1% additional parameters to the frozen 300M parameter SAM 2 backbone, yet significantly enhance spatial inductive bias and adaptation to multi-modal medical features (Xie et al., 4 Feb 2025).
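In practice, adapter-based fine-tuning of this kind freezes the pretrained backbone and marks only the adapter parameters as trainable; the name-matching rule below is a hypothetical convention for the sketch, not the paper's code.

```python
def set_adapter_only_trainable(model):
    """Freeze the SAM 2 backbone; train only adapter modules (assumes 'adapter' appears in their parameter names)."""
    trainable, total = 0, 0
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name.lower()
        trainable += param.numel() * param.requires_grad
        total += param.numel()
    print(f"trainable: {trainable / total:.2%} of {total:,} parameters")
```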
2. UNet-Based Prompt Generator
To mitigate precise prompt dependency, RFMedSAM 2 incorporates a standalone 3D UNet for fully automated prompt generation.
- Input: A fixed-size, single-channel patch of CT intensities cropped from the volume.
- Encoder: Four down-sampling stages that double the channel count at each stage from 32 up to 256, using convolutions, ReLU, instance normalization, and max-pooling.
- Decoder: Four stages that progressively halve the channel count, using transposed convolutions, skip connections, convolutions, ReLU, and instance normalization.
- Output: An $N_c$-channel soft mask volume, where $N_c$ is the organ class count (e.g., 13 for BTCV).
- Deep supervision: Auxiliary convolutional outputs at each decoder level, enabling multi-resolution supervision.
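The following compact PyTorch sketch illustrates such a prompt-generating 3D UNet with deep-supervision heads; kernel sizes, the exact stage layout, and the head placement are simplifying assumptions.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3x3 convolutions with instance-norm and ReLU (kernel size is an assumption)."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, 3, padding=1), nn.InstanceNorm3d(out_ch), nn.ReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, 3, padding=1), nn.InstanceNorm3d(out_ch), nn.ReLU(inplace=True),
    )

class PromptUNet3D(nn.Module):
    """Minimal 3D UNet prompt generator with deep supervision, per the description above."""
    def __init__(self, n_classes, widths=(32, 64, 128, 256)):
        super().__init__()
        self.encoders = nn.ModuleList()
        in_ch = 1                                        # single-channel CT input
        for w in widths:
            self.encoders.append(conv_block(in_ch, w))
            in_ch = w
        self.pool = nn.MaxPool3d(2)
        self.upsamples = nn.ModuleList()
        self.decoders = nn.ModuleList()
        self.heads = nn.ModuleList()                     # auxiliary outputs for deep supervision
        for w_hi, w_lo in zip(widths[::-1][:-1], widths[::-1][1:]):
            self.upsamples.append(nn.ConvTranspose3d(w_hi, w_lo, 2, stride=2))
            self.decoders.append(conv_block(2 * w_lo, w_lo))
            self.heads.append(nn.Conv3d(w_lo, n_classes, 1))

    def forward(self, x):
        skips = []
        for i, enc in enumerate(self.encoders):
            x = enc(x)
            if i < len(self.encoders) - 1:
                skips.append(x)
                x = self.pool(x)
        outputs = []
        for up, dec, head, skip in zip(self.upsamples, self.decoders, self.heads, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))
            outputs.append(head(x))
        return outputs[::-1]                             # finest-resolution logits first; softmax yields soft masks
```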
Bounding boxes for each organ are extracted from the generated masks as axis-aligned boxes and converted into SAM 2's bounding-box and point prompts on a per-slice basis.
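The box-extraction step can be illustrated as below; the thresholding value and the per-slice $(x_{\min}, y_{\min}, x_{\max}, y_{\max})$ format are assumptions made for the sketch.

```python
import numpy as np

def bboxes_from_masks(soft_masks, threshold=0.5):
    """Extract per-class, per-slice axis-aligned bounding boxes from an (Nc, D, H, W) soft mask volume."""
    boxes = {}
    for c in range(soft_masks.shape[0]):
        binary = soft_masks[c] > threshold
        per_slice = []
        for z in range(binary.shape[0]):
            ys, xs = np.nonzero(binary[z])
            if len(xs) == 0:
                per_slice.append(None)                                   # organ absent on this slice
            else:
                per_slice.append((xs.min(), ys.min(), xs.max(), ys.max()))  # (x_min, y_min, x_max, y_max)
        boxes[c] = per_slice
    return boxes
```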
Loss at each output resolution $r$ is defined as:

$$\mathcal{L}_r = \mathcal{L}_{\mathrm{CE}}(P_r, Y_r) + \mathcal{L}_{\mathrm{Dice}}(P_r, Y_r)$$

The total deep-supervised loss is:

$$\mathcal{L}_{\mathrm{DS}} = \sum_{r} w_r \, \mathcal{L}_r$$

with $w_r$ the per-resolution weight, $P_r$ the prediction at resolution $r$, and $Y_r$ the correspondingly down-sampled ground truth.
3. Dual-Stage Prompt and Mask Refinement Pipeline
RFMedSAM 2 employs a three-step volumetric segmentation pipeline (UNet prompting followed by two SAM 2 refinement stages):
- Step 0: UNet Mask Prediction
- Input: CT volume (or sub-volume)
- Output: Organ masks $M_0$ and derived bounding boxes $B_0$
- Step 1: SAM 2 Mask Generation
- Prompts: Bounding boxes $B_0$ and mask center points
- Output: Refined masks $M_1$, updated object-pointer scores, and new bounding boxes $B_1$
- Step 2: Memory-Attention Refinement
- Prompts: $B_1$ plus memory features from up to six preceding slices
- Output: Final mask $M_2$
Empirically, this pipeline improves Dice at each stage; for BTCV, Dice progresses 0.856 → 0.864 → 0.867, and for AMOS2022, 0.895 → 0.898 → 0.907.
Sample inference pseudocode:
```python
def rfmedsam2_inference(input_patch, prev6_frames):
    # Step 0: automated prompt generation with the 3D UNet
    M0 = UNet3D(input_patch)
    B0 = bboxes_from_masks(M0)
    # Step 1: first SAM 2 pass, prompted by the UNet-derived boxes
    mask1, obj_scores = SAM2_stage1(input_patch, prompts=B0)
    B1 = bboxes_from_masks(mask1)
    # Step 2: memory-attention refinement using features from up to six preceding slices
    mask2 = SAM2_stage2(input_patch, prompts=B1, memory=prev6_frames)
    return mask2
```
4. Datasets, Preprocessing, and Training Protocols
Experiments are conducted on BTCV and AMOS 2022 datasets.
BTCV: 30 abdominal CT volumes, 13 organs, 24 train / 4 validation, no standard test set. Volumes are resampled to a common voxel spacing, intensities are clipped to an abdominal HU window and normalized, and patches are extracted by cropping or a sliding window.
AMOS 2022: 200 CT scans, 16 organs, five-fold cross-validation, challenge test set withheld. Preprocessing matches BTCV.
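A minimal sketch of the CT preprocessing described above; the HU window and normalization scheme shown here are illustrative assumptions, since the exact values are not reproduced in this summary.

```python
import numpy as np

def preprocess_ct(volume, hu_window=(-175, 250)):
    """Clip CT intensities to an abdominal HU window and z-score normalize (window values are illustrative)."""
    v = np.clip(volume, *hu_window).astype(np.float32)
    return (v - v.mean()) / (v.std() + 1e-8)
```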
Training utilizes SGD (momentum = 0.99) with weight decay, an initial learning rate $\eta_0$, and a polynomially decaying ("poly") learning-rate schedule:

$$\eta_t = \eta_0 \left(1 - \frac{t}{T}\right)^{p},$$

where $T$ is the total number of training epochs and $p$ the decay exponent.
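The poly schedule can be computed per epoch as in the sketch below; the exponent of 0.9 is an assumption reflecting the common nnUNet default rather than a value stated here.

```python
def poly_lr(initial_lr, epoch, max_epochs, exponent=0.9):
    """Polynomial learning-rate decay (exponent value is an assumed default)."""
    return initial_lr * (1 - epoch / max_epochs) ** exponent
```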
Batch size per GPU is 1 (limited by 3D memory). Extensive on-the-fly augmentations are employed: random rotation/scaling, Gaussian noise/blur, contrast/brightness jitter, gamma variation, simulated low resolution, and random spatial mirroring.
5. Quantitative Results and Ablations
RFMedSAM 2 achieves state-of-the-art results across key metrics.
AMOS 2022:
- nnUNet (5-fold cv): 0.878
- RFMedSAM 2 (learnable prompts, no GT): 0.907 (+2.9 Dice points over nnUNet)
- Other prompt-free SAMs: SAMed 0.772, SAM3D 0.751
BTCV:
- nnUNet: 0.802
- RFMedSAM 2 w/ GT boxes: 0.923 (+12 Dice points over nnUNet)
- RFMedSAM 2 (learnable UNet masks, two-stage SAM 2): 0.867 (+6.5 Dice points over nnUNet)
- Other prompt-free SAMs: SAMed 0.776, SAM3D 0.801
Key ablations reveal the importance of frame selection (current + 6 previous slices: +0.84% DSC), DWConvAdapter insertion (+0.47% DSC), CNN-Adapter insertion (+0.25% DSC), and prompt generation strategy: learnable U-Net masks outperform learnable bounding boxes and variants without object-score supervision.
Segmentation quality is tightly linked to prompt generation and memory refinement stages, with each step contributing incremental improvements.
6. Mathematical Formulation of Segmentation Metrics and Losses
The Dice Similarity Coefficient (DSC) for a predicted binary volume $P$ and ground truth $G$ is:

$$\mathrm{DSC}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}$$

Operatively, for predicted soft maps $p_i$ and ground-truth labels $g_i$ over voxels $i$:

$$\mathrm{DSC} = \frac{2 \sum_i p_i\, g_i}{\sum_i p_i + \sum_i g_i}$$
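A direct translation of the soft Dice computation into PyTorch, with a small epsilon added for numerical stability (an implementation detail assumed here):

```python
import torch

def dice_coefficient(pred, target, eps=1e-6):
    """Soft Dice between a predicted probability map and a binary ground-truth volume of the same shape."""
    intersection = (pred * target).sum()
    return (2 * intersection + eps) / (pred.sum() + target.sum() + eps)
```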
Loss functions include multi-class cross-entropy and soft Dice. For predicted probabilities $p_{i,c}$, one-hot ground truth $g_{i,c}$, voxels $i = 1,\dots,N$, and classes $c = 1,\dots,C$:

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} g_{i,c} \log p_{i,c}, \qquad
\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{1}{C} \sum_{c=1}^{C} \frac{2 \sum_i p_{i,c}\, g_{i,c}}{\sum_i p_{i,c} + \sum_i g_{i,c}}$$

Combined supervised loss at each resolution $r$:

$$\mathcal{L}_r = \mathcal{L}_{\mathrm{CE}}(P_r, Y_r) + \mathcal{L}_{\mathrm{Dice}}(P_r, Y_r)$$

With deep supervision over $R$ resolutions:

$$\mathcal{L}_{\mathrm{DS}} = \sum_{r=0}^{R-1} w_r \, \mathcal{L}_r,$$

where $w_r$ is the weight assigned to resolution $r$ (typically halved at each coarser level) and $\sum_r w_r = 1$.
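A sketch combining these terms into the deep-supervised training loss; the halving-then-normalizing weight scheme is an assumption consistent with common deep-supervision practice, not a confirmed detail of the paper.

```python
import torch
import torch.nn.functional as F

def deep_supervised_loss(outputs, targets, weights=None):
    """Combined CE + soft-Dice loss with deep supervision.

    `outputs` is a list of (B, C, D, H, W) logits at decreasing resolutions,
    `targets` a list of matching (B, D, H, W) integer label maps.
    """
    if weights is None:
        weights = [2.0 ** -r for r in range(len(outputs))]      # halve the weight at each coarser level
        weights = [w / sum(weights) for w in weights]            # normalize so the weights sum to 1
    total = 0.0
    for w, logits, labels in zip(weights, outputs, targets):
        ce = F.cross_entropy(logits, labels)
        probs = torch.softmax(logits, dim=1)
        onehot = F.one_hot(labels, num_classes=logits.shape[1]).movedim(-1, 1).float()
        dims = tuple(range(2, probs.ndim))                       # sum over spatial dimensions
        inter = (probs * onehot).sum(dims)
        denom = probs.sum(dims) + onehot.sum(dims)
        dice = 1 - ((2 * inter + 1e-6) / (denom + 1e-6)).mean()
        total = total + w * (ce + dice)
    return total
```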
7. Context, Significance, and Implications
RFMedSAM 2 demonstrates that the prompt dependency and lack of semantic output in SAM/SAM 2 are surmountable through joint automated prompt generation, adapter-based fine-tuning, and temporal refinement. The approach achieves Dice scores of 0.87–0.92, substantially exceeding prior state-of-the-art benchmarks for volumetric CT organ segmentation, while requiring minimal additional learnable parameters.
A plausible implication is that further integration of multi-modal data and more sophisticated prompt-generation modules could extend the system’s applicability beyond segmenting abdominal CT to other volumetric medical image modalities. This suggests the methodology is robust and adaptable, pending careful calibration of the inductive bias and prompt supervision mechanisms.
RFMedSAM 2 provides a reproducible baseline for adapter-based application of prompt-driven foundation models to clinical segmentation, as substantiated by comprehensive ablation studies, explicit loss definitions, and established dataset protocols (Xie et al., 4 Feb 2025).