
RFMedSAM 2: Automated Medical Segmentation

Updated 1 December 2025
  • The paper presents RFMedSAM 2, which integrates a 3D UNet for automated prompt generation with adapterized SAM2 and a dual-stage refinement pipeline to boost Dice scores.
  • It employs lightweight DWConvAdapter and CNN-Adapters that add less than 1% extra parameters, enhancing spatial inductive bias and multi-modal feature adaptation.
  • The system achieves state-of-the-art performance on the BTCV and AMOS 2022 datasets, with the staged pipeline raising Dice from 0.856 to 0.867 on BTCV and from 0.895 to 0.907 on AMOS 2022 for multi-organ CT segmentation.

RFMedSAM 2 is an automated volumetric medical image segmentation system that integrates a prompt-generating 3D UNet with “adapterized” Segment Anything Model 2 (SAM 2) and a dual-stage refinement pipeline, achieving state-of-the-art segmentation accuracy on multi-organ CT datasets. It addresses prompt sensitivity and binary mask limitations of the original SAM/SAM 2 frameworks by coupling deep supervision, spatially inductive adapters, prompt learning, and temporal refinement.

1. Architecture and Fine-Tuning Adaptation Mechanisms

RFMedSAM 2 builds upon SAM 2, which comprises an image encoder (Hiera), prompt encoder, and mask decoder, supplemented with a memory encoder and transformer-based memory attention for temporal information fusion. The adaptation in RFMedSAM 2 is implemented via two categories of lightweight, trainable modules: Depth-wise-Convolutional Adapters (DWConvAdapter) and CNN-Adapters.

DWConvAdapter modules are integrated after each image embedding attention block in:

  • Hiera encoder transformer blocks
  • Memory-attention transformer blocks
  • Mask decoder transformer blocks

Each DWConvAdapter applies LayerNorm, a linear projection, a point-wise convolution, a 3×3 depth-wise convolution, GeLU activation, LayerNorm, another point-wise convolution, and a residual skip that preserves the original embeddings. For C = 256 channels, a typical DWConvAdapter introduces approximately 2,816 parameters.
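For concreteness, the following is a minimal PyTorch sketch of a DWConvAdapter following the operation order above; the class name, the bottleneck width (hidden=8), and the channels-last token-grid layout are illustrative assumptions, not the authors' implementation.

import torch.nn as nn

class DWConvAdapter(nn.Module):
    # LayerNorm -> linear projection -> point-wise conv -> 3x3 depth-wise conv
    # -> GELU -> LayerNorm -> point-wise conv -> residual skip (widths illustrative).
    def __init__(self, channels=256, hidden=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(channels)
        self.proj = nn.Linear(channels, hidden)                     # linear projection
        self.pw1 = nn.Conv2d(hidden, hidden, kernel_size=1)         # point-wise conv
        self.dw = nn.Conv2d(hidden, hidden, kernel_size=3,
                            padding=1, groups=hidden)               # depth-wise 3x3 conv
        self.act = nn.GELU()
        self.norm2 = nn.LayerNorm(hidden)
        self.pw2 = nn.Conv2d(hidden, channels, kernel_size=1)       # restore channel count

    def forward(self, x):                                           # x: (B, H, W, C) tokens
        y = self.proj(self.norm1(x)).permute(0, 3, 1, 2)            # to (B, hidden, H, W)
        y = self.act(self.dw(self.pw1(y)))
        y = self.norm2(y.permute(0, 2, 3, 1))                       # back to channels-last
        y = self.pw2(y.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        return x + y                                                # residual skip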

CNN-Adapters are inserted after convolutional layers in:

  • Encoder’s FPN (Feature Pyramid Network)
  • Memory-encoder’s CNN blocks
  • Mask-decoder’s final convolutional refinements

Each CNN-Adapter consists of a 1×1 convolution for channel reduction, a 3×3 depth-wise convolution, GeLU activation, LayerNorm, a 1×1 convolution for channel restoration, and a residual skip. This adds roughly C²/2 + 2.25C parameters per module.
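As a rough check of this formula, for C = 256 it gives 256²/2 + 2.25·256 = 32,768 + 576 ≈ 33.3k parameters per CNN-Adapter, consistent with the sub-1% overhead noted below.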

Together, the adapters contribute less than 1% additional parameters to the frozen ~300M-parameter SAM 2 backbone, yet significantly enhance spatial inductive bias and adaptation to multi-modal medical features (Xie et al., 4 Feb 2025).

2. UNet-Based Prompt Generator

To mitigate precise prompt dependency, RFMedSAM 2 incorporates a standalone 3D UNet for fully automated prompt generation.

  • Input: A patch of shape (D, H, W) (e.g., D = 32, H = W = 256), single-channel CT intensities.
  • Encoder: Four down-sampling stages, each doubling the channel count from 32 up to 256, with 3×3×3 convolutions, ReLU, instance normalization, and 2×2×2 max-pooling (see the stage sketch after this list).
  • Decoder: Four stages, progressively halving channels, using transposed convolutions, skip connections, 3×3×3 convolutions, ReLU, and instance normalization.
  • Output: An N_c-channel soft mask volume of shape (N_c, D, H, W), where N_c is the organ class count (e.g., 13 for BTCV).
  • Deep supervision: Auxiliary 1×1×1 convolutional outputs at each decoder level, enabling multi-resolution supervision.
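A minimal sketch of one encoder stage under these choices follows; the function name and the use of two convolutions per stage are assumptions made for illustration.

import torch.nn as nn

def unet_encoder_stage(in_ch, out_ch):
    # One down-sampling stage: 3x3x3 convs with ReLU and instance norm, then 2x2x2 max-pooling.
    # Encoder channel progression described above: 32 -> 64 -> 128 -> 256.
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.InstanceNorm3d(out_ch),
        nn.Conv3d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.InstanceNorm3d(out_ch),
        nn.MaxPool3d(kernel_size=2),
    )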

Bounding boxes for each organ are extracted from the generated masks as axis-aligned boxes in (z, y, x) coordinates, then converted into SAM 2's bounding-box and point prompts per slice.
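A minimal sketch of this box extraction, assuming `mask` is a (D, H, W) integer label volume and boxes are returned as (z0, y0, x0, z1, y1, x1) tuples; the default class count and the NumPy-based implementation are assumptions, and per-slice conversion into SAM 2 prompts would follow.

import numpy as np

def bboxes_from_masks(mask, num_classes=13):
    # Axis-aligned 3D bounding box per organ class (class 0 is background; 13 organs for BTCV).
    boxes = {}
    for c in range(1, num_classes + 1):
        zs, ys, xs = np.nonzero(mask == c)
        if zs.size == 0:
            continue  # organ not present in this patch
        boxes[c] = (zs.min(), ys.min(), xs.min(), zs.max(), ys.max(), xs.max())
    return boxes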

Loss at each output resolution rr is defined as:

L_r = L_{CE}(p^r, y^r) + L_{Dice}(p^r, y^r)

The total deep-supervised loss is:

L = \sum_{r=1}^{R} w_r L_r

with w_r \propto 1/2^{r-1} (normalized so that \sum_r w_r = 1) and y^r the down-sampled ground truth.
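A minimal sketch of this weighting, assuming the per-resolution losses L_r (CE plus soft Dice) have already been computed into a list ordered from full resolution to coarsest; the function name is illustrative.

def deep_supervised_loss(losses):
    # losses[0] is the full-resolution loss; weights halve at each coarser level,
    # then are normalized to sum to 1 as described above.
    raw = [1.0 / (2 ** r) for r in range(len(losses))]
    total = sum(raw)
    return sum((w / total) * l for w, l in zip(raw, losses))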

3. Dual-Stage Prompt and Mask Refinement Pipeline

RFMedSAM 2 employs a three-step volumetric segmentation pipeline:

  1. Step 0: UNet Mask Prediction
    • Input: CT volume (or sub-volume)
    • Output: Organ masks M^0 → bounding boxes B^0
  2. Step 1: SAM 2 Mask Generation
    • Prompts: B^0, center points
    • Output: Refined masks M^1, updated object-pointer scores, new bounding boxes B^1
  3. Step 2: Memory-Attention Refinement
    • Prompts: B^1 plus up to six preceding slice features
    • Output: Final mask M^2

Empirically, this pipeline improves Dice at each stage; for BTCV, Dice progresses 0.856 → 0.864 → 0.867, and for AMOS2022, 0.895 → 0.898 → 0.907.

Sample inference pseudocode:

def rfmedsam2_inference(input_patch, prev6_frames):
    # Step 0: 3D UNet predicts coarse organ masks and the derived box prompts.
    M0 = UNet3D(input_patch)
    B0 = bboxes_from_masks(M0)

    # Step 1: SAM 2 refines the masks from the automatically generated prompts.
    mask1, obj_scores = SAM2_stage1(input_patch, prompts=B0)
    B1 = bboxes_from_masks(mask1)

    # Step 2: memory-attention refinement conditioned on up to six preceding slices.
    mask2 = SAM2_stage2(input_patch, prompts=B1, memory=prev6_frames)
    return mask2

4. Datasets, Preprocessing, and Training Protocols

Experiments are conducted on BTCV and AMOS 2022 datasets.

BTCV: 30 abdominal CT volumes, 13 organs, 24 train / 4 validation, no standard test set. Volumes are resampled to 1×1×1 mm, intensities are clipped to [-200, 300] HU and normalized, and patches are extracted by cropping or a sliding window.
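A minimal sketch of the intensity preprocessing, assuming z-score normalization after clipping (the exact normalization scheme is an assumption); resampling to 1×1×1 mm would typically be done beforehand with an image-processing library.

import numpy as np

def preprocess_ct(volume_hu):
    # Clip to the stated HU window, then normalize (z-score assumed here).
    v = np.clip(volume_hu.astype(np.float32), -200.0, 300.0)
    return (v - v.mean()) / (v.std() + 1e-8)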

AMOS 2022: 200 CT scans, 16 organs, five-fold cross-validation, challenge test set withheld. Preprocessing matches BTCV.

Training utilizes SGD (momentum = 0.99, weight decay = 3×10⁻⁵), an initial learning rate of 1×10⁻³, and a “poly” schedule:

\mathit{lr}(e) = \mathit{lr}_0 \left(1 - \frac{e}{E_{\max}}\right)^{0.9}, \quad E_{\max} = 1000
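For reference, a one-line sketch of this schedule (epoch e counted from 0; the parameter names are illustrative):

def poly_lr(e, lr0=1e-3, e_max=1000, power=0.9):
    # Polynomial decay of the learning rate from lr0 toward 0 over e_max epochs.
    return lr0 * (1.0 - e / e_max) ** power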

Batch size per GPU is 1 (limited by 3D memory). Extensive on-the-fly augmentations are employed: random rotation/scaling, Gaussian noise/blur, contrast/brightness jitter, gamma variation, simulated low resolution, and random spatial mirroring.

5. Quantitative Results and Ablations

RFMedSAM 2 achieves state-of-the-art results across key metrics.

AMOS 2022:

  • nnUNet (5-fold cv): 0.878
  • RFMedSAM 2 (learnable prompts, no GT): 0.907 (+2.9%)
  • Other prompt-free SAMs: SAMed 0.772, SAM3D 0.751

BTCV:

  • nnUNet: 0.802
  • RFMedSAM 2 w/ GT boxes: 0.923 (+12%)
  • RFMedSAM 2 (learnable U-Net masks, two-stage SAM): 0.867 (+6.5%)
  • Other prompt-free SAMs: SAMed 0.776, SAM3D 0.801

Key ablations reveal the importance of frame selection (current + 6 previous slices: +0.84% DSC), DWConvAdapter insertion (+0.47% DSC), CNN-Adapter insertion (+0.25% DSC), and prompt generation strategy: learnable U-Net masks outperform learnable bounding boxes and variants without object-score supervision.

Segmentation quality is tightly linked to prompt generation and memory refinement stages, with each step contributing incremental improvements.

6. Mathematical Formulation of Segmentation Metrics and Losses

The Dice Similarity Coefficient (DSC) for predicted binary volumes XX and ground truth YY is:

\mathrm{DSC}(X, Y) = \frac{2\,|X \cap Y|}{|X| + |Y|}

Operationally, for predicted soft maps p and ground truth y:

\mathrm{DSC}(p, y) = \frac{2 \sum_i p_i y_i}{\sum_i p_i + \sum_i y_i}

Loss functions include multi-class cross-entropy and soft Dice. For output p, ground truth y, and classes c = 1, …, N_c:

L_{CE}(p, y) = -\sum_{c=1}^{N_c} y_c \log p_c

L_{Dice}(p, y) = 1 - \frac{2\sum_i p_i y_i + \epsilon}{\sum_i p_i + \sum_i y_i + \epsilon}
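A minimal sketch of this soft Dice loss for a single class, assuming p and y are flattened NumPy arrays of predicted probabilities and binary labels (the function name and array format are assumptions):

import numpy as np

def soft_dice_loss(p, y, eps=1e-5):
    # 1 - (2*sum(p*y) + eps) / (sum(p) + sum(y) + eps), matching the formula above.
    inter = np.sum(p * y)
    return 1.0 - (2.0 * inter + eps) / (np.sum(p) + np.sum(y) + eps)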

Combined supervised loss at each resolution:

L_r = L_{CE}(p^r, y^r) + L_{Dice}(p^r, y^r)

With deep supervision over R resolutions:

L = \sum_{r=1}^{R} w_r L_r

where w_r \propto 1/2^{r-1}, normalized so that \sum_r w_r = 1.

7. Context, Significance, and Implications

RFMedSAM 2 demonstrates that the prompt dependency and lack of semantic output in SAM/SAM 2 are surmountable through joint automated prompt generation, adapter-based fine-tuning, and temporal refinement. The approach achieves Dice scores of roughly 0.87–0.92 that substantially exceed prior SOTA benchmarks for volumetric CT organ segmentation, while requiring minimal additional learnable parameters.

A plausible implication is that further integration of multi-modal data and more sophisticated prompt-generation modules could extend the system’s applicability beyond segmenting abdominal CT to other volumetric medical image modalities. This suggests the methodology is robust and adaptable, pending careful calibration of the inductive bias and prompt supervision mechanisms.

RFMedSAM 2 provides a reproducible baseline for adapter-based application of prompt-driven foundation models to clinical segmentation, as substantiated by comprehensive ablation studies, explicit loss definitions, and established dataset protocols (Xie et al., 4 Feb 2025).
