RFMedSAM 2: Automated Medical Segmentation
- The paper presents RFMedSAM 2, which integrates a 3D UNet for automated prompt generation with adapterized SAM2 and a dual-stage refinement pipeline to boost Dice scores.
- It employs lightweight DWConvAdapter and CNN-Adapters that add less than 1% extra parameters, enhancing spatial inductive bias and multi-modal feature adaptation.
- The system achieves state-of-the-art performance on the BTCV and AMOS 2022 datasets, with the staged pipeline raising Dice from 0.856 to 0.867 on BTCV and from 0.895 to 0.907 on AMOS 2022 in multi-organ CT segmentation.
RFMedSAM 2 is an automated volumetric medical image segmentation system that integrates a prompt-generating 3D UNet with “adapterized” Segment Anything Model 2 (SAM 2) and a dual-stage refinement pipeline, achieving state-of-the-art segmentation accuracy on multi-organ CT datasets. It addresses prompt sensitivity and binary mask limitations of the original SAM/SAM 2 frameworks by coupling deep supervision, spatially inductive adapters, prompt learning, and temporal refinement.
1. Architecture and Fine-Tuning Adaptation Mechanisms
RFMedSAM 2 builds upon SAM 2, which comprises an image encoder (Hiera), prompt encoder, and mask decoder, supplemented with a memory encoder utilizing transformer-based attention for temporal information fusion. The adaptation in RFMedSAM 2 is implemented via two categories of lightweight, trainable modules: Depth-wise-Convolutional Adapters (DWConvAdapter) and CNN-Adapters.
DWConvAdapter modules are integrated after each image embedding attention block in:
- Hiera encoder transformer blocks
- Memory-attention transformer blocks
- Mask decoder transformer blocks
Each DWConvAdapter applies LayerNorm, a linear projection, a point-wise convolution, a depth-wise convolution, GeLU activation, a second LayerNorm, another point-wise convolution, and a residual skip that preserves the original embeddings. At the channel widths used in the paper, a typical DWConvAdapter introduces approximately 2,816 parameters.
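A minimal PyTorch sketch of such an adapter is given below; the bottleneck width, the 3×3 depth-wise kernel, and the token-to-spatial reshaping are assumptions for illustration rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DWConvAdapter(nn.Module):
    """Sketch of a DWConvAdapter following the description above (hidden width is an assumption)."""
    def __init__(self, dim: int, hidden: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.down = nn.Linear(dim, hidden)                                    # linear down-projection
        self.pw1 = nn.Conv2d(hidden, hidden, kernel_size=1)                   # point-wise convolution
        self.dw = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                            groups=hidden)                                    # depth-wise convolution
        self.act = nn.GELU()
        self.norm2 = nn.LayerNorm(hidden)
        self.pw2 = nn.Conv2d(hidden, dim, kernel_size=1)                      # point-wise conv back to dim

    def forward(self, x, H, W):
        # x: (B, H*W, dim) token embeddings from the preceding attention block
        B, N, C = x.shape
        h = self.down(self.norm1(x))                                          # (B, N, hidden)
        h = h.transpose(1, 2).reshape(B, -1, H, W)                            # restore spatial layout
        h = self.act(self.dw(self.pw1(h)))
        h = h.flatten(2).transpose(1, 2)                                      # back to tokens for LayerNorm
        h = self.norm2(h)
        h = h.transpose(1, 2).reshape(B, -1, H, W)
        h = self.pw2(h)
        h = h.flatten(2).transpose(1, 2)                                      # (B, N, dim)
        return x + h                                                          # residual skip
```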
CNN-Adapters are inserted after convolutional layers in:
- Encoder’s FPN (Feature Pyramid Network)
- Memory-encoder’s CNN blocks
- Mask-decoder’s final convolutional refinements
Each CNN-Adapter consists of a channel-reducing convolution, a depth-wise convolution, GeLU activation, LayerNorm, a channel-restoring convolution, and a residual skip, adding only a small number of parameters per module.
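A corresponding sketch of a CNN-Adapter, under the same caveats (the reduction ratio and the use of GroupNorm as a channel-wise LayerNorm substitute for convolutional features are assumptions):

```python
import torch.nn as nn

class CNNAdapter(nn.Module):
    """Sketch of a CNN-Adapter; the reduction ratio is an assumption."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = max(channels // reduction, 1)
        self.reduce = nn.Conv2d(channels, hidden, kernel_size=1)              # channel reduction
        self.dw = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                            groups=hidden)                                    # depth-wise convolution
        self.act = nn.GELU()
        self.norm = nn.GroupNorm(1, hidden)                                   # LayerNorm-style channel norm (assumption)
        self.restore = nn.Conv2d(hidden, channels, kernel_size=1)             # channel restoration

    def forward(self, x):
        # x: (B, channels, H, W) feature map from the preceding convolutional layer
        return x + self.restore(self.norm(self.act(self.dw(self.reduce(x)))))
```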
Together, the adapters contribute less than 1% additional parameters to the frozen 300M parameter SAM 2 backbone, yet significantly enhance spatial inductive bias and adaptation to multi-modal medical features (Xie et al., 4 Feb 2025).
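In practice, adapter-based fine-tuning of this kind freezes the pretrained backbone and marks only the adapter parameters as trainable; the name-matching rule below is a hypothetical convention for the sketch, not the paper's code.

```python
def set_adapter_only_trainable(model):
    """Freeze the SAM 2 backbone; train only adapter modules (assumes 'adapter' appears in their parameter names)."""
    trainable, total = 0, 0
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name.lower()
        trainable += param.numel() * param.requires_grad
        total += param.numel()
    print(f"trainable: {trainable / total:.2%} of {total:,} parameters")
```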
2. UNet-Based Prompt Generator
To mitigate precise prompt dependency, RFMedSAM 2 incorporates a standalone 3D UNet for fully automated prompt generation.
- Input: A fixed-size, single-channel patch of CT intensities cropped from the volume.
- Encoder: Four down-sampling stages that double the channel count at each stage from 32 up to 256, using convolutions, ReLU, instance normalization, and max-pooling.
- Decoder: Four stages that progressively halve the channel count, using transposed convolutions, skip connections, convolutions, ReLU, and instance normalization.
- Output: An $N_c$-channel soft mask volume, where $N_c$ is the organ class count (e.g., 13 for BTCV).
- Deep supervision: Auxiliary convolutional outputs at each decoder level, enabling multi-resolution supervision.
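The following compact PyTorch sketch illustrates such a prompt-generating 3D UNet with deep-supervision heads; kernel sizes, the exact stage layout, and the head placement are simplifying assumptions.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Two 3x3x3 convolutions with instance-norm and ReLU (kernel size is an assumption)."""
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, 3, padding=1), nn.InstanceNorm3d(out_ch), nn.ReLU(inplace=True),
        nn.Conv3d(out_ch, out_ch, 3, padding=1), nn.InstanceNorm3d(out_ch), nn.ReLU(inplace=True),
    )

class PromptUNet3D(nn.Module):
    """Minimal 3D UNet prompt generator with deep supervision, per the description above."""
    def __init__(self, n_classes, widths=(32, 64, 128, 256)):
        super().__init__()
        self.encoders = nn.ModuleList()
        in_ch = 1                                        # single-channel CT input
        for w in widths:
            self.encoders.append(conv_block(in_ch, w))
            in_ch = w
        self.pool = nn.MaxPool3d(2)
        self.upsamples = nn.ModuleList()
        self.decoders = nn.ModuleList()
        self.heads = nn.ModuleList()                     # auxiliary outputs for deep supervision
        for w_hi, w_lo in zip(widths[::-1][:-1], widths[::-1][1:]):
            self.upsamples.append(nn.ConvTranspose3d(w_hi, w_lo, 2, stride=2))
            self.decoders.append(conv_block(2 * w_lo, w_lo))
            self.heads.append(nn.Conv3d(w_lo, n_classes, 1))

    def forward(self, x):
        skips = []
        for i, enc in enumerate(self.encoders):
            x = enc(x)
            if i < len(self.encoders) - 1:
                skips.append(x)
                x = self.pool(x)
        outputs = []
        for up, dec, head, skip in zip(self.upsamples, self.decoders, self.heads, reversed(skips)):
            x = dec(torch.cat([up(x), skip], dim=1))
            outputs.append(head(x))
        return outputs[::-1]                             # finest-resolution logits first; softmax yields soft masks
```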
Bounding boxes for each organ are extracted from the generated masks as axis-aligned boxes and converted into SAM 2's bounding-box and point prompts on a per-slice basis.
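The box-extraction step can be illustrated as below; the thresholding value and the per-slice $(x_{\min}, y_{\min}, x_{\max}, y_{\max})$ format are assumptions made for the sketch.

```python
import numpy as np

def bboxes_from_masks(soft_masks, threshold=0.5):
    """Extract per-class, per-slice axis-aligned bounding boxes from an (Nc, D, H, W) soft mask volume."""
    boxes = {}
    for c in range(soft_masks.shape[0]):
        binary = soft_masks[c] > threshold
        per_slice = []
        for z in range(binary.shape[0]):
            ys, xs = np.nonzero(binary[z])
            if len(xs) == 0:
                per_slice.append(None)                                   # organ absent on this slice
            else:
                per_slice.append((xs.min(), ys.min(), xs.max(), ys.max()))  # (x_min, y_min, x_max, y_max)
        boxes[c] = per_slice
    return boxes
```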
Loss at each output resolution $r$ is defined as:

$$\mathcal{L}_r = \mathcal{L}_{\mathrm{CE}}(P_r, Y_r) + \mathcal{L}_{\mathrm{Dice}}(P_r, Y_r)$$

The total deep-supervised loss is:

$$\mathcal{L}_{\mathrm{DS}} = \sum_{r} w_r \, \mathcal{L}_r$$

with $w_r$ the per-resolution weight, $P_r$ the prediction at resolution $r$, and $Y_r$ the correspondingly down-sampled ground truth.
3. Dual-Stage Prompt and Mask Refinement Pipeline
RFMedSAM 2 employs a three-step volumetric segmentation pipeline (UNet prompting followed by two SAM 2 refinement stages):
- Step 0: UNet Mask Prediction
- Input: CT volume (or sub-volume)
- Output: Organ masks $M_0$ and derived bounding boxes $B_0$
- Step 1: SAM 2 Mask Generation
- Prompts: Bounding boxes $B_0$ and mask center points
- Output: Refined masks $M_1$, updated object-pointer scores, and new bounding boxes $B_1$
- Step 2: Memory-Attention Refinement
- Prompts: $B_1$ plus memory features from up to six preceding slices
- Output: Final mask $M_2$
Empirically, this pipeline improves Dice at each stage; for BTCV, Dice progresses 0.856 → 0.864 → 0.867, and for AMOS2022, 0.895 → 0.898 → 0.907.
Sample inference pseudocode:
```python
def rfmedsam2_inference(input_patch, prev6_frames):
    # Step 0: automated prompt generation with the 3D UNet
    M0 = UNet3D(input_patch)
    B0 = bboxes_from_masks(M0)
    # Step 1: first SAM 2 pass, prompted by the UNet-derived boxes
    mask1, obj_scores = SAM2_stage1(input_patch, prompts=B0)
    B1 = bboxes_from_masks(mask1)
    # Step 2: memory-attention refinement using features from up to six preceding slices
    mask2 = SAM2_stage2(input_patch, prompts=B1, memory=prev6_frames)
    return mask2
```
4. Datasets, Preprocessing, and Training Protocols
Experiments are conducted on BTCV and AMOS 2022 datasets.
BTCV: 30 abdominal CT volumes, 13 organs, 24 train / 4 validation, no standard test set. Volumes are resampled to a common voxel spacing, intensities are clipped to an abdominal HU window and normalized, and patches are extracted by cropping or a sliding window.
AMOS 2022: 200 CT scans, 16 organs, five-fold cross-validation, challenge test set withheld. Preprocessing matches BTCV.
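A minimal sketch of the CT preprocessing described above; the HU window and normalization scheme shown here are illustrative assumptions, since the exact values are not reproduced in this summary.

```python
import numpy as np

def preprocess_ct(volume, hu_window=(-175, 250)):
    """Clip CT intensities to an abdominal HU window and z-score normalize (window values are illustrative)."""
    v = np.clip(volume, *hu_window).astype(np.float32)
    return (v - v.mean()) / (v.std() + 1e-8)
```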
Training utilizes SGD (momentum = 0.99) with weight decay, an initial learning rate $\eta_0$, and a polynomially decaying ("poly") learning-rate schedule:

$$\eta_t = \eta_0 \left(1 - \frac{t}{T}\right)^{p},$$

where $T$ is the total number of training epochs and $p$ the decay exponent.
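The poly schedule can be computed per epoch as in the sketch below; the exponent of 0.9 is an assumption reflecting the common nnUNet default rather than a value stated here.

```python
def poly_lr(initial_lr, epoch, max_epochs, exponent=0.9):
    """Polynomial learning-rate decay (exponent value is an assumed default)."""
    return initial_lr * (1 - epoch / max_epochs) ** exponent
```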
Batch size per GPU is 1 (limited by 3D memory). Extensive on-the-fly augmentations are employed: random rotation/scaling, Gaussian noise/blur, contrast/brightness jitter, gamma variation, simulated low resolution, and random spatial mirroring.
5. Quantitative Results and Ablations
RFMedSAM 2 achieves state-of-the-art results across key metrics.
AMOS 2022:
- nnUNet (5-fold cv): 0.878
- RFMedSAM 2 (learnable prompts, no GT): 0.907 (+2.9 Dice points over nnUNet)
- Other prompt-free SAMs: SAMed 0.772, SAM3D 0.751
BTCV:
- nnUNet: 0.802
- RFMedSAM 2 w/ GT boxes: 0.923 (+12 Dice points over nnUNet)
- RFMedSAM 2 (learnable UNet masks, two-stage SAM 2): 0.867 (+6.5 Dice points over nnUNet)
- Other prompt-free SAMs: SAMed 0.776, SAM3D 0.801
Key ablations reveal the importance of frame selection (current + 6 previous slices: +0.84% DSC), DWConvAdapter insertion (+0.47% DSC), CNN-Adapter insertion (+0.25% DSC), and prompt generation strategy: learnable U-Net masks outperform learnable bounding boxes and variants without object-score supervision.
Segmentation quality is tightly linked to prompt generation and memory refinement stages, with each step contributing incremental improvements.
6. Mathematical Formulation of Segmentation Metrics and Losses
The Dice Similarity Coefficient (DSC) for a predicted binary volume $P$ and ground truth $G$ is:

$$\mathrm{DSC}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}$$

Operatively, for predicted soft maps $p_i$ and ground-truth labels $g_i$ over voxels $i$:

$$\mathrm{DSC} = \frac{2 \sum_i p_i\, g_i}{\sum_i p_i + \sum_i g_i}$$
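A direct translation of the soft Dice computation into PyTorch, with a small epsilon added for numerical stability (an implementation detail assumed here):

```python
import torch

def dice_coefficient(pred, target, eps=1e-6):
    """Soft Dice between a predicted probability map and a binary ground-truth volume of the same shape."""
    intersection = (pred * target).sum()
    return (2 * intersection + eps) / (pred.sum() + target.sum() + eps)
```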
Loss functions include multi-class cross-entropy and soft Dice. For predicted probabilities $p_{i,c}$, one-hot ground truth $g_{i,c}$, voxels $i = 1,\dots,N$, and classes $c = 1,\dots,C$:

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{c=1}^{C} g_{i,c} \log p_{i,c}, \qquad
\mathcal{L}_{\mathrm{Dice}} = 1 - \frac{1}{C} \sum_{c=1}^{C} \frac{2 \sum_i p_{i,c}\, g_{i,c}}{\sum_i p_{i,c} + \sum_i g_{i,c}}$$

Combined supervised loss at each resolution $r$:

$$\mathcal{L}_r = \mathcal{L}_{\mathrm{CE}}(P_r, Y_r) + \mathcal{L}_{\mathrm{Dice}}(P_r, Y_r)$$

With deep supervision over $R$ resolutions:

$$\mathcal{L}_{\mathrm{DS}} = \sum_{r=0}^{R-1} w_r \, \mathcal{L}_r,$$

where $w_r$ is the weight assigned to resolution $r$ (typically halved at each coarser level) and $\sum_r w_r = 1$.
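A sketch combining these terms into the deep-supervised training loss; the halving-then-normalizing weight scheme is an assumption consistent with common deep-supervision practice, not a confirmed detail of the paper.

```python
import torch
import torch.nn.functional as F

def deep_supervised_loss(outputs, targets, weights=None):
    """Combined CE + soft-Dice loss with deep supervision.

    `outputs` is a list of (B, C, D, H, W) logits at decreasing resolutions,
    `targets` a list of matching (B, D, H, W) integer label maps.
    """
    if weights is None:
        weights = [2.0 ** -r for r in range(len(outputs))]      # halve the weight at each coarser level
        weights = [w / sum(weights) for w in weights]            # normalize so the weights sum to 1
    total = 0.0
    for w, logits, labels in zip(weights, outputs, targets):
        ce = F.cross_entropy(logits, labels)
        probs = torch.softmax(logits, dim=1)
        onehot = F.one_hot(labels, num_classes=logits.shape[1]).movedim(-1, 1).float()
        dims = tuple(range(2, probs.ndim))                       # sum over spatial dimensions
        inter = (probs * onehot).sum(dims)
        denom = probs.sum(dims) + onehot.sum(dims)
        dice = 1 - ((2 * inter + 1e-6) / (denom + 1e-6)).mean()
        total = total + w * (ce + dice)
    return total
```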
7. Context, Significance, and Implications
RFMedSAM 2 demonstrates that the prompt dependency and lack of semantic output in SAM/SAM 2 are surmountable through joint automated prompt generation, adapter-based fine-tuning, and temporal refinement. The approach achieves Dice scores of 0.87–0.92, substantially exceeding prior state-of-the-art benchmarks for volumetric CT organ segmentation, while requiring minimal additional learnable parameters.
A plausible implication is that further integration of multi-modal data and more sophisticated prompt-generation modules could extend the system’s applicability beyond segmenting abdominal CT to other volumetric medical image modalities. This suggests the methodology is robust and adaptable, pending careful calibration of the inductive bias and prompt supervision mechanisms.
RFMedSAM 2 provides a reproducible baseline for adapter-based application of prompt-driven foundation models to clinical segmentation, as substantiated by comprehensive ablation studies, explicit loss definitions, and established dataset protocols (Xie et al., 4 Feb 2025).