
Diffusion-Empowered AutoMedSAM

Updated 25 November 2025
  • The paper introduces AutoMedSAM, an end-to-end framework that automates semantic segmentation via a diffusion-based dual-branch prompt encoder.
  • It leverages a joint uncertainty-aware multi-loss strategy and adapts MedSAM’s backbone to optimize class-specific mask prediction.
  • Empirical evaluations across CT, MR, and X-ray modalities demonstrate superior segmentation accuracy and robust cross-dataset generalization.

Diffusion-Empowered AutoPrompt MedSAM (AutoMedSAM) is an end-to-end medical image segmentation framework that extends the Segment Anything Model (SAM) and its medical adaptation, MedSAM. Addressing the notable challenges of manual prompt dependency and lack of semantic labeling in conventional MedSAM, AutoMedSAM integrates a diffusion-based dual-branch prompt encoder to automate class-conditioned segmentation. This framework enables fully automated mask prediction with semantic association, optimized via a joint uncertainty-aware multi-loss strategy, and demonstrates superior segmentation accuracy and generalization across multiple clinical imaging modalities (Huang et al., 5 Feb 2025).

1. Architecture Overview

AutoMedSAM retains the architectural backbone of MedSAM, composed of a frozen image encoder $E_I$ and a mask decoder $D_M$, while fundamentally replacing the manual prompt encoder with a diffusion-based class prompt encoder $E_P$. The input image $I \in \mathbb{R}^{h \times w \times 3}$ is encoded to feature maps

$$F_I = E_I(I), \quad F_I \in \mathbb{R}^{B \times C \times H \times W}.$$

Given an anatomical class index $c$, the encoder $E_P$ generates two prompt embeddings,

$$(P_s^{(c)}, P_d^{(c)}) = E_P(F_I, c),$$

where $P_s^{(c)}$ (the sparse prompt) encodes global cues and $P_d^{(c)}$ (the dense prompt) encodes local features. The mask decoder $D_M$ then combines the image features, positional embedding $P_p$, and both prompt vectors to predict the segmentation mask:

$$M^{(c)} = D_M(F_I, P_p, P_s^{(c)}, P_d^{(c)}).$$

This pipeline eliminates the need for manual clicks, boxes, or scribbles and embeds semantic class information directly into the segmentation masks (Huang et al., 5 Feb 2025).
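The end-to-end flow fits in a few lines. The following is a minimal PyTorch-style sketch, with `E_I`, `E_P`, `D_M`, and the positional embedding `P_p` treated as given modules; the exact tensor interfaces are assumptions for illustration, not taken from the paper:

```python
import torch
import torch.nn as nn

def automedsam_predict(E_I: nn.Module, E_P: nn.Module, D_M: nn.Module,
                       P_p: torch.Tensor, image: torch.Tensor,
                       class_idx: torch.Tensor) -> torch.Tensor:
    """Fully automatic, class-conditioned mask prediction (interfaces assumed)."""
    with torch.no_grad():
        F_I = E_I(image)               # frozen image encoder: (B, C, H, W)
    P_s, P_d = E_P(F_I, class_idx)     # diffusion-based class prompt encoder
    return D_M(F_I, P_p, P_s, P_d)     # mask decoder -> M^(c), no manual prompts
```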

2. Diffusion-Based Class Prompt Encoder Design

AutoMedSAM’s class prompt encoder $E_P$ operates as a conditional diffusion model. The class index $c$ is projected and reshaped for conditioning:

$$c_{\text{proj}} = W_c c + b_c, \quad c_{\text{expand}} = \operatorname{reshape}(c_{\text{proj}}) \in \mathbb{R}^{B \times 1 \times H \times W}.$$

For forward diffusion, isotropic Gaussian noise $\epsilon_t \sim \mathcal{N}(0, \sigma_t^2 I)$ with $\sigma_t = 1/(t+1)$ is added to the image feature,

$$F_t = F_I + \epsilon_t + c_{\text{expand}}.$$

This forms the noisy, class-conditioned embedding.
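A compact sketch of this conditioning and forward step is given below. The one-hot class representation and the single linear projection are assumptions made for illustration; only the equations themselves come from the text:

```python
import torch
import torch.nn as nn

class ClassConditioner(nn.Module):
    """Maps a one-hot class vector to c_expand in R^{B x 1 x H x W} (shapes assumed)."""
    def __init__(self, num_classes: int, h: int, w: int):
        super().__init__()
        self.proj = nn.Linear(num_classes, h * w)   # c_proj = W_c c + b_c
        self.h, self.w = h, w

    def forward(self, c_onehot: torch.Tensor) -> torch.Tensor:
        c_proj = self.proj(c_onehot)                # (B, h*w)
        return c_proj.view(-1, 1, self.h, self.w)   # reshape to c_expand

def forward_diffusion(F_I: torch.Tensor, c_expand: torch.Tensor, t: int) -> torch.Tensor:
    """F_t = F_I + eps_t + c_expand, with eps_t ~ N(0, sigma_t^2 I) and sigma_t = 1/(t+1)."""
    eps_t = torch.randn_like(F_I) / (t + 1)         # isotropic Gaussian noise
    return F_I + eps_t + c_expand                   # noisy, class-conditioned embedding
```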

The reverse diffusion employs a U-Net structure, processing $F_t$ through convolutional layers with class re-injection at each layer. The encoder branches into:

  • Dense/local branch: element-wise attention is computed as

$$A_{\rm dense}^{(\ell)} = \sigma\bigl(W_{\rm att}^{(\ell)} * F_{\rm att}^{(\ell)} + b_{\rm att}^{(\ell)}\bigr),$$

followed by masked feature multiplication and upsampling to produce $P_d^{(c)}$.

  • Sparse/global branch: channel attention leverages spatially average-pooled features,

$$F_{\rm global}^{(\ell)} = \mathrm{AdaptiveAvgPool2D}\bigl(F_{\rm att}^{(\ell)}\bigr),$$

and produces $P_s^{(c)}$ via channel-wise scaling.

The final prompt embeddings are typically concatenated, $P^{(c)} = [P_s^{(c)}; P_d^{(c)}]$, which integrates both fine-grained and global context within the prompt representation (Huang et al., 5 Feb 2025).
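The two branches can be sketched as follows. The 1×1 convolutions, sigmoid gates, and pooling mirror the equations above; the layer counts, the upsampling factor, and how the pooled features are flattened into $P_s^{(c)}$ are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchPromptHead(nn.Module):
    """Sketch of the dense (element-wise) and sparse (channel) attention branches."""
    def __init__(self, channels: int):
        super().__init__()
        self.w_att = nn.Conv2d(channels, channels, kernel_size=1)   # W_att, b_att
        self.w_ch = nn.Conv2d(channels, channels, kernel_size=1)    # channel attention

    def forward(self, F_att: torch.Tensor):
        # Dense/local branch: sigmoid gate, masked multiplication, upsampling
        A_dense = torch.sigmoid(self.w_att(F_att))
        P_d = F.interpolate(A_dense * F_att, scale_factor=2,
                            mode="bilinear", align_corners=False)
        # Sparse/global branch: channel attention from spatially pooled features
        F_global = F.adaptive_avg_pool2d(F_att, output_size=1)      # (B, C, 1, 1)
        A_sparse = torch.sigmoid(self.w_ch(F_global))
        P_s = (A_sparse * F_att).flatten(2).mean(-1)                # (B, C) global prompt
        return P_s, P_d
```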

3. Prompt Integration and Segmentation Mask Generation

The mask decoder $D_M$ incorporates prompt embeddings $P^{(c)}$ via cross-attention mechanisms:

$$Q = W_Q F_I, \quad K = W_K P^{(c)}, \quad V = W_V P^{(c)},$$

$$\operatorname{Attn}(Q, K, V) = \operatorname{Softmax}\Bigl(\frac{Q K^\top}{\sqrt{d_k}}\Bigr) V.$$

Semantic prompt features are injected into the decoder’s latent space, ensuring that output masks $M^{(c)}$ encode both object shape and class semantics. This design provides fully automated semantic segmentation for specified anatomical classes, broadening utility for both clinical and non-expert contexts (Huang et al., 5 Feb 2025).
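A minimal cross-attention sketch, assuming image features are flattened into tokens and the prompt embedding $P^{(c)}$ is supplied as a token sequence of the same model width:

```python
import math
import torch
import torch.nn as nn

class PromptCrossAttention(nn.Module):
    """Image tokens (queries) attend to prompt tokens (keys/values); a sketch."""
    def __init__(self, d_model: int, d_k: int):
        super().__init__()
        self.W_Q = nn.Linear(d_model, d_k)
        self.W_K = nn.Linear(d_model, d_k)
        self.W_V = nn.Linear(d_model, d_k)

    def forward(self, F_I_tokens: torch.Tensor, P_c: torch.Tensor) -> torch.Tensor:
        Q = self.W_Q(F_I_tokens)                   # (B, N_img, d_k) from image features
        K, V = self.W_K(P_c), self.W_V(P_c)        # (B, N_prompt, d_k) from P^(c)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(K.size(-1))
        return torch.softmax(scores, dim=-1) @ V   # prompt-conditioned image tokens
```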

4. Joint Training with Uncertainty-Aware Loss Balancing

AutoMedSAM is optimized with a joint objective comprising five loss components:

  1. Sparse prompt MSE:

$$L_{\rm MSE}^S = \frac{1}{n}\sum_{c,i} \bigl\| P_{s,i}^{(c)} - \widehat P_{s,i}^{(c)} \bigr\|_2^2.$$

  2. Dense prompt MSE:

$$L_{\rm MSE}^D = \frac{1}{n}\sum_{c,i} \bigl\| P_{d,i}^{(c)} - \widehat P_{d,i}^{(c)} \bigr\|_2^2.$$

  3. Dice loss:

$$L_{\rm DC} = 1 - \frac{2\sum_{c,i} M_i^{(c)} \widehat M_i^{(c)}}{\sum_{c,i} \bigl(M_i^{(c)}\bigr)^2 + \sum_{c,i} \bigl(\widehat M_i^{(c)}\bigr)^2}.$$

  4. Cross-entropy loss:

$$L_{\rm CE} = -\frac{1}{n}\sum_{c,i} \Bigl[ M_i^{(c)} \log \widehat M_i^{(c)} + \bigl(1 - M_i^{(c)}\bigr) \log \bigl(1 - \widehat M_i^{(c)}\bigr) \Bigr].$$

  5. Shape-distance loss (see the sketch after this list):

$$L_{\rm SD} = \frac{1}{nh} \sum_{i,ch} \frac{\sum_{h,w} \bigl| D(M_{i,ch}^{(c)})(h,w) - \widehat M_{i,ch}^{(c)}(h,w) \bigr|}{\sum_{h,w} \widehat M_{i,ch}^{(c)}(h,w)}.$$
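The shape-distance term is the least standard of the five. Below is one plausible reading, assuming $D(\cdot)$ is the Euclidean distance transform of the ground-truth mask (computed with `scipy.ndimage.distance_transform_edt`) and that masks are binary per channel:

```python
import torch
from scipy.ndimage import distance_transform_edt

def shape_distance_loss(M_gt: torch.Tensor, M_pred: torch.Tensor) -> torch.Tensor:
    """L_SD sketch for (B, CH, H, W) masks; D(.) assumed to be a distance transform."""
    maps = [torch.from_numpy(distance_transform_edt(m.cpu().numpy()))
            for m in M_gt.flatten(0, 1)]             # one map per (sample, channel)
    D = torch.stack(maps).view_as(M_gt).to(M_pred)   # back to (B, CH, H, W)
    num = (D - M_pred).abs().sum(dim=(-2, -1))       # sum over spatial locations
    den = M_pred.sum(dim=(-2, -1)).clamp_min(1e-6)   # normalize by predicted mass
    return (num / den).mean()                        # average over samples and channels
```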

Loss terms are dynamically weighted using the uncertainty weighting framework of Tsai et al.:

$$L_{\rm total} = \sum_{j} \Bigl( \frac{1}{2 \lambda_j^2} L_j + \log\bigl(1 + \lambda_j^2\bigr) \Bigr).$$

This obviates manual tuning of loss weights and facilitates balanced learning across heterogeneous objectives (Huang et al., 5 Feb 2025).
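A minimal learnable-weighting module implementing this objective, with one parameter $\lambda_j$ per loss term (the unit initialization is an assumption):

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """L_total = sum_j ( L_j / (2 * lambda_j^2) + log(1 + lambda_j^2) )."""
    def __init__(self, num_losses: int = 5):
        super().__init__()
        self.lam = nn.Parameter(torch.ones(num_losses))   # learnable lambda_j

    def forward(self, losses) -> torch.Tensor:
        total = torch.zeros((), device=self.lam.device)
        for lam_j, L_j in zip(self.lam, losses):
            lam2 = lam_j ** 2
            # log(1 + lambda^2) penalizes large lambda_j, so a loss cannot be
            # trivially silenced by inflating its uncertainty
            total = total + L_j / (2 * lam2) + torch.log1p(lam2)
        return total
```

In the Section 5 training loop, the call `uncertainty_weighted(L1, ..., L5)` can be read as an instance of such a module.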

5. Training Procedure

During training, the image encoder $E_I$ remains frozen while $E_P$ and $D_M$ are updated. Optimization employs AdamW with learning rate $5 \times 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$, using a reduce-on-plateau scheduler (factor 0.9, patience 5), a batch size of 16, and up to 100 epochs. The core process follows:

```python
import random
import torch

# Optimizer/scheduler as reported: AdamW (lr 5e-4) with reduce-on-plateau
optimizer = torch.optim.AdamW(
    list(E_P.parameters()) + list(D_M.parameters()),  # E_I stays frozen
    lr=5e-4, betas=(0.9, 0.999), eps=1e-8,
)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.9, patience=5)

for epoch in range(100):
    for I, c, M_gt in train_loader:
        with torch.no_grad():
            F_I = E_I(I)                              # frozen image encoder
        t = random.randint(0, T - 1)                  # sample a diffusion timestep
        eps_t = torch.randn_like(F_I) / (t + 1)       # noise with sigma_t = 1/(t+1)
        F_t = F_I + c_expand(c) + eps_t               # forward diffusion
        P_s, P_d = E_P.reverse_diffusion(F_t, c)      # prompt encoding
        M_pred = D_M(F_I, P_p, P_s, P_d)              # mask prediction
        L1 = mse(P_s, MedSAM_s)                       # sparse-prompt MSE (MedSAM target)
        L2 = mse(P_d, MedSAM_d)                       # dense-prompt MSE (MedSAM target)
        L3 = dice(M_pred, M_gt)                       # Dice loss
        L4 = ce(M_pred, M_gt)                         # cross-entropy loss
        L5 = shape_dist(M_pred, M_gt)                 # shape-distance loss
        L = uncertainty_weighted(L1, L2, L3, L4, L5)  # Section 4 weighting
        optimizer.zero_grad()
        L.backward()
        optimizer.step()
    scheduler.step(L)                                 # plateau-based LR decay
```
This strategy enables efficient transfer of MedSAM’s pre-trained image representations while adapting the prompt and mask decoder modules to the fully automated, class-specific task (Huang et al., 5 Feb 2025).

6. Empirical Evaluation

AutoMedSAM is evaluated across diverse medical imaging datasets: AbdomenCT-1K (CT, 5 organs), BraTS (MR-FLAIR, tumor), Kvasir-SEG (endoscopy, polyp), Chest-XML (X-ray, lung), and in cross-dataset scenarios (AMOS, BraTS-T1CE). Performance is measured using Dice Similarity Coefficient (DSC) and Normalized Surface Distance (NSD).

Representative Quantitative Results (AbdomenCT-1K):

| Method | DSC (%) | NSD (%) |
| --- | --- | --- |
| MedSAM | 93.505 | 92.969 |
| SurgicalSAM | 75.505 | 70.119 |
| AutoMedSAM (O) | 94.580 | 95.148 |

On single-object datasets (BraTS, Kvasir, Chest-XML), AutoMedSAM outperforms all baselines by 1–5 points in DSC and NSD. Cross-dataset evaluation (train: AbdomenCT, test: AMOS) yields DSC 71.14% for AutoMedSAM vs. 56.93% for SurgicalSAM. Ablation studies demonstrate the benefits of dual-branch prompts, diffusion, and uncertainty weighting (Huang et al., 5 Feb 2025).

7. Strengths, Limitations, and Future Directions

AutoMedSAM delivers a fully automated, semantically labeled segmentation workflow, eliminating manual prompt annotation and enabling class-aware mask generation. The dual-branch diffusion encoder captures both global and local context, and uncertainty weighting harmonizes joint optimization. Nevertheless, computational overhead from diffusion steps is nontrivial, and current deployments require a predefined class index set, precluding open-vocabulary extension. There may be performance degradation on extremely small or highly noisy structures. Future work will target lightweight diffusion models, open-set recognition, and scaling to 3D volumetric data.

AutoMedSAM establishes a state-of-the-art, prompt-free, and semantically explicit segmentation paradigm for clinical and non-expert end users (Huang et al., 5 Feb 2025).
