
Diffusion-Empowered AutoMedSAM

Updated 25 November 2025
  • The paper introduces AutoMedSAM, an end-to-end framework that automates semantic segmentation via a diffusion-based dual-branch prompt encoder.
  • It leverages a joint uncertainty-aware multi-loss strategy and adapts MedSAM’s backbone to optimize class-specific mask prediction.
  • Empirical evaluations across CT, MR, and X-ray modalities demonstrate superior segmentation accuracy and robust cross-dataset generalization.

Diffusion-Empowered AutoPrompt MedSAM (AutoMedSAM) is an end-to-end medical image segmentation framework that extends the Segment Anything Model (SAM) and its medical adaptation, MedSAM. Addressing the notable challenges of manual prompt dependency and lack of semantic labeling in conventional MedSAM, AutoMedSAM integrates a diffusion-based dual-branch prompt encoder to automate class-conditioned segmentation. This framework enables fully automated mask prediction with semantic association, optimized via a joint uncertainty-aware multi-loss strategy, and demonstrates superior segmentation accuracy and generalization across multiple clinical imaging modalities (Huang et al., 5 Feb 2025).

1. Architecture Overview

AutoMedSAM retains the architectural backbone of MedSAM, composed of a frozen image encoder $E_I$ and a mask decoder $D_M$, while fundamentally replacing the manual prompt encoder with a diffusion-based class prompt encoder $E_P$. The input image $I \in \mathbb{R}^{h \times w \times 3}$ is encoded to feature maps

$$F_I = E_I(I), \quad F_I \in \mathbb{R}^{B \times C \times H \times W}.$$

Given an anatomical class index $c$, the encoder $E_P$ generates two prompt embeddings,

$$(P_s^{(c)}, P_d^{(c)}) = E_P(F_I, c),$$

where $P_s^{(c)}$ (the sparse prompt) encodes global cues and $P_d^{(c)}$ (the dense prompt) encodes local features. The mask decoder $D_M$ then combines the image features, positional embedding $P_p$, and both prompt vectors to predict the segmentation mask:

$$M^{(c)} = D_M(F_I, P_p, P_s^{(c)}, P_d^{(c)}).$$

This pipeline eliminates the need for manual clicks, boxes, or scribbles and embeds semantic class information directly into the segmentation masks (Huang et al., 5 Feb 2025).
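The end-to-end flow fits in a few lines. The following is a minimal PyTorch-style sketch, with `E_I`, `E_P`, `D_M`, and the positional embedding `P_p` treated as given modules; the exact tensor interfaces are assumptions for illustration, not taken from the paper:

```python
import torch
import torch.nn as nn

def automedsam_predict(E_I: nn.Module, E_P: nn.Module, D_M: nn.Module,
                       P_p: torch.Tensor, image: torch.Tensor,
                       class_idx: torch.Tensor) -> torch.Tensor:
    """Fully automatic, class-conditioned mask prediction (interfaces assumed)."""
    with torch.no_grad():
        F_I = E_I(image)               # frozen image encoder: (B, C, H, W)
    P_s, P_d = E_P(F_I, class_idx)     # diffusion-based class prompt encoder
    return D_M(F_I, P_p, P_s, P_d)     # mask decoder -> M^(c), no manual prompts
```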

2. Diffusion-Based Class Prompt Encoder Design

AutoMedSAM’s class prompt encoder $E_P$ operates as a conditional diffusion model. The class index $c$ is projected and reshaped for conditioning:

$$c_{\text{proj}} = W_c c + b_c, \quad c_{\text{expand}} = \operatorname{reshape}(c_{\text{proj}}) \in \mathbb{R}^{B \times 1 \times H \times W}.$$

For forward diffusion, isotropic Gaussian noise $\epsilon_t \sim \mathcal{N}(0, \sigma_t^2 I)$ with $\sigma_t = 1/(t+1)$ is added to the image feature,

$$F_t = F_I + \epsilon_t + c_{\text{expand}}.$$

This forms the noisy, class-conditioned embedding.
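A compact sketch of this conditioning and forward step is given below. The one-hot class representation and the single linear projection are assumptions made for illustration; only the equations themselves come from the text:

```python
import torch
import torch.nn as nn

class ClassConditioner(nn.Module):
    """Maps a one-hot class vector to c_expand in R^{B x 1 x H x W} (shapes assumed)."""
    def __init__(self, num_classes: int, h: int, w: int):
        super().__init__()
        self.proj = nn.Linear(num_classes, h * w)   # c_proj = W_c c + b_c
        self.h, self.w = h, w

    def forward(self, c_onehot: torch.Tensor) -> torch.Tensor:
        c_proj = self.proj(c_onehot)                # (B, h*w)
        return c_proj.view(-1, 1, self.h, self.w)   # reshape to c_expand

def forward_diffusion(F_I: torch.Tensor, c_expand: torch.Tensor, t: int) -> torch.Tensor:
    """F_t = F_I + eps_t + c_expand, with eps_t ~ N(0, sigma_t^2 I) and sigma_t = 1/(t+1)."""
    eps_t = torch.randn_like(F_I) / (t + 1)         # isotropic Gaussian noise
    return F_I + eps_t + c_expand                   # noisy, class-conditioned embedding
```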

The reverse diffusion employs a U-Net structure, processing $F_t$ through convolutional layers with class re-injection at each layer. The encoder branches into:

  • Dense/local branch: element-wise attention is computed as

$$A_{\rm dense}^{(\ell)} = \sigma\bigl(W_{\rm att}^{(\ell)} * F_{\rm att}^{(\ell)} + b_{\rm att}^{(\ell)}\bigr),$$

followed by masked feature multiplication and upsampling to produce $P_d^{(c)}$.

  • Sparse/global branch: channel attention leverages spatially average-pooled features,

$$F_{\rm global}^{(\ell)} = \mathrm{AdaptiveAvgPool2D}\bigl(F_{\rm att}^{(\ell)}\bigr),$$

and produces $P_s^{(c)}$ via channel-wise scaling.

The final prompt embeddings are typically concatenated, $P^{(c)} = [P_s^{(c)}; P_d^{(c)}]$, which integrates both fine-grained and global context within the prompt representation (Huang et al., 5 Feb 2025).
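The two branches can be sketched as follows. The 1×1 convolutions, sigmoid gates, and pooling mirror the equations above; the layer counts, the upsampling factor, and how the pooled features are flattened into $P_s^{(c)}$ are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualBranchPromptHead(nn.Module):
    """Sketch of the dense (element-wise) and sparse (channel) attention branches."""
    def __init__(self, channels: int):
        super().__init__()
        self.w_att = nn.Conv2d(channels, channels, kernel_size=1)   # W_att, b_att
        self.w_ch = nn.Conv2d(channels, channels, kernel_size=1)    # channel attention

    def forward(self, F_att: torch.Tensor):
        # Dense/local branch: sigmoid gate, masked multiplication, upsampling
        A_dense = torch.sigmoid(self.w_att(F_att))
        P_d = F.interpolate(A_dense * F_att, scale_factor=2,
                            mode="bilinear", align_corners=False)
        # Sparse/global branch: channel attention from spatially pooled features
        F_global = F.adaptive_avg_pool2d(F_att, output_size=1)      # (B, C, 1, 1)
        A_sparse = torch.sigmoid(self.w_ch(F_global))
        P_s = (A_sparse * F_att).flatten(2).mean(-1)                # (B, C) global prompt
        return P_s, P_d
```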

3. Prompt Integration and Segmentation Mask Generation

The mask decoder $D_M$ incorporates prompt embeddings $P^{(c)}$ via cross-attention mechanisms:

$$Q = W_Q F_I, \quad K = W_K P^{(c)}, \quad V = W_V P^{(c)},$$

$$\operatorname{Attn}(Q, K, V) = \operatorname{Softmax}\Bigl(\frac{Q K^\top}{\sqrt{d_k}}\Bigr) V.$$

Semantic prompt features are injected into the decoder’s latent space, ensuring that output masks $M^{(c)}$ encode both object shape and class semantics. This design provides fully automated semantic segmentation for specified anatomical classes, broadening utility for both clinical and non-expert contexts (Huang et al., 5 Feb 2025).
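A minimal cross-attention sketch, assuming image features are flattened into tokens and the prompt embedding $P^{(c)}$ is supplied as a token sequence of the same model width:

```python
import math
import torch
import torch.nn as nn

class PromptCrossAttention(nn.Module):
    """Image tokens (queries) attend to prompt tokens (keys/values); a sketch."""
    def __init__(self, d_model: int, d_k: int):
        super().__init__()
        self.W_Q = nn.Linear(d_model, d_k)
        self.W_K = nn.Linear(d_model, d_k)
        self.W_V = nn.Linear(d_model, d_k)

    def forward(self, F_I_tokens: torch.Tensor, P_c: torch.Tensor) -> torch.Tensor:
        Q = self.W_Q(F_I_tokens)                   # (B, N_img, d_k) from image features
        K, V = self.W_K(P_c), self.W_V(P_c)        # (B, N_prompt, d_k) from P^(c)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(K.size(-1))
        return torch.softmax(scores, dim=-1) @ V   # prompt-conditioned image tokens
```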

4. Joint Training with Uncertainty-Aware Loss Balancing

AutoMedSAM is optimized with a joint objective comprising five loss components:

  1. Sparse prompt MSE:

$$L_{\rm MSE}^S = \frac{1}{n}\sum_{c,i} \bigl\| P_{s,i}^{(c)} - \widehat P_{s,i}^{(c)} \bigr\|_2^2.$$

  2. Dense prompt MSE:

$$L_{\rm MSE}^D = \frac{1}{n}\sum_{c,i} \bigl\| P_{d,i}^{(c)} - \widehat P_{d,i}^{(c)} \bigr\|_2^2.$$

  3. Dice loss:

$$L_{\rm DC} = 1 - \frac{2\sum_{c,i} M_i^{(c)} \widehat M_i^{(c)}}{\sum_{c,i} \bigl(M_i^{(c)}\bigr)^2 + \sum_{c,i} \bigl(\widehat M_i^{(c)}\bigr)^2}.$$

  4. Cross-entropy loss:

$$L_{\rm CE} = -\frac{1}{n}\sum_{c,i} \Bigl[ M_i^{(c)} \log \widehat M_i^{(c)} + \bigl(1 - M_i^{(c)}\bigr) \log \bigl(1 - \widehat M_i^{(c)}\bigr) \Bigr].$$

  5. Shape-distance loss (see the sketch after this list):

$$L_{\rm SD} = \frac{1}{nh} \sum_{i,ch} \frac{\sum_{h,w} \bigl| D(M_{i,ch}^{(c)})(h,w) - \widehat M_{i,ch}^{(c)}(h,w) \bigr|}{\sum_{h,w} \widehat M_{i,ch}^{(c)}(h,w)}.$$
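The shape-distance term is the least standard of the five. Below is one plausible reading, assuming $D(\cdot)$ is the Euclidean distance transform of the ground-truth mask (computed with `scipy.ndimage.distance_transform_edt`) and that masks are binary per channel:

```python
import torch
from scipy.ndimage import distance_transform_edt

def shape_distance_loss(M_gt: torch.Tensor, M_pred: torch.Tensor) -> torch.Tensor:
    """L_SD sketch for (B, CH, H, W) masks; D(.) assumed to be a distance transform."""
    maps = [torch.from_numpy(distance_transform_edt(m.cpu().numpy()))
            for m in M_gt.flatten(0, 1)]             # one map per (sample, channel)
    D = torch.stack(maps).view_as(M_gt).to(M_pred)   # back to (B, CH, H, W)
    num = (D - M_pred).abs().sum(dim=(-2, -1))       # sum over spatial locations
    den = M_pred.sum(dim=(-2, -1)).clamp_min(1e-6)   # normalize by predicted mass
    return (num / den).mean()                        # average over samples and channels
```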

Loss terms are dynamically weighted using the uncertainty weighting framework of Tsai et al.:

$$L_{\rm total} = \sum_{j} \Bigl( \frac{1}{2 \lambda_j^2} L_j + \log\bigl(1 + \lambda_j^2\bigr) \Bigr).$$

This obviates manual tuning of loss weights and facilitates balanced learning across heterogeneous objectives (Huang et al., 5 Feb 2025).
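A minimal learnable-weighting module implementing this objective, with one parameter $\lambda_j$ per loss term (the unit initialization is an assumption):

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """L_total = sum_j ( L_j / (2 * lambda_j^2) + log(1 + lambda_j^2) )."""
    def __init__(self, num_losses: int = 5):
        super().__init__()
        self.lam = nn.Parameter(torch.ones(num_losses))   # learnable lambda_j

    def forward(self, losses) -> torch.Tensor:
        total = torch.zeros((), device=self.lam.device)
        for lam_j, L_j in zip(self.lam, losses):
            lam2 = lam_j ** 2
            # log(1 + lambda^2) penalizes large lambda_j, so a loss cannot be
            # trivially silenced by inflating its uncertainty
            total = total + L_j / (2 * lam2) + torch.log1p(lam2)
        return total
```

In the Section 5 training loop, the call `uncertainty_weighted(L1, ..., L5)` can be read as an instance of such a module.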

5. Training Procedure

During training, the image encoder $E_I$ remains frozen while $E_P$ and $D_M$ are updated. Optimization employs AdamW with learning rate $5 \times 10^{-4}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$, using a reduce-on-plateau scheduler (factor 0.9, patience 5), a batch size of 16, and up to 100 epochs. The core process follows:

```python
import random
import torch

# Optimizer/scheduler as reported: AdamW (lr 5e-4) with reduce-on-plateau
optimizer = torch.optim.AdamW(
    list(E_P.parameters()) + list(D_M.parameters()),  # E_I stays frozen
    lr=5e-4, betas=(0.9, 0.999), eps=1e-8,
)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.9, patience=5)

for epoch in range(100):
    for I, c, M_gt in train_loader:
        with torch.no_grad():
            F_I = E_I(I)                              # frozen image encoder
        t = random.randint(0, T - 1)                  # sample a diffusion timestep
        eps_t = torch.randn_like(F_I) / (t + 1)       # noise with sigma_t = 1/(t+1)
        F_t = F_I + c_expand(c) + eps_t               # forward diffusion
        P_s, P_d = E_P.reverse_diffusion(F_t, c)      # prompt encoding
        M_pred = D_M(F_I, P_p, P_s, P_d)              # mask prediction
        L1 = mse(P_s, MedSAM_s)                       # sparse-prompt MSE (MedSAM target)
        L2 = mse(P_d, MedSAM_d)                       # dense-prompt MSE (MedSAM target)
        L3 = dice(M_pred, M_gt)                       # Dice loss
        L4 = ce(M_pred, M_gt)                         # cross-entropy loss
        L5 = shape_dist(M_pred, M_gt)                 # shape-distance loss
        L = uncertainty_weighted(L1, L2, L3, L4, L5)  # Section 4 weighting
        optimizer.zero_grad()
        L.backward()
        optimizer.step()
    scheduler.step(L)                                 # plateau-based LR decay
```
This strategy enables efficient transfer of MedSAM’s pre-trained image representations while adapting the prompt and mask decoder modules to the fully automated, class-specific task (Huang et al., 5 Feb 2025).

6. Empirical Evaluation

AutoMedSAM is evaluated across diverse medical imaging datasets: AbdomenCT-1K (CT, 5 organs), BraTS (MR-FLAIR, tumor), Kvasir-SEG (endoscopy, polyp), Chest-XML (X-ray, lung), and in cross-dataset scenarios (AMOS, BraTS-T1CE). Performance is measured using Dice Similarity Coefficient (DSC) and Normalized Surface Distance (NSD).

Representative Quantitative Results (AbdomenCT-1K):

| Method | DSC (%) | NSD (%) |
| --- | --- | --- |
| MedSAM | 93.505 | 92.969 |
| SurgicalSAM | 75.505 | 70.119 |
| AutoMedSAM (O) | 94.580 | 95.148 |

On single-object datasets (BraTS, Kvasir, Chest-XML), AutoMedSAM outperforms all baselines by 1–5 points in DSC and NSD. Cross-dataset evaluation (train: AbdomenCT, test: AMOS) yields DSC 71.14% for AutoMedSAM vs. 56.93% for SurgicalSAM. Ablation studies demonstrate the benefits of dual-branch prompts, diffusion, and uncertainty weighting (Huang et al., 5 Feb 2025).

7. Strengths, Limitations, and Future Directions

AutoMedSAM delivers a fully automated, semantically labeled segmentation workflow, eliminating manual prompt annotation and enabling class-aware mask generation. The dual-branch diffusion encoder captures both global and local context, and uncertainty weighting harmonizes joint optimization. Nevertheless, computational overhead from diffusion steps is nontrivial, and current deployments require a predefined class index set, precluding open-vocabulary extension. There may be performance degradation on extremely small or highly noisy structures. Future work will target lightweight diffusion models, open-set recognition, and scaling to 3D volumetric data.

AutoMedSAM establishes a state-of-the-art, prompt-free, and semantically explicit segmentation paradigm for clinical and non-expert end users (Huang et al., 5 Feb 2025).
