Diffusion-empowered AutoPrompt MedSAM

Published 5 Feb 2025 in eess.IV, cs.GR, and cs.LG | (2502.06817v2)

Abstract: MedSAM, a medical foundation model derived from the SAM architecture, has demonstrated notable success across diverse medical domains. However, its clinical application faces two major challenges: the dependency on labor-intensive manual prompt generation, which imposes a significant burden on clinicians, and the absence of semantic labeling in the generated segmentation masks for organs or lesions, limiting its practicality for non-expert users. To address these limitations, we propose AutoMedSAM, an end-to-end framework derived from SAM, designed to enhance usability and segmentation performance. AutoMedSAM retains MedSAM's image encoder and mask decoder structure while introducing a novel diffusion-based class prompt encoder. The diffusion-based encoder employs a dual-decoder structure to collaboratively generate prompt embeddings guided by sparse and dense prompt definitions. These embeddings enhance the model's ability to understand and process clinical imagery autonomously. With this encoder, AutoMedSAM leverages class prompts to embed semantic information into the model's predictions, transforming MedSAM's semi-automated pipeline into a fully automated workflow. Furthermore, AutoMedSAM employs an uncertainty-aware joint optimization strategy during training to effectively inherit MedSAM's pre-trained knowledge while improving generalization by integrating multiple loss functions. Experimental results across diverse datasets demonstrate that AutoMedSAM achieves superior performance while broadening its applicability to both clinical settings and non-expert users. Code is available at https://github.com/HP-ML/AutoPromptMedSAM.git.


Authors (8)

Summary

The paper introduces a diffusion-based class prompt encoder that eliminates manual prompt generation for efficient, autonomous segmentation.
It employs an uncertainty-aware joint optimization with multiple loss functions to enhance generalization across various medical modalities.
Experimental results on datasets like AbdomenCT-1K, BraTS, and Kvasir-SEG reveal superior DSC and NSD scores compared to traditional SAM-based methods.


AutoMedSAM targets two weaknesses of medical image segmentation models built on the Segment Anything Model (SAM) framework. It removes the need for labor-intensive manual prompts, and it attaches semantic labels to the predicted masks, which widens its applicability in clinical environments.

Introduction and Challenges

Traditional SAM-based models rely heavily on manual prompt generation, demanding significant effort from medical experts and often limiting their deployment in real-world clinical settings. Furthermore, the lack of semantic labeling in output segmentation masks restricts their utility for non-experts, who may struggle to interpret raw binary masks without clear contextual labels (Figure 1).

Figure 1: Comparison with SAM-based models. (Left) The original SAM model relies on manual prompts from medical experts, restricting its usability and scenarios. (Middle) Current SAM-based methods employ specialist models for prompt generation, but these models are organ- or lesion-specific, limiting SAM's generalizability. (Right) Our method introduces an automatic diffusion-based class prompt encoder, removing the need for explicit prompts, adding semantic labels to masks, and enabling accurate, end-to-end segmentation for non-experts in diverse medical contexts.

Technical Approach

AutoMedSAM introduces a diffusion-based class prompt encoder, which replaces manual interactions with automated, end-to-end segmentation capabilities (Figure 2). The encoder generates prompt embeddings guided by class indices, integrating semantic structure and enhancing the prediction pipeline with meaningful knowledge. This approach eliminates the need for manual prompts and facilitates a streamlined, fully autonomous segmentation process.

Figure 2: An overview of AutoMedSAM. AutoMedSAM generates dense and sparse prompt embeddings through a diffusion-based class prompt encoder, eliminating the need for explicit prompts. During training, we employ an uncertainty-aware joint optimization strategy with multiple loss functions for supervision, while transferring MedSAM's pre-trained knowledge to AutoMedSAM. This approach improves training efficiency and generalization. With end-to-end inference, AutoMedSAM overcomes SAM's limitations, enhancing usability and expanding its application scope and user base.

Diffusion-Based Class Prompt Encoder

The diffusion mechanism is central to AutoMedSAM's design. In a forward conditional generation process, the class prompt is projected and interwoven with the noise generation steps, which lets the encoder focus on class-specific features (Figure 3). The encoder is followed by two independent decoder branches that extract local and global features and produce the sparse and dense prompt embeddings the mask decoder needs for accurate outputs.

Figure 3: Structure of the diffusion-based class prompt encoder. It is designed with an encoder and two independent decoder branches to extract local and global features, based on the practical significance of sparse and dense prompts.
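To make the data flow concrete, here is a shape-level toy sketch of a class-conditioned encoder with one shared encoder and two independent decoder branches. It does not reproduce the paper's diffusion process or any learned weights: every function name, the random class projection, and the way noise is mixed in are assumptions made for illustration. The only point is that a class index plus an image embedding yields a token-shaped (sparse) output and a map-shaped (dense) output, with no box or point prompt involved.

```python
import math
import random


def class_projection(class_idx, channels, seed=0):
    """Hypothetical stand-in for a learned class embedding (one value per channel)."""
    rng = random.Random(seed * 1000 + class_idx)
    return [rng.uniform(-1.0, 1.0) for _ in range(channels)]


def condition_on_class(feat, class_vec, noise_scale, rng):
    """Toy conditioning: add the class vector and Gaussian noise to a (C, H, W) map."""
    return [[[v + class_vec[c] + noise_scale * rng.gauss(0.0, 1.0) for v in row]
             for row in chan]
            for c, chan in enumerate(feat)]


def shared_encoder(feat):
    """Toy shared encoder: an element-wise ReLU that keeps the (C, H, W) shape."""
    return [[[max(0.0, v) for v in row] for row in chan] for chan in feat]


def sparse_branch(latent, num_tokens):
    """First decoder branch: pool each channel, emit (num_tokens, C) token embeddings."""
    pooled = [sum(map(sum, chan)) / (len(chan) * len(chan[0])) for chan in latent]
    return [[p * (k + 1) for p in pooled] for k in range(num_tokens)]


def dense_branch(latent):
    """Second decoder branch: keep the spatial layout, emit a (C, H, W) embedding map."""
    return [[[math.tanh(v) for v in row] for row in chan] for chan in latent]


def class_prompt_encoder(image_embedding, class_idx, num_tokens=2,
                         noise_scale=0.1, seed=0):
    """Map (image embedding, class index) to (sparse, dense) prompt embeddings."""
    rng = random.Random(seed)
    class_vec = class_projection(class_idx, len(image_embedding))
    noisy = condition_on_class(image_embedding, class_vec, noise_scale, rng)
    latent = shared_encoder(noisy)
    return sparse_branch(latent, num_tokens), dense_branch(latent)


# A 3-channel, 4x4 dummy image embedding and an arbitrary class index.
embedding = [[[1.0] * 4 for _ in range(4)] for _ in range(3)]
sparse, dense = class_prompt_encoder(embedding, class_idx=2)
print(len(sparse), len(sparse[0]))                 # tokens x channels
print(len(dense), len(dense[0]), len(dense[0][0]))  # channels x height x width
```

In a real model both branches would be trained networks and their outputs would be passed to the mask decoder in place of embeddings derived from manual prompts.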

Uncertainty-Aware Joint Optimization

During training, AutoMedSAM employs an uncertainty-aware joint optimization strategy. It combines several loss functions (pixel-based, region-based, and distribution-based) to improve generalization, and it uses them to transfer MedSAM's pre-trained knowledge to the new encoder. Together, these choices help AutoMedSAM adapt across different medical image modalities.
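The summary does not give the weighting formula. As an assumption, the sketch below uses one widely used uncertainty-weighting scheme, in which each loss term L_i gets a learnable log-variance s_i = log(sigma_i^2) and the total is the sum of 0.5 * exp(-s_i) * L_i + 0.5 * s_i. The paper's exact formulation may differ, and the loss values are made-up numbers.

```python
import math


def uncertainty_weighted_total(losses, log_vars):
    """Combine loss terms using per-term log-variances s_i = log(sigma_i^2).

    total = sum_i ( 0.5 * exp(-s_i) * L_i + 0.5 * s_i )
    A large s_i down-weights a noisy loss term; the + 0.5 * s_i penalty
    stops the optimizer from inflating every s_i to ignore all losses.
    """
    if len(losses) != len(log_vars):
        raise ValueError("need exactly one log-variance per loss term")
    return sum(0.5 * math.exp(-s) * loss + 0.5 * s
               for loss, s in zip(losses, log_vars))


def optimal_log_var(loss):
    """For a fixed positive loss value, the term is minimized at s = log(loss)."""
    return math.log(loss)


# Illustrative values for a pixel-based, a region-based, and a
# distribution-based loss (hypothetical numbers, not from the paper).
losses = [0.40, 0.25, 0.10]

equal = uncertainty_weighted_total(losses, [0.0, 0.0, 0.0])
tuned = uncertainty_weighted_total(losses, [optimal_log_var(l) for l in losses])
print(equal, tuned)
```

In training, the log-variances would be trainable parameters updated by the same optimizer as the network weights, so the balance between loss terms is learned instead of hand-tuned.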

Experimental Evaluations

AutoMedSAM outperforms the compared models on datasets such as AbdomenCT-1K, BraTS, and Kvasir-SEG, achieving higher Dice Similarity Coefficient (DSC) and Normalized Surface Distance (NSD) scores. Its advantage is most visible where other models struggle because they need explicit prompts or generalize poorly across medical contexts (Figures 4 and 5).
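For readers unfamiliar with the two metrics, the following is a minimal 2D sketch. DSC is computed as 2|P ∩ G| / (|P| + |G|). The NSD here is a simplified pixel-based variant, assuming 4-connected boundaries and Euclidean distance in pixels; evaluation toolkits usually work in physical units and may define surfaces differently, so treat it as an illustration and not as the paper's evaluation code.

```python
import math


def dice(pred, gt):
    """DSC = 2|P ∩ G| / (|P| + |G|) for binary masks given as nested 0/1 lists."""
    inter = sum(p * g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    total = sum(map(sum, pred)) + sum(map(sum, gt))
    return 1.0 if total == 0 else 2.0 * inter / total


def boundary(mask):
    """Foreground pixels that touch background or the image edge (4-neighbourhood)."""
    h, w = len(mask), len(mask[0])
    pts = []
    for r in range(h):
        for c in range(w):
            if not mask[r][c]:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if not (0 <= rr < h and 0 <= cc < w) or not mask[rr][cc]:
                    pts.append((r, c))
                    break
    return pts


def nsd(pred, gt, tol=1.0):
    """Fraction of boundary pixels lying within `tol` pixels of the other boundary."""
    bp, bg = boundary(pred), boundary(gt)
    if not bp and not bg:
        return 1.0
    if not bp or not bg:
        return 0.0

    def within(src, dst):
        return sum(1 for p in src if min(math.dist(p, q) for q in dst) <= tol)

    return (within(bp, bg) + within(bg, bp)) / (len(bp) + len(bg))


def square(top, left, size=4, n=6):
    """Helper: n x n mask containing one size x size foreground square."""
    return [[1 if top <= r < top + size and left <= c < left + size else 0
             for c in range(n)] for r in range(n)]


gt = square(1, 1)
pred = square(1, 2)  # same square shifted one pixel to the right
print(dice(pred, gt), nsd(pred, gt, tol=1.0), nsd(pred, gt, tol=0.0))
```

The shifted square shows why the two metrics are reported together: DSC measures region overlap, while NSD asks whether the predicted boundary stays within a tolerance of the reference boundary.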

Figure 4: The qualitative results of AutoMedSAM and other comparison models on AbdomenCT-1K. The bounding box represents the input prompt.

Figure 5: The qualitative analysis results of AutoMedSAM and other comparison models on BraTS, Kvasir-SEG, and Chest-XML.

Performance with Varying Prompt Box Sizes

The study reveals that segmentation accuracy correlates with prompt box precision, highlighting how explicit prompt errors can hinder model performance. AutoMedSAM's approach circumvents these limitations by embedding semantic class information directly, thus reducing dependency on exact prompt sizing (Figure 6).

Figure 6: The effect of different sized prompt boxes on segmentation masks. The marked numbers indicate the offset pixel size of the prompt box. IB represents the image boundary.
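The box-size experiment can be pictured with a small sketch: take the tight bounding box of a mask, enlarge it by a pixel offset on every side, and clip it to the image, so that a large enough offset yields the image boundary (IB). The helper names and the (x_min, y_min, x_max, y_max) convention are assumptions for illustration; how the paper actually generated its offset boxes is not described in this summary.

```python
def tight_box(mask):
    """Smallest (x_min, y_min, x_max, y_max) box enclosing the mask foreground."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask has no foreground")
    return cols[0], rows[0], cols[-1], rows[-1]


def offset_box(box, offset, width, height):
    """Enlarge a box by a non-negative `offset` in pixels per side, clipped to the image."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    x0, y0, x1, y1 = box
    return (max(0, x0 - offset), max(0, y0 - offset),
            min(width - 1, x1 + offset), min(height - 1, y1 + offset))


# A 6x6 mask with a 2x2 object in the middle.
mask = [[1 if 2 <= r <= 3 and 2 <= c <= 3 else 0 for c in range(6)]
        for r in range(6)]
box = tight_box(mask)
for off in (0, 1, 10):
    print(off, offset_box(box, off, width=6, height=6))
```

Feeding progressively looser boxes like these to a box-prompted model is one way to probe how sensitive it is to prompt precision, which is the dependency a class-prompted model avoids.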

Conclusion

AutoMedSAM represents a significant advance in medical image segmentation technology, offering a more efficient and adaptable framework than traditional SAM-based models. By automating prompt generation and integrating semantic knowledge, this model extends its utility to non-expert users and enhances its applicability in diverse clinical settings. Future work could focus on scaling this approach across larger datasets to further improve its robustness and practical adoption in healthcare systems.
