Adapting a Segmentation Foundation Model for Medical Image Classification
The paper, "Adapting a Segmentation Foundation Model for Medical Image Classification," introduces an innovative framework to adapt the Segment Anything Model (SAM) for use in medical image classification, addressing an area that has been less explored compared to applications in segmentation tasks. The framework is developed to leverage SAM's capabilities, originally designed for image segmentation, to enhance the accuracy and efficiency of classification tasks in the medical domain.
The researchers build on the proven ability of SAM's image encoder to capture rich, segmentation-oriented features that convey important spatial and contextual details of an image. The proposed methodology freezes the weights of SAM's image encoder and uses it purely as a feature extractor, minimizing retraining overhead while preserving the knowledge acquired during pre-training. Segmentation-based features are thus obtained without significant additional computational burden, as sketched below.
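The snippet below is a minimal sketch of this frozen-encoder setup, assuming Meta's `segment_anything` package and a downloaded ViT-B checkpoint; the specific encoder variant, checkpoint, and preprocessing used in the paper may differ.

```python
# Minimal sketch: a frozen SAM image encoder used as a feature extractor.
# Assumption: Meta's `segment_anything` package with a ViT-B checkpoint file.
import torch
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
encoder = sam.image_encoder

# Freeze the encoder: no gradient updates during classifier training.
for p in encoder.parameters():
    p.requires_grad = False
encoder.eval()

@torch.no_grad()
def extract_sam_features(images: torch.Tensor) -> torch.Tensor:
    """images: (B, 3, 1024, 1024) preprocessed batch -> (B, 256, 64, 64) embeddings."""
    return encoder(images)
```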
To further enhance classification capabilities, the authors introduce a novel Spatially Localized Channel Attention (SLCA) mechanism. SLCA computes spatially localized attention weights for SAM’s extracted features, which are then integrated into deep learning classification models. This integration facilitates a focus on spatially meaningful regions of the image, thereby improving the classification performance of the models. These attention mechanisms are computationally efficient, offering minimal overhead relative to the performance gains achieved.
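The paper's exact SLCA formulation is not reproduced here; the sketch below shows one plausible reading in PyTorch, where per-location channel gates are predicted from the SAM feature map (via 1x1 convolutions) and the reweighted features are fused additively with a backbone classifier's feature map. The module names (`SLCABlock`, `SAMAssistedClassifier`), the reduction ratio, and the additive fusion are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SLCABlock(nn.Module):
    """Illustrative spatially localized channel attention: channel gates are
    predicted independently at each spatial location, so different channels
    can be emphasized in different image regions."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, sam_feats: torch.Tensor) -> torch.Tensor:
        # sam_feats: (B, C, H, W); per-pixel, per-channel weights in [0, 1]
        return sam_feats * self.gate(sam_feats)

class SAMAssistedClassifier(nn.Module):
    """Fuses attended SAM features with a backbone's feature map before pooling."""
    def __init__(self, backbone: nn.Module, backbone_channels: int,
                 sam_channels: int = 256, num_classes: int = 2):
        super().__init__()
        self.backbone = backbone          # e.g. ResNet stages returning a feature map
        self.slca = SLCABlock(sam_channels)
        self.project = nn.Conv2d(sam_channels, backbone_channels, kernel_size=1)
        self.head = nn.Linear(backbone_channels, num_classes)

    def forward(self, images: torch.Tensor, sam_feats: torch.Tensor) -> torch.Tensor:
        feats = self.backbone(images)                        # (B, Cb, h, w)
        sam_attended = self.project(self.slca(sam_feats))    # (B, Cb, H, W)
        sam_attended = F.adaptive_avg_pool2d(sam_attended, feats.shape[-2:])
        fused = feats + sam_attended                         # simple additive fusion (assumed)
        return self.head(fused.mean(dim=(2, 3)))
```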
The framework's effectiveness is validated through experiments on three public medical image classification datasets: RetinaMNIST, BreastMNIST, and ISIC 2017. Across several deep learning models, including CNN-based architectures (ResNet152, SENet154) and a transformer-based architecture (Swin Transformer V2), the approach demonstrated consistent improvements in accuracy and data efficiency. The gains are especially pronounced when only small fractions of the training data are used, highlighting the framework's potential in settings with scarce annotations, a common scenario in medical imaging.
Key empirical results show accuracy improvements of up to 5.75% on RetinaMNIST and 5.0% on ISIC 2017 when SAM-derived features are integrated. Moreover, in contrast to prior work such as SAMAug-C, the method exploits SAM's ability to extract meaningful spatial information directly, yielding measurable gains on classification tasks.
A series of ablation studies further elucidates the contributions of individual framework components, such as SLCA and the choice of feature extractor, confirming their roles in the performance gains. Adding SAM features directly to the classification models without SLCA degraded performance, underscoring the need for careful integration.
The theoretical and practical implications of this work are significant: it bridges a segmentation foundation model to classification tasks, potentially meeting the needs of medical imaging scenarios where high accuracy and the ability to discern subtle anatomical variations are critical. Future research could extend this SAM adaptation to other medical imaging domains and investigate its use in multi-modal medical data integration, further improving interpretability and predictive accuracy.
In conclusion, the paper provides evidence that segmentation foundation models such as SAM can be robustly adapted to medical classification tasks, allowing spatially rich features to boost downstream model performance even with limited annotations.