
Ladder Fine-tuning approach for SAM integrating complementary network (2306.12737v1)

Published 22 Jun 2023 in cs.CV

Abstract: Recently, foundation models have been introduced demonstrating various tasks in the field of computer vision. These models, such as the Segment Anything Model (SAM), are generalized models trained using huge datasets. Currently, ongoing research focuses on exploring the effective utilization of these generalized models for specific domains, such as medical imaging. However, in medical imaging, the lack of training samples due to privacy concerns and other factors presents a major challenge for applying these generalized models to medical image segmentation tasks. To address this issue, effective fine-tuning of these models is crucial to ensure their optimal utilization. In this study, we propose to combine a complementary Convolutional Neural Network (CNN) with the standard SAM network for medical image segmentation. To reduce the burden of fine-tuning the large foundation model and implement a cost-efficient training scheme, we focus only on fine-tuning the additional CNN network and the SAM decoder part. This strategy significantly reduces training time and achieves competitive results on a publicly available dataset. The code is available at https://github.com/11yxk/SAM-LST.

Analysis of the Ladder Fine-tuning Approach for the Segment Anything Model in Medical Image Segmentation

The paper under scrutiny introduces a novel approach to enhance the applicability of the Segment Anything Model (SAM) for domain-specific tasks, particularly in medical image segmentation. The proposed methodology employs a complementary Convolutional Neural Network (CNN) alongside the existing architecture of SAM to address challenges arising from limited training datasets inherent in the medical imaging domain.

Overview and Motivation

The paper identifies a significant challenge in applying generalized foundation models such as SAM to specific domains like medical imaging, where data privacy issues and limited annotated datasets often impede effective training. SAM, although powerful and versatile in computer vision tasks, does not inherently adapt well to the nuanced and varied characteristics of medical images. Consequently, fine-tuning these generalized models is essential to improve their performance for medical image segmentation tasks.

Methodology

The proposed Ladder Fine-Tuning approach is designed to mitigate the extensive computational demands and resource constraints typically associated with fine-tuning large foundation models. Instead of adapting the entire SAM architecture, the authors propose a hybrid approach:

  1. Integration of a CNN: A pre-trained ResNet18, modified to match the feature map dimensions of SAM's image encoder, is integrated as a complementary encoder that captures domain-specific features in medical images.
  2. Selective Fine-Tuning: Fine-tuning is restricted to SAM's decoder and the parameters of the added CNN component, while SAM's large image encoder remains frozen. This selective parameter update significantly reduces computational cost and training time.
  3. Learnable Gating Mechanism: A learnable parameter dynamically weights the contributions of SAM's and the CNN's features, allowing adaptive integration of insights from both networks.
  4. Loss Functions: A combination of Cross Entropy and Dice loss supervises the segmentation network for robustness and accuracy (a minimal code sketch of steps 2-4 follows this list).
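The PyTorch sketch below illustrates steps 2-4 under stated assumptions: `sam_encoder` and `sam_decoder` are stand-ins for SAM's frozen image encoder and its fine-tuned mask decoder, and the gate initialization, projection layer, and equal loss weighting are illustrative choices rather than the authors' exact implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision


class LadderSAM(nn.Module):
    """Frozen SAM-style encoder + trainable CNN side branch, fused by a learnable gate."""

    def __init__(self, sam_encoder: nn.Module, sam_decoder: nn.Module, feat_dim: int = 256):
        super().__init__()
        self.sam_encoder = sam_encoder
        for p in self.sam_encoder.parameters():  # keep the large encoder frozen
            p.requires_grad = False
        # Complementary CNN: ResNet18 trunk (avgpool/fc dropped), projected to SAM's width.
        resnet = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        self.cnn = nn.Sequential(*list(resnet.children())[:-2])  # -> (B, 512, h, w)
        self.proj = nn.Conv2d(512, feat_dim, kernel_size=1)
        self.gate = nn.Parameter(torch.zeros(1))  # learnable fusion weight
        self.sam_decoder = sam_decoder  # fine-tuned jointly with the CNN branch

    def forward(self, x):
        with torch.no_grad():  # no gradients flow through the frozen encoder
            sam_feat = self.sam_encoder(x)
        cnn_feat = self.proj(self.cnn(x))
        cnn_feat = F.interpolate(cnn_feat, size=sam_feat.shape[-2:],
                                 mode="bilinear", align_corners=False)
        g = torch.sigmoid(self.gate)  # squash the gate into (0, 1)
        fused = (1 - g) * sam_feat + g * cnn_feat  # adaptive feature integration
        return self.sam_decoder(fused)


def dice_loss(logits, target, eps=1e-6):
    """Multi-class soft Dice loss; target holds integer class indices."""
    probs = torch.softmax(logits, dim=1)
    target_1h = F.one_hot(target, probs.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * target_1h).sum(dim=(2, 3))
    union = probs.sum(dim=(2, 3)) + target_1h.sum(dim=(2, 3))
    return 1.0 - ((2.0 * inter + eps) / (union + eps)).mean()


# Only the side CNN, projection, gate, and decoder receive gradient updates:
# model = LadderSAM(sam_encoder, sam_decoder)
# optimizer = torch.optim.AdamW(
#     [p for p in model.parameters() if p.requires_grad], lr=1e-4)
# loss = F.cross_entropy(logits, target) + dice_loss(logits, target)
```

Initializing the gate at zero starts the fusion at an even split (sigmoid(0) = 0.5), letting training shift weight toward whichever branch proves more informative for the domain.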

Experimental Evaluation

The method was evaluated on the multi-organ Synapse dataset, achieving a Dice score of 79.45% and an HD95 of 35.35 mm, demonstrating competitive performance relative to state-of-the-art methods. Notably, training time was reduced by 30-40% compared to conventional SAM fine-tuning strategies, emphasizing the approach's efficiency in resource utilization.
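For reference, both reported metrics have standard definitions (stated here generically, not quoted from the paper): the Dice similarity coefficient measures volumetric overlap, and HD95 is a percentile-robust variant of the Hausdorff surface distance, where lower is better.

```latex
% Dice similarity coefficient between prediction P and ground truth G:
\mathrm{DSC}(P, G) = \frac{2\,\lvert P \cap G \rvert}{\lvert P \rvert + \lvert G \rvert}
% HD95: 95th percentile of symmetric surface distances between the
% boundaries \partial P and \partial G (a robust Hausdorff distance):
\mathrm{HD95}(P, G) = \max\bigl( d_{95}(P, G),\, d_{95}(G, P) \bigr), \qquad
d_{95}(A, B) = \operatorname{perc}_{95}\bigl\{ \min_{b \in \partial B} \lVert a - b \rVert : a \in \partial A \bigr\}
```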

Implications and Future Directions

The findings mark a meaningful step in adapting foundation models to specific domains. Practically, this approach reduces the computational burden and resources needed for training, making it a viable option for medical imaging applications where annotated data and computational power may be limited.

Theoretically, integrating lightweight networks like CNNs with large models such as SAM could catalyze further research into modular training techniques, not only for medical imaging but also for other data-restricted domains.

Future developments could explore complementary networks other than ResNet18, including transformer-based designs. Such explorations may yield even higher performance, further enhancing the adaptability and utility of foundation models in specialized tasks.

Conclusion

The paper offers a well-founded contribution to the field of medical image segmentation through its Ladder Fine-Tuning approach, significantly advancing the operational effectiveness of SAM in domain-specific contexts. This work underscores the importance of tailored model adaptation strategies in maximizing the utility of foundation models across diverse applications.

Authors (7)
  1. Shurong Chai
  2. Rahul Kumar Jain
  3. Shiyu Teng
  4. Jiaqing Liu
  5. Yinhao Li
  6. Tomoko Tateyama
  7. Yen-Wei Chen