Enhancing Representation in Medical Vision-Language Foundation Models via Multi-Scale Information Extraction Techniques (2401.01583v2)
Abstract: The development of medical vision-language foundation models has attracted significant attention in medicine and healthcare owing to their promise across a range of clinical applications. However, previous studies have typically focused on feature learning at a single scale, leaving the integration of multi-scale information largely unexplored, which may limit the potential for mutual reinforcement among features learned at different scales. This paper bridges that gap by proposing a method that effectively exploits multi-scale information to enhance the performance of medical foundation models. The proposed method simultaneously learns features at the local, instance, modality, and global levels, enabling comprehensive representation learning within the models. We evaluate the method on six open-source datasets spanning different clinical tasks and demonstrate that it improves the performance of medical foundation models.
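To make the four-level idea concrete, below is a minimal PyTorch sketch of how such a multi-scale training objective could be assembled. Everything here is an assumption for illustration rather than the paper's actual formulation: the component losses are standard stand-ins (a CLIP-style InfoNCE for the instance level, word-to-patch attention alignment for the local level, within-modality masked-reconstruction terms for the modality level, and batch-statistics matching for the global level), and the function names, loss weights, and `outputs` dictionary keys are all hypothetical.

```python
import torch
import torch.nn.functional as F


def info_nce(img_emb, txt_emb, temperature=0.07):
    """Instance level (stand-in): CLIP-style image-report contrastive loss."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature           # (B, B) similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


def local_alignment(patch_tokens, word_tokens, temperature=0.07):
    """Local level (stand-in): pull each word toward its attended patches."""
    p = F.normalize(patch_tokens, dim=-1)                  # (B, Np, D)
    w = F.normalize(word_tokens, dim=-1)                   # (B, Nw, D)
    attn = (torch.bmm(w, p.transpose(1, 2)) / temperature).softmax(dim=-1)
    attended = torch.bmm(attn, p)                          # (B, Nw, D)
    return (1.0 - F.cosine_similarity(w, attended, dim=-1)).mean()


def global_alignment(img_emb, txt_emb):
    """Global level (assumption): match batch-level statistics of the two
    embedding distributions as a crude corpus-level alignment signal."""
    return F.mse_loss(F.normalize(img_emb, dim=-1).mean(dim=0),
                      F.normalize(txt_emb, dim=-1).mean(dim=0))


def multi_scale_loss(outputs, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum over the four levels; `outputs` is a dict a hypothetical
    dual-encoder forward pass would return."""
    l_local = local_alignment(outputs["patch_tokens"], outputs["word_tokens"])
    l_inst = info_nce(outputs["img_global"], outputs["txt_global"])
    # Modality level: within-modality self-supervision (e.g., masked image
    # and masked language modeling), computed inside the model and passed in.
    l_mod = outputs["img_recon_loss"] + outputs["txt_recon_loss"]
    l_glob = global_alignment(outputs["img_global"], outputs["txt_global"])
    w = weights
    return w[0] * l_local + w[1] * l_inst + w[2] * l_mod + w[3] * l_glob


# Toy usage with random tensors standing in for encoder outputs.
B, Np, Nw, D = 8, 49, 16, 256
outputs = {
    "patch_tokens": torch.randn(B, Np, D),
    "word_tokens": torch.randn(B, Nw, D),
    "img_global": torch.randn(B, D),
    "txt_global": torch.randn(B, D),
    "img_recon_loss": torch.tensor(0.5),
    "txt_recon_loss": torch.tensor(0.7),
}
print(multi_scale_loss(outputs))
```

The design point the sketch illustrates is that the four objectives share the same encoder outputs and are combined as a weighted sum, so gradients from each scale can reinforce the others during a single backward pass; how the actual paper weights and instantiates each term may well differ.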