Medical Vision Language Pretraining: A survey (2312.06224v1)

Published 11 Dec 2023 in cs.CV and cs.CL

Abstract: Medical Vision Language Pretraining (VLP) has recently emerged as a promising solution to the scarcity of labeled data in the medical domain. By leveraging paired and unpaired vision and text datasets through self-supervised learning, models can be trained to acquire vast knowledge and learn robust feature representations. Such pretrained models have the potential to enhance multiple downstream medical tasks simultaneously, reducing the dependency on labeled data. However, despite this recent progress and potential, no comprehensive survey has yet explored the various aspects of and advancements in medical VLP. In this paper, we review existing works through the lens of different pretraining objectives, architectures, downstream evaluation tasks, and datasets utilized for pretraining and downstream tasks. We then delve into current challenges in medical VLP, discuss existing and potential solutions, and conclude by highlighting future directions. To the best of our knowledge, this is the first survey focused on medical VLP.
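
As a concrete illustration of the self-supervised image-text pretraining described in the abstract, the sketch below implements a generic CLIP-style symmetric contrastive (InfoNCE) objective over paired image and report embeddings. This is a minimal, hypothetical example for orientation only: the random features standing in for encoder outputs, the embedding dimension, and the temperature value are placeholder assumptions, not the specific method of any work covered by the survey.

```python
# Minimal sketch of a CLIP-style image-text contrastive objective,
# the kind of pretraining loss many medical VLP methods build on.
# Encoders and hyperparameters here are illustrative placeholders.
import torch
import torch.nn.functional as F


def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
    # Normalize embeddings so the dot product is cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j) / T.
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image is paired with the i-th report/caption in the batch.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


if __name__ == "__main__":
    # Random features stand in for outputs of an image encoder (e.g. a ViT)
    # and a text encoder (e.g. a BERT-style model over radiology reports).
    batch, dim = 8, 512
    image_features = torch.randn(batch, dim)
    text_features = torch.randn(batch, dim)
    print(contrastive_loss(image_features, text_features).item())
```

In practice, the surveyed methods vary the encoders, the granularity of alignment (global vs. local), and the pretraining objective itself, but this symmetric contrastive formulation is a common starting point.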

Authors (5)
  1. Prashant Shrestha (6 papers)
  2. Sanskar Amgain (5 papers)
  3. Bidur Khanal (11 papers)
  4. Cristian A. Linte (17 papers)
  5. Binod Bhattarai (60 papers)
Citations (12)