Unified Multi-modal Diagnostic Framework with Reconstruction Pre-training and Heterogeneity-combat Tuning (2404.06057v1)
Abstract: Medical multi-modal pre-training has shown promise in computer-aided diagnosis by leveraging large-scale unlabeled datasets. However, existing methods based on masked autoencoders mainly rely on data-level reconstruction tasks and lack high-level semantic information. Furthermore, two significant heterogeneity challenges hinder the transfer of pre-trained knowledge to downstream tasks, i.e., the distribution heterogeneity between pre-training data and downstream data, and the modality heterogeneity within downstream data. To address these challenges, we propose a Unified Medical Multi-modal Diagnostic (UMD) framework with tailored pre-training and downstream tuning strategies. Specifically, to enhance the representation abilities of the vision and language encoders, we propose the Multi-level Reconstruction Pre-training (MR-Pretrain) strategy, which combines feature-level and data-level reconstruction and guides the model to capture semantic information from masked inputs of different modalities. Moreover, to tackle the two kinds of heterogeneity during downstream tuning, we present a heterogeneity-combat downstream tuning strategy consisting of Task-oriented Distribution Calibration (TD-Calib) and Gradient-guided Modality Coordination (GM-Coord). In particular, TD-Calib fine-tunes the pre-trained model with respect to the distribution of the downstream dataset, and GM-Coord adjusts the gradient weights according to the dynamic optimization status of different modalities. Extensive experiments on five public medical datasets demonstrate the effectiveness of our UMD framework, which remarkably outperforms existing approaches on three kinds of downstream tasks.
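To make the gradient-weighting idea behind GM-Coord concrete, the sketch below re-scales the per-branch gradients of a toy two-branch vision-language classifier according to how the two modalities' losses currently compare, so the lagging modality is not dominated during tuning. The module names, feature dimensions, and the loss-ratio weighting heuristic are assumptions made for illustration only; they are not the paper's exact formulation.

```python
import torch
import torch.nn as nn

# Minimal sketch of gradient-guided modality coordination for a toy
# two-branch (vision + language) classifier. Hypothetical setup; the
# weighting rule is an illustrative heuristic, not the UMD algorithm.

class ToyFusionClassifier(nn.Module):
    def __init__(self, img_dim=512, txt_dim=256, num_classes=4):
        super().__init__()
        self.img_head = nn.Linear(img_dim, num_classes)  # vision branch
        self.txt_head = nn.Linear(txt_dim, num_classes)  # language branch

    def forward(self, img_feat, txt_feat):
        logits_v = self.img_head(img_feat)
        logits_t = self.txt_head(txt_feat)
        return logits_v, logits_t, logits_v + logits_t   # simple late fusion


def modality_weights(loss_v, loss_t, eps=1e-8):
    """Heuristic: the modality that is currently ahead (lower loss) gets its
    gradients down-weighted so the slower modality can catch up."""
    w_v = min(1.0, loss_v.item() / (loss_t.item() + eps))
    w_t = min(1.0, loss_t.item() / (loss_v.item() + eps))
    return w_v, w_t


model = ToyFusionClassifier()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One synthetic training step with random features and labels.
img_feat = torch.randn(8, 512)
txt_feat = torch.randn(8, 256)
labels = torch.randint(0, 4, (8,))

logits_v, logits_t, logits_fused = model(img_feat, txt_feat)
loss_v = criterion(logits_v, labels)      # per-modality losses track
loss_t = criterion(logits_t, labels)      # each branch's optimization status
loss_fused = criterion(logits_fused, labels)

optimizer.zero_grad()
loss_fused.backward()

# Re-scale each branch's gradients by its dynamic modality weight.
w_v, w_t = modality_weights(loss_v, loss_t)
for p in model.img_head.parameters():
    if p.grad is not None:
        p.grad.mul_(w_v)
for p in model.txt_head.parameters():
    if p.grad is not None:
        p.grad.mul_(w_t)

optimizer.step()
```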
Authors: Yupei Zhang, Li Pan, Qiushi Yang, Tan Li, Zhen Chen