Med-UniC: Unifying Cross-Lingual Medical Vision-Language Pre-Training by Diminishing Bias (2305.19894v3)
Abstract: The scarcity of data presents a critical obstacle to the efficacy of medical vision-language pre-training (VLP). A potential solution lies in combining datasets from different language communities. Nevertheless, the main challenge stems from the complexity of integrating diverse syntax and semantics, language-specific medical terminology, and culture-specific implicit knowledge. Therefore, one crucial aspect to consider is the community bias introduced by different languages. This paper presents a novel framework named Unifying Cross-Lingual Medical Vision-Language Pre-Training (Med-UniC), designed to integrate multimodal medical data from the two most prevalent languages, English and Spanish. Specifically, we propose Cross-lingual Text Alignment Regularization (CTR) to explicitly unify cross-lingual semantic representations of medical reports originating from diverse language communities. CTR is optimized through latent language disentanglement, so the optimization objective does not depend on negative samples, thereby significantly mitigating the bias introduced by selecting positive-negative sample pairs among highly similar medical reports. Furthermore, it ensures that the cross-lingual representation is not biased toward any specific language community. Med-UniC achieves superior performance across 5 medical image tasks and 10 datasets encompassing over 30 diseases, offering a versatile framework for unifying multimodal medical data from diverse linguistic communities. The experimental outcomes highlight the presence of community bias in cross-lingual VLP: reducing this bias improves performance not only on vision-language tasks but also on uni-modal visual tasks.
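To make the negative-sample-free alignment idea concrete, below is a minimal PyTorch sketch of a cross-lingual text alignment regularizer in the spirit of CTR, assuming a Barlow Twins-style redundancy-reduction objective over paired English/Spanish report embeddings. The function name `ctr_loss`, the weight `lambd`, and the exact formulation are illustrative assumptions, not the authors' released implementation.

```python
import torch

def ctr_loss(z_en: torch.Tensor, z_es: torch.Tensor, lambd: float = 5e-3) -> torch.Tensor:
    """Hypothetical negative-sample-free cross-lingual alignment regularizer.

    z_en, z_es: (batch, dim) sentence embeddings of paired English and Spanish
    reports. The loss drives the cross-correlation matrix of the two language
    views toward the identity: diagonal terms align paired reports across
    languages, off-diagonal terms decorrelate (disentangle) the remaining
    latent dimensions, so no negative pairs are required.
    """
    n = z_en.size(0)
    # Standardize each embedding dimension across the batch.
    z_en = (z_en - z_en.mean(dim=0)) / (z_en.std(dim=0) + 1e-6)
    z_es = (z_es - z_es.mean(dim=0)) / (z_es.std(dim=0) + 1e-6)
    # Cross-correlation matrix between the English and Spanish views.
    c = (z_en.T @ z_es) / n                                      # (dim, dim)
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()               # alignment term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # disentanglement term
    return on_diag + lambd * off_diag
```

In a training loop, such a regularizer would be added to the usual image-text objective, e.g. `loss = vlp_loss + alpha * ctr_loss(text_encoder(en_reports), text_encoder(es_reports))`, where `alpha` is a hypothetical weighting hyper-parameter.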
Authors: Zhongwei Wan, Che Liu, Mi Zhang, Jie Fu, Benyou Wang, Sibo Cheng, Lei Ma, César Quilodrán-Casas, Rossella Arcucci