MISS: A Generative Pretraining and Finetuning Approach for Med-VQA (2401.05163v3)

Published 10 Jan 2024 in cs.CV and cs.AI

Abstract: Medical visual question answering (VQA) is a challenging multimodal task in which Vision-Language Pre-training (VLP) models can effectively improve generalization performance. However, most methods in the medical field treat VQA as an answer-classification task, which is difficult to transfer to practical application scenarios. Additionally, because of the privacy constraints on medical images and the expensive annotation process, large-scale medical image-text pair datasets for pretraining are severely lacking. In this paper, we propose a large-scale MultI-task Self-Supervised learning-based framework (MISS) for medical VQA tasks. Unlike existing methods, we treat medical VQA as a generative task. We unify the text encoder and multimodal encoder and align image-text features through multi-task learning. Furthermore, we propose a Transfer-and-Caption method that extends the feature space of single-modal image datasets using LLMs, enabling data from traditional medical vision tasks to be used for VLP. Experiments show that our method achieves excellent results with fewer multimodal datasets and demonstrates the advantages of generative VQA models.
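
The abstract's central design choice, treating Med-VQA as answer generation rather than classification over a fixed answer set, can be illustrated with a small sketch. The code below is a minimal, hypothetical PyTorch illustration rather than the authors' MISS implementation: the class name GenerativeMedVQA, the reuse of one TransformerDecoder stack as the unified text/multimodal encoder, and the assumption of pre-extracted image patch features are all simplifications introduced here for clarity.

```python
import torch
import torch.nn as nn


class GenerativeMedVQA(nn.Module):
    """Toy generative VQA model: answers are generated token by token
    instead of being picked from a fixed answer set. Illustrative only."""

    def __init__(self, vocab_size=30522, d_model=256, n_heads=8, n_layers=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        # Stand-in for a vision encoder: projects pre-extracted patch features.
        self.vision_proj = nn.Linear(d_model, d_model)
        # Shared transformer stack standing in for the unified text/multimodal
        # encoder: self-attention over text plus cross-attention to image features.
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.unified = nn.TransformerDecoder(layer, n_layers)
        # Language-model head over the vocabulary (generative objective).
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, question_ids, image_patches, answer_ids):
        q = self.token_emb(question_ids)        # (B, Lq, D)
        v = self.vision_proj(image_patches)     # (B, Np, D)
        a = self.token_emb(answer_ids)          # (B, La, D)
        # Fuse question tokens with image evidence via cross-attention.
        fused = self.unified(tgt=q, memory=v)   # (B, Lq, D)
        # Condition answer tokens on the fused multimodal state
        # (teacher forcing with a causal mask during training).
        mask = nn.Transformer.generate_square_subsequent_mask(a.size(1))
        dec = self.unified(tgt=a, memory=fused, tgt_mask=mask)
        return self.lm_head(dec)                # (B, La, vocab_size)


# Usage with dummy tensors (patch features assumed already projected to d_model dims):
model = GenerativeMedVQA()
question = torch.randint(0, 30522, (2, 12))
patches = torch.randn(2, 49, 256)
answer = torch.randint(0, 30522, (2, 6))
logits = model(question, patches, answer)       # train with shifted cross-entropy
```

In a generative setup like this, training minimizes token-level cross-entropy against the shifted answer sequence, and inference decodes the answer token by token, which is what allows free-form answers outside any predefined label set, in contrast to the classification-style Med-VQA methods the abstract criticizes.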

Authors (5)
  1. Jiawei Chen (160 papers)
  2. Dingkang Yang (57 papers)
  3. Yue Jiang (104 papers)
  4. Yuxuan Lei (12 papers)
  5. Lihua Zhang (68 papers)
Citations (10)