
LaPA: Latent Prompt Assist Model For Medical Visual Question Answering (2404.13039v1)

Published 19 Apr 2024 in cs.CV and cs.CL

Abstract: Medical visual question answering (Med-VQA) aims to automate the prediction of correct answers for medical images and questions, thereby assisting physicians by reducing repetitive tasks and alleviating their workload. Existing approaches primarily focus on pre-training models on additional, comprehensive datasets, followed by fine-tuning to enhance performance on downstream tasks. However, there is also significant value in exploring existing models to extract clinically relevant information. In this paper, we propose the Latent Prompt Assist model (LaPA) for medical visual question answering. First, we design a latent prompt generation module to generate the latent prompt under the constraint of the target answer. We then propose a multi-modal fusion block with a latent prompt fusion module that uses the latent prompt to extract clinically relevant information from uni-modal and multi-modal features. Additionally, we introduce a prior knowledge fusion module to integrate the relationship between diseases and organs with the clinically relevant information. Finally, we combine the integrated information with image-language cross-modal information to predict the final answers. Experimental results on three publicly available Med-VQA datasets demonstrate that LaPA outperforms the state-of-the-art model ARL, achieving improvements of 1.83%, 0.63%, and 1.80% on VQA-RAD, SLAKE, and VQA-2019, respectively. The code is publicly available at https://github.com/GaryGuTC/LaPA_model.
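The core idea in the abstract — a set of learnable latent prompt tokens that attend over uni-modal and multi-modal features to pull out clinically relevant information — can be sketched with simple cross-attention. This is an illustrative toy in numpy, not the paper's actual architecture: the function name `latent_prompt_fusion`, the token counts, and the single-head dot-product attention are all assumptions for the sketch; the real model uses trained modules and an answer-constrained prompt generator.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def latent_prompt_fusion(latent_prompt, features):
    """Latent prompt tokens query feature tokens via scaled dot-product
    cross-attention; the attended summary is added back residually."""
    d = latent_prompt.shape[-1]
    attn = softmax(latent_prompt @ features.T / np.sqrt(d))  # (P, N)
    return latent_prompt + attn @ features                   # (P, d)

rng = np.random.default_rng(0)
d = 8
prompt = rng.standard_normal((4, d))      # 4 latent prompt tokens
img_feats = rng.standard_normal((16, d))  # image (uni-modal) tokens
txt_feats = rng.standard_normal((10, d))  # question (uni-modal) tokens

# Fuse the prompt with each uni-modal stream, then with the joint stream
p = latent_prompt_fusion(prompt, img_feats)
p = latent_prompt_fusion(p, txt_feats)
p = latent_prompt_fusion(p, np.concatenate([img_feats, txt_feats], axis=0))
print(p.shape)  # the prompt keeps its shape: (4, 8)
```

The fused prompt could then be combined with prior-knowledge features and cross-modal features before answer prediction, as the abstract describes.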

Authors (4)
  1. Tiancheng Gu
  2. Kaicheng Yang
  3. Dongnan Liu
  4. Weidong Cai
Citations (1)