Fusion of Domain-Adapted Vision and Language Models for Medical Visual Question Answering (2404.16192v1)

Published 24 Apr 2024 in cs.CL and cs.CV

Abstract: Vision-language models, while effective in general domains and showing strong performance in diverse multi-modal applications like visual question answering (VQA), struggle to maintain the same level of effectiveness in more specialized domains, e.g., medical. We propose a medical vision-language model that integrates large vision and language models adapted for the medical domain. This model goes through three stages of parameter-efficient training using three separate biomedical and radiology multi-modal visual and text datasets. The proposed model achieves state-of-the-art performance on the SLAKE 1.0 medical VQA (MedVQA) dataset with an overall accuracy of 87.5% and demonstrates strong performance on another MedVQA dataset, VQA-RAD, achieving an overall accuracy of 73.2%.
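
The abstract describes parameter-efficient training that fuses a domain-adapted vision encoder with a language model. As a loose illustration only (not the authors' architecture), the sketch below freezes both pretrained backbones and trains a small projection module that maps image features into the language model's embedding space; all module names, shapes, and the projector design are assumptions.

```python
import torch
import torch.nn as nn


class VisionLanguageFusion(nn.Module):
    """Hypothetical sketch: freeze both backbones, train only a small projector."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int, text_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.language_model = language_model
        # Parameter-efficient setup: the large pretrained backbones stay frozen.
        for p in self.vision_encoder.parameters():
            p.requires_grad = False
        for p in self.language_model.parameters():
            p.requires_grad = False
        # Small trainable projection mapping image features into the
        # language model's embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, image_feats: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        img_feats = self.vision_encoder(image_feats)   # (B, N_img, vision_dim)
        img_tokens = self.projector(img_feats)         # (B, N_img, text_dim)
        # Prepend projected image tokens so the language model attends over
        # the fused image-text sequence.
        fused = torch.cat([img_tokens, text_embeds], dim=1)
        return self.language_model(fused)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; a real system would use
    # pretrained, domain-adapted medical vision and language backbones.
    vision = nn.Linear(32, 64)
    language = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True),
        num_layers=1,
    )
    model = VisionLanguageFusion(vision, language, vision_dim=64, text_dim=128)
    patch_feats = torch.randn(2, 49, 32)   # dummy image patch features
    text_embeds = torch.randn(2, 10, 128)  # dummy question token embeddings
    print(model(patch_feats, text_embeds).shape)  # torch.Size([2, 59, 128])
```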

Authors (6)
  1. Cuong Nhat Ha (1 paper)
  2. Shima Asaadi (3 papers)
  3. Sanjeev Kumar Karn (10 papers)
  4. Oladimeji Farri (12 papers)
  5. Tobias Heimann (4 papers)
  6. Thomas Runkler (34 papers)
