Hallucination Benchmark in Medical Visual Question Answering (2401.05827v2)
Abstract: The recent success of large language and vision models (LLVMs) on visual question answering (VQA), particularly their applications in medicine (Med-VQA), has shown great potential for realizing effective visual assistants for healthcare. However, these models have not been extensively tested for hallucination in clinical settings. Here, we created a hallucination benchmark of medical images paired with question-answer sets and conducted a comprehensive evaluation of state-of-the-art models. The study provides an in-depth analysis of current models' limitations and reveals the effectiveness of various prompting strategies.