MedVQA: Visual Q&A in Medical Imaging
- MedVQA is an interdisciplinary field that combines computer vision and natural language processing to answer clinical questions based on medical images.
- It employs multi-stage models with domain-adapted encoders and attention-based fusion to effectively integrate multimodal data.
- Current research emphasizes causal reasoning, debiasing, and interpretability to enhance diagnostic accuracy and support clinical workflows.
Medical Visual Question Answering (MedVQA) is an interdisciplinary domain that focuses on enabling artificial intelligence systems to answer natural language questions about medical images. The objective is to deliver plausible, clinically relevant responses to automate or assist in diagnostic workflows, radiology report comprehension, disease screening, and educational applications. MedVQA builds on methodologies from computer vision, natural language processing, causal inference, and clinical informatics, integrating them to solve specific challenges presented by complex, multimodal medical data.
1. Task Definition, Datasets, and Domain Challenges
MedVQA requires a system to receive a medical image (such as a chest X-ray, MRI, or CT) and a natural language question, returning an answer that may be a single word (e.g., a binary Yes/No), a multiple-choice selection, or a free-form descriptive response (Lin et al., 2021). A minimal sketch of what such a sample record looks like follows the dataset table below.
A variety of benchmark datasets have been created (with differing scope, question complexity, and modality):
| Dataset | Modality | QA Pairs & Size | Notable Features |
|---|---|---|---|
| VQA-RAD | Radiology (X-ray, CT, MRI) | 3,515 QA pairs | Manually curated, free-form clinician questions |
| CLEF18/19/20/21 | Broad (mostly radiology) | 5,000–15,000+ QA pairs | Synthetic and semi-automated question generation |
| PathVQA | Pathology images | 32,799 QA pairs | Semi-automated QG; open-ended and yes/no; exam-like |
| SLAKE | Mixed | ~14,000 QA pairs | Includes segmentation masks, knowledge graphs |
| PMC-VQA | Multi-modality | 227,000 QA pairs | Diverse medical topics, automatic QA generation + filtering |
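To make the task format concrete, the following is a minimal sketch of how a single QA sample can be represented in code; the field names and file path are illustrative assumptions rather than any dataset's actual schema (SLAKE, for instance, additionally ships segmentation masks and knowledge-graph facts).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MedVQASample:
    """Illustrative MedVQA record; field names are assumptions, not a dataset spec."""
    image_path: str          # e.g., a chest X-ray, CT slice, or pathology patch
    question: str            # free-form clinician or template question
    answer: str              # "yes"/"no", a short phrase, or a free-form sentence
    answer_type: str         # "closed" (yes/no, multiple choice) vs. "open" (free-form)
    modality: Optional[str] = None  # "X-ray", "CT", "MRI", ... when annotated

sample = MedVQASample(
    image_path="images/example_cxr.jpg",      # hypothetical path
    question="Is there evidence of cardiomegaly?",
    answer="yes",
    answer_type="closed",
    modality="X-ray",
)
```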
Domain-specific challenges include: (i) diversity in question formats and image modalities; (ii) scarcity of labeled data and high expert annotation cost; (iii) limited clinical context, e.g., lack of EHR or patient history; (iv) the need for interpretability and safety due to high-stakes deployment; (v) dataset biases, such as modality preference or label imbalance (Lin et al., 2021, Ye et al., 22 May 2025, Mishra et al., 9 Jul 2025).
2. Model Architectures and Methodological Advances
MedVQA architectures generally adopt a multi-stage framework involving (i) image and question encoders, (ii) multimodal fusion, (iii) answer prediction, and, increasingly, task-specific modules.
Encoders: Early systems relied on CNNs (VGG, ResNet, Inception-ResNet-v2) for image features and LSTM/Bi-LSTM/GRU for text. The field quickly shifted to domain-adapted transformers (BioBERT (Canepa et al., 2023), RadBERT, LLaMA-3 (Alsinglawi et al., 8 Apr 2025), CLIP/BiomedCLIP (Ha et al., 24 Apr 2024)) for higher-quality representations.
Fusion strategies (a minimal encoder-to-answer pipeline sketch follows this list):
- Simple fusion by concatenation or elementwise operations (Canepa et al., 2023)
- Attention-based fusion: Bilinear Attention Networks (BAN), co-attention, dual attention (Huang et al., 2022, Zhang et al., 28 Oct 2024), hierarchical cross-attention (Zhang et al., 4 Apr 2025), multi-view attention (image-to-question, word-to-text) (Pan et al., 2021)
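As a concrete illustration of stages (i)-(iii), below is a minimal PyTorch-style sketch that projects pre-extracted image and question features, fuses them with a simple elementwise product, and classifies over a fixed answer vocabulary. The class name, dimensions, and fusion choice are illustrative assumptions; published systems substitute domain-adapted encoders (e.g., BiomedCLIP, BioBERT) and attention-based fusion such as BAN.

```python
import torch
import torch.nn as nn

class SimpleMedVQA(nn.Module):
    """Minimal encoder -> fusion -> answer-prediction sketch (not a specific published model)."""
    def __init__(self, img_dim=2048, txt_dim=768, hid_dim=1024, num_answers=500):
        super().__init__()
        # Stage (i): project pre-extracted image and question features into a shared space.
        self.img_proj = nn.Linear(img_dim, hid_dim)   # e.g., ResNet/CLIP pooled features
        self.txt_proj = nn.Linear(txt_dim, hid_dim)   # e.g., BioBERT [CLS] embedding
        # Stage (ii): fusion; elementwise product is the simplest option,
        # attention-based fusion (BAN, co-attention) replaces this in practice.
        self.fusion = nn.Sequential(nn.ReLU(), nn.Dropout(0.2))
        # Stage (iii): answer prediction as classification over a candidate answer vocabulary
        # (vocabulary size is dataset-dependent).
        self.classifier = nn.Linear(hid_dim, num_answers)

    def forward(self, img_feat, txt_feat):
        fused = self.fusion(self.img_proj(img_feat) * self.txt_proj(txt_feat))
        return self.classifier(fused)  # logits over candidate answers

# Usage with dummy features (batch of 4):
model = SimpleMedVQA()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 500])
```

Treating answer prediction as classification over a candidate vocabulary mirrors the closed-set setting; free-form answers instead require a generative decoder.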
Hierarchical, Causal, and Modular Extensions:
- Hierarchical frameworks (e.g., HQS-VQA (Gupta et al., 2020), HiCA-VQA (Zhang et al., 4 Apr 2025)): Questions are classified by type (binary vs. descriptive, coarse vs. fine-grained) and routed to specialized prediction modules.
- Causal modeling and debiasing: Structural causal models and counterfactual inference are used to mitigate modality preference bias and spurious correlations (Xu et al., 5 May 2025, Ye et al., 22 May 2025). Models learn or enforce P(A|do(I, Q)) rather than P(A|I, Q); one common counterfactual formulation is sketched after this list.
- Latent prompts & knowledge graph integration: LaPA introduces guided prompt generation plus a prior knowledge fusion module to enhance clinical relevance using graph neural networks (Gu et al., 19 Apr 2024).
- Graph-based Reasoning: Multi-modal relationship graphs (spatial, semantic, implicit) are constructed to capture fine-grained image-text relationships, improving interpretability (Hu et al., 2023).
- Retrieval-Augmented VQA: Advanced RAG systems leverage multimodal retrieval using grounded captions and optimal transport-based re-ranking to supply relevant clinical context to the VLM (Shaaban et al., 28 Jun 2025).
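For the causal debiasing direction, a commonly used counterfactual recipe (a sketch in the spirit of counterfactual-VQA-style debiasing; the exact objectives of the cited models differ) discounts answers driven purely by the question's linguistic prior by subtracting its natural direct effect from the total effect:

```latex
\begin{aligned}
\mathrm{TE}  &= Y_{I,\,Q} - Y_{i^{*},\,q^{*}}
  &&\text{(total effect of image and question on the answer scores)}\\
\mathrm{NDE} &= Y_{i^{*},\,Q} - Y_{i^{*},\,q^{*}}
  &&\text{(natural direct effect of the question alone, image blocked)}\\
\mathrm{TIE} &= \mathrm{TE} - \mathrm{NDE} = Y_{I,\,Q} - Y_{i^{*},\,Q}
  &&\text{(debiased score used at inference)}
\end{aligned}
```

Here Y denotes the answer scores under factual or counterfactual inputs, and i*, q* are reference ("no-treatment") values such as blanked or averaged features; scoring answers by TIE rather than the raw posterior is one way to approximate reasoning under do(I, Q) instead of the observational P(A|I, Q).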
3. Evaluation Metrics, Error Analysis, and Model Limitations
Evaluation has predominantly relied on BLEU, accuracy, and word-overlap-based metrics. For closed-set tasks (e.g., Yes/No, organ type), strict correctness can be measured; however, for open-ended answers, automatic metrics often fail to capture semantic correctness, clinical synonymy, or partial credit (Gupta et al., 2020). Model error typologies include semantic near-misses, modality confusion, underspecified queries, and boundary omissions.
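The mismatch between word-overlap metrics and clinical correctness is easy to reproduce; the sketch below (illustrative, using NLTK's sentence-level BLEU with smoothing and a strict exact-match check) scores a clinically equivalent paraphrase near zero.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the heart is enlarged".split()
prediction = "there is evidence of cardiomegaly".split()  # clinically equivalent, lexically disjoint

smooth = SmoothingFunction().method1
bleu = sentence_bleu([reference], prediction, smoothing_function=smooth)
exact_match = float(prediction == reference)

print(f"BLEU: {bleu:.3f}, exact match: {exact_match}")
# Both scores are near zero even though a clinician would accept the answer,
# which is why word-overlap metrics under-report open-ended performance.
```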
A notable limitation is the modality preference bias, where models over-rely on linguistic priors in question phrasing, especially prevalent in datasets with strong question-answer dependencies. Many models also exhibit hallucination—providing confident but clinically irrelevant or factually wrong answers, as exposed by purpose-built hallucination benchmarks (Wu et al., 11 Jan 2024).
Best practices increasingly incorporate interpretability audits (e.g., GradCAM, attention heatmaps), error categorization, and dedicated hallucination assessment protocols.
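For the interpretability audits mentioned above, a minimal Grad-CAM-style pass can be scripted with forward/backward hooks; the sketch below uses a stock torchvision ResNet-50 and its final convolutional block as stand-ins for a MedVQA image backbone, which are assumptions rather than any cited model's configuration.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights=None).eval()   # stand-in for a MedVQA image backbone
activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations["feat"] = out.detach()

def bwd_hook(module, grad_in, grad_out):
    gradients["feat"] = grad_out[0].detach()

# Hook the last convolutional block (layer4) as the Grad-CAM target layer.
model.layer4.register_forward_hook(fwd_hook)
model.layer4.register_full_backward_hook(bwd_hook)

image = torch.randn(1, 3, 224, 224)     # dummy input image tensor
logits = model(image)
logits[0, logits.argmax()].backward()   # gradient of the predicted "answer" logit

# Grad-CAM: channel weights from spatially pooled gradients, ReLU over the weighted sum.
weights = gradients["feat"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * activations["feat"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalized [0, 1] heatmap
```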
4. Experimental Outcomes and Empirical Findings
Across benchmarks, the adoption of task-aware hierarchical structures, multi-view attention mechanisms, and causal adjustment yields consistent improvements:
| Model / Method | Dataset(s) | Overall Accuracy / Notable Gains | Feature |
|---|---|---|---|
| HQS-VQA (Gupta et al., 2020) | RAD, CLEF18 | BLEU 0.132/0.411 on RAD; superior to baselines | Hierarchical question segregation |
| MuVAM (Pan et al., 2021) | VQA-RAD, Ph | Up to 74.3% (CP dataset) | Multi-view attention |
| WSDAN (Huang et al., 2022) | VQA-RAD, CLEF19 | >83% closed Q, >76% overall (RAD) | Dual attention, sentence embedding |
| LaPA (Gu et al., 19 Apr 2024) | VQA-RAD, SLAKE | Up to 84.73%; 1.8% improvement vs. ARL | Latent prompt fusion, GNN module |
| Tri-VQA (Fan et al., 21 Jun 2024) | SLAKE, EUS | 0.831 (SLAKE), AUC 0.935 (EUS) | Causal triangle, multi-attribute |
| OMniBAN (Zhang et al., 28 Oct 2024) | VQA-RAD | Up to 80.9% closed Q, 75.1% overall | Bilinear attention, low FLOPs |
| MedCFVQA (Ye et al., 22 May 2025) | SLAKE-CP, RadVQA-CP | 0.892 (SLAKE), 0.430 (SLAKE-CP) | Counterfactual debiasing |
| MOTOR (Shaaban et al., 28 Jun 2025) | MIMIC-CXR-VQA | +6.45% accuracy over baselines | OT re-ranking for retrieval |
| Fusion of Domain-Adapted VLMs (Ha et al., 24 Apr 2024) | SLAKE, VQA-RAD | 87.5% (SLAKE), 73.2% (VQA-RAD) | LoRA, Med-CLIP, RadBloomz-7b |
| MedThink (Gai et al., 18 Apr 2024) | R-RAD, R-SLAKE | 83.5% (R-RAD), 86.3% (R-SLAKE) | Rationale generation, explanation |
Empirical studies overwhelmingly support the benefit of integrating domain-specific pre-trained encoders (e.g., BiomedCLIP, BioBERT), hierarchical question segmentation, and causal intervention frameworks (Ha et al., 24 Apr 2024, Xu et al., 5 May 2025). Efficient fusion architectures such as OMniBAN enable performance on par with deep transformers but with substantially reduced resource requirements (Zhang et al., 28 Oct 2024).
5. Clinical Barriers and User Perspectives
Despite technical progress, multiple studies and clinician surveys reveal critical translational barriers for MedVQA in radiology:
- Up to 60% of QA pairs in public datasets are non-diagnostic or lack clinical relevance (Mishra et al., 9 Jul 2025).
- Integrated support for multi-view, multi-resolution input is absent from nearly all current systems, even though multi-view reading is standard clinical practice for CT/MRI and radiographs.
- Only about 20% of reviewed datasets integrate domain/external knowledge (e.g., EHRs, ontology triplets), while clinical practice relies heavily on such patient context.
- Domain adaptation is often limited; approximately 87% of models rely on pre-training from general-domain data, diminishing clinical validity.
- Less than one-third of clinicians surveyed (29.8%) rated MedVQA systems as highly useful in practice, with widespread concerns about lack of patient-specific context, dataset curation, and the need for dialogue-based rather than single-turn Q&A (Mishra et al., 9 Jul 2025).
- Evaluation metrics currently used (accuracy, BLEU) are poorly aligned with clinician-perceived clinical correctness, safety, and interpretability.
Table: Selected Clinical Concerns from Survey (Mishra et al., 9 Jul 2025)
| Concern | % Respondents |
|---|---|
| Lack of domain knowledge or EHR | 87.2% |
| Preference for expert-curated datasets | 51.1% |
| Need for multi-view support | 78.7% |
| Preference for region/anatomy focus | 66% |
| Preference for dialogue-style QA | 89.4% |
6. Future Directions, Open Issues, and Research Prospects
Several consensus directions emerge across the recent literature:
- Task Realism and Dataset Diversity: Expand beyond static, downsampled images and template QA pairs. Develop benchmarks with multi-view, multi-resolution, and real patient-EHR linkage. Encourage inclusion of clinical dialog and patient-centric reasoning (Lin et al., 2021, Mishra et al., 9 Jul 2025).
- Domain Adaptation and Knowledge Integration: Emphasize medical pre-training, multimodal EHR fusion, and ontology-aware learning frameworks.
- Causal and Counterfactual Reasoning: Broaden the adoption of explicit causal graph models, counterfactual inference, and debiasing of spurious language-image shortcuts (Xu et al., 5 May 2025, Ye et al., 22 May 2025, Fan et al., 21 Jun 2024).
- Interpretability and Dialogue: Develop systems generating rationales, supporting stepwise explanations and multi-turn dialogue, not just single-answer outputs (Gai et al., 18 Apr 2024). Integrate grounded visual evidence, attention heatmaps, and confidence measures.
- Workflow Integration and Metrics: Move toward clinical PACS/EHR integration, efficient and scalable models (e.g., OMniBAN), and design new evaluation metrics focusing on diagnostic validity, entity/relation correctness (similar to CheXbert/RadGraph; a toy metric sketch follows this list), and interpretability (Mishra et al., 9 Jul 2025).
- Hallucination and Reliability: Systematically measure factual hallucinations, non-answering ability (e.g., “None of the above”), and the effects of role-based or “don’t hallucinate” prompting. Benchmarks and ablation studies to reduce spurious clinical claims are critical (Wu et al., 11 Jan 2024).
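As one concrete shape such an entity/relation metric could take, the sketch below computes F1 over (entity, relation, entity) tuples, assuming an upstream RadGraph- or CheXbert-style extractor has already produced the tuples; the tuple format and the example findings are illustrative assumptions.

```python
from typing import Set, Tuple

Triplet = Tuple[str, str, str]  # (entity, relation, entity), e.g. ("cardiomegaly", "present", "heart")

def entity_relation_f1(pred: Set[Triplet], gold: Set[Triplet]) -> float:
    """F1 over extracted tuples; the extraction step itself is assumed, not implemented."""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

# Toy example: the prediction recovers one of two gold findings.
gold = {("cardiomegaly", "present", "heart"), ("effusion", "present", "pleura")}
pred = {("cardiomegaly", "present", "heart")}
print(round(entity_relation_f1(pred, gold), 3))  # 0.667
```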
7. Summary
MedVQA has rapidly evolved from early CNN-LSTM classification-style models to highly specialized, hierarchy-aware, interpretable, and causally robust multimodal systems. While state-of-the-art methods surpass previous baselines in accuracy, generalizability, and efficiency, substantial technical and translational hurdles remain before routine clinical integration. Future research must address clinical context awareness, interpretability, and the development of evaluation paradigms that align with diagnostic safety and workflow demands for effective adoption in medical practice.