MedVQA: Visual Q&A in Medical Imaging
- MedVQA is an interdisciplinary field that combines computer vision and natural language processing to answer clinical questions based on medical images.
- It employs multi-stage models with domain-adapted encoders and attention-based fusion to effectively integrate multimodal data.
- Current research emphasizes causal reasoning, debiasing, and interpretability to enhance diagnostic accuracy and support clinical workflows.
Medical Visual Question Answering (MedVQA) is an interdisciplinary domain that focuses on enabling artificial intelligence systems to answer natural language questions about medical images. The objective is to deliver plausible, clinically relevant responses to automate or assist in diagnostic workflows, radiology report comprehension, disease screening, and educational applications. MedVQA builds on methodologies from computer vision, natural language processing, causal inference, and clinical informatics, integrating them to solve specific challenges presented by complex, multimodal medical data.
1. Task Definition, Datasets, and Domain Challenges
MedVQA requires a system to receive a medical image (such as a chest X-ray, MRI, or CT) and a natural language question, returning an answer that may be a single word (e.g., a binary Yes/No), a multiple-choice selection, or a free-form descriptive response (Lin et al., 2021). A minimal sketch of what such a sample record looks like follows the dataset table below.
A variety of benchmark datasets have been created (with differing scope, question complexity, and modality):
| Dataset | Modality | QA Pairs & Size | Notable Features |
|---|---|---|---|
| VQA-RAD | Radiology (X-ray, CT, MRI) | 3,515 QA pairs | Manually curated, free-form clinician questions |
| CLEF18/19/20/21 | Broad (mostly radiology) | 5,000–15,000+ QA pairs | Synthetic and semi-automated question generation |
| PathVQA | Pathology images | 32,799 QA pairs | Semi-automated QG; open-ended and yes/no; exam-like |
| SLAKE | Mixed | ~14,000 QA pairs | Includes segmentation masks, knowledge graphs |
| PMC-VQA | Multi-modality | 227,000 QA pairs | Diverse medical topics, automatic QA generation + filtering |
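To make the task format concrete, the following is a minimal sketch of how a single QA sample can be represented in code; the field names and file path are illustrative assumptions rather than any dataset's actual schema (SLAKE, for instance, additionally ships segmentation masks and knowledge-graph facts).

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MedVQASample:
    """Illustrative MedVQA record; field names are assumptions, not a dataset spec."""
    image_path: str          # e.g., a chest X-ray, CT slice, or pathology patch
    question: str            # free-form clinician or template question
    answer: str              # "yes"/"no", a short phrase, or a free-form sentence
    answer_type: str         # "closed" (yes/no, multiple choice) vs. "open" (free-form)
    modality: Optional[str] = None  # "X-ray", "CT", "MRI", ... when annotated

sample = MedVQASample(
    image_path="images/example_cxr.jpg",      # hypothetical path
    question="Is there evidence of cardiomegaly?",
    answer="yes",
    answer_type="closed",
    modality="X-ray",
)
```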
Domain-specific challenges include: (i) diversity in question formats and image modalities; (ii) scarcity of labeled data and high expert annotation cost; (iii) limited clinical context, e.g., lack of EHR or patient history; (iv) the need for interpretability and safety due to high-stakes deployment; (v) dataset biases, such as modality preference or label imbalance (Lin et al., 2021, Ye et al., 22 May 2025, Mishra et al., 9 Jul 2025).
2. Model Architectures and Methodological Advances
MedVQA architectures generally adopt a multi-stage framework involving (i) image and question encoders, (ii) multimodal fusion, (iii) answer prediction, and, increasingly, task-specific modules.
Encoders: Early systems relied on CNNs (VGG, ResNet, Inception-ResNet-v2) for image features and LSTM/Bi-LSTM/GRU for text. The field quickly shifted to domain-adapted transformers (BioBERT (Canepa et al., 2023), RadBERT, LLaMA-3 (Alsinglawi et al., 8 Apr 2025), CLIP/BiomedCLIP (Ha et al., 24 Apr 2024)) for higher-quality representations.
Fusion strategies (a minimal encoder-to-answer pipeline sketch follows this list):
- Simple fusion by concatenation or elementwise operations (Canepa et al., 2023)
- Attention-based fusion: Bilinear Attention Networks (BAN), co-attention, dual attention (Huang et al., 2022, Zhang et al., 28 Oct 2024), hierarchical cross-attention (Zhang et al., 4 Apr 2025), multi-view attention (image-to-question, word-to-text) (Pan et al., 2021)
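As a concrete illustration of stages (i)-(iii), below is a minimal PyTorch-style sketch that projects pre-extracted image and question features, fuses them with a simple elementwise product, and classifies over a fixed answer vocabulary. The class name, dimensions, and fusion choice are illustrative assumptions; published systems substitute domain-adapted encoders (e.g., BiomedCLIP, BioBERT) and attention-based fusion such as BAN.

```python
import torch
import torch.nn as nn

class SimpleMedVQA(nn.Module):
    """Minimal encoder -> fusion -> answer-prediction sketch (not a specific published model)."""
    def __init__(self, img_dim=2048, txt_dim=768, hid_dim=1024, num_answers=500):
        super().__init__()
        # Stage (i): project pre-extracted image and question features into a shared space.
        self.img_proj = nn.Linear(img_dim, hid_dim)   # e.g., ResNet/CLIP pooled features
        self.txt_proj = nn.Linear(txt_dim, hid_dim)   # e.g., BioBERT [CLS] embedding
        # Stage (ii): fusion; elementwise product is the simplest option,
        # attention-based fusion (BAN, co-attention) replaces this in practice.
        self.fusion = nn.Sequential(nn.ReLU(), nn.Dropout(0.2))
        # Stage (iii): answer prediction as classification over a candidate answer vocabulary
        # (vocabulary size is dataset-dependent).
        self.classifier = nn.Linear(hid_dim, num_answers)

    def forward(self, img_feat, txt_feat):
        fused = self.fusion(self.img_proj(img_feat) * self.txt_proj(txt_feat))
        return self.classifier(fused)  # logits over candidate answers

# Usage with dummy features (batch of 4):
model = SimpleMedVQA()
logits = model(torch.randn(4, 2048), torch.randn(4, 768))
print(logits.shape)  # torch.Size([4, 500])
```

Treating answer prediction as classification over a candidate vocabulary mirrors the closed-set setting; free-form answers instead require a generative decoder.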
Hierarchical, Causal, and Modular Extensions:
- Hierarchical frameworks (e.g., HQS-VQA (Gupta et al., 2020), HiCA-VQA (Zhang et al., 4 Apr 2025)): Questions are classified by type (binary vs. descriptive, coarse vs. fine-grained) and routed to specialized prediction modules.
- Causal modeling and debiasing: Structural causal models and counterfactual inference are used to mitigate modality preference bias and spurious correlations (Xu et al., 5 May 2025, Ye et al., 22 May 2025). Models learn or enforce P(A|do(I, Q)) rather than P(A|I, Q); one common counterfactual formulation is sketched after this list.
- Latent prompts & knowledge graph integration: LaPA introduces guided prompt generation plus a prior knowledge fusion module to enhance clinical relevance using graph neural networks (Gu et al., 19 Apr 2024).
- Graph-based Reasoning: Multi-modal relationship graphs (spatial, semantic, implicit) are constructed to capture fine-grained image-text relationships, improving interpretability (Hu et al., 2023).
- Retrieval-Augmented VQA: Advanced RAG systems leverage multimodal retrieval using grounded captions and optimal transport-based re-ranking to supply relevant clinical context to the VLM (Shaaban et al., 28 Jun 2025).
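For the causal debiasing direction, a commonly used counterfactual recipe (a sketch in the spirit of counterfactual-VQA-style debiasing; the exact objectives of the cited models differ) discounts answers driven purely by the question's linguistic prior by subtracting its natural direct effect from the total effect:

```latex
\begin{aligned}
\mathrm{TE}  &= Y_{I,\,Q} - Y_{i^{*},\,q^{*}}
  &&\text{(total effect of image and question on the answer scores)}\\
\mathrm{NDE} &= Y_{i^{*},\,Q} - Y_{i^{*},\,q^{*}}
  &&\text{(natural direct effect of the question alone, image blocked)}\\
\mathrm{TIE} &= \mathrm{TE} - \mathrm{NDE} = Y_{I,\,Q} - Y_{i^{*},\,Q}
  &&\text{(debiased score used at inference)}
\end{aligned}
```

Here Y denotes the answer scores under factual or counterfactual inputs, and i*, q* are reference ("no-treatment") values such as blanked or averaged features; scoring answers by TIE rather than the raw posterior is one way to approximate reasoning under do(I, Q) instead of the observational P(A|I, Q).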
3. Evaluation Metrics, Error Analysis, and Model Limitations
Evaluation has predominantly relied on BLEU, accuracy, and word-overlap-based metrics. For closed-set tasks (e.g., Yes/No, organ type), strict correctness can be measured; however, for open-ended answers, automatic metrics often fail to capture semantic correctness, clinical synonymy, or partial credit (Gupta et al., 2020). Model error typologies include semantic near-misses, modality confusion, underspecified queries, and boundary omissions.
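The mismatch between word-overlap metrics and clinical correctness is easy to reproduce; the sketch below (illustrative, using NLTK's sentence-level BLEU with smoothing and a strict exact-match check) scores a clinically equivalent paraphrase near zero.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "the heart is enlarged".split()
prediction = "there is evidence of cardiomegaly".split()  # clinically equivalent, lexically disjoint

smooth = SmoothingFunction().method1
bleu = sentence_bleu([reference], prediction, smoothing_function=smooth)
exact_match = float(prediction == reference)

print(f"BLEU: {bleu:.3f}, exact match: {exact_match}")
# Both scores are near zero even though a clinician would accept the answer,
# which is why word-overlap metrics under-report open-ended performance.
```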
A notable limitation is the modality preference bias, where models over-rely on linguistic priors in question phrasing, especially prevalent in datasets with strong question-answer dependencies. Many models also exhibit hallucination—providing confident but clinically irrelevant or factually wrong answers, as exposed by purpose-built hallucination benchmarks (Wu et al., 11 Jan 2024).
Best practices increasingly incorporate interpretability audits (e.g., GradCAM, attention heatmaps), error categorization, and dedicated hallucination assessment protocols.
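For the interpretability audits mentioned above, a minimal Grad-CAM-style pass can be scripted with forward/backward hooks; the sketch below uses a stock torchvision ResNet-50 and its final convolutional block as stand-ins for a MedVQA image backbone, which are assumptions rather than any cited model's configuration.

```python
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights=None).eval()   # stand-in for a MedVQA image backbone
activations, gradients = {}, {}

def fwd_hook(module, inp, out):
    activations["feat"] = out.detach()

def bwd_hook(module, grad_in, grad_out):
    gradients["feat"] = grad_out[0].detach()

# Hook the last convolutional block (layer4) as the Grad-CAM target layer.
model.layer4.register_forward_hook(fwd_hook)
model.layer4.register_full_backward_hook(bwd_hook)

image = torch.randn(1, 3, 224, 224)     # dummy input image tensor
logits = model(image)
logits[0, logits.argmax()].backward()   # gradient of the predicted "answer" logit

# Grad-CAM: channel weights from spatially pooled gradients, ReLU over the weighted sum.
weights = gradients["feat"].mean(dim=(2, 3), keepdim=True)
cam = F.relu((weights * activations["feat"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalized [0, 1] heatmap
```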
4. Experimental Outcomes and Empirical Findings
Across benchmarks, the adoption of task-aware hierarchical structures, multi-view attention mechanisms, and causal adjustment yields consistent improvements:
| Model / Method | Dataset(s) | Overall Accuracy / Notable Gains | Feature |
|---|---|---|---|
| HQS-VQA (Gupta et al., 2020) | RAD, CLEF18 | BLEU 0.132/0.411 on RAD; superior to baselines | Hierarchical question segregation |
| MuVAM (Pan et al., 2021) | VQA-RAD, Ph | Up to 74.3% (CP dataset) | Multi-view attention |
| WSDAN (Huang et al., 2022) | VQA-RAD, CLEF19 | >83% closed Q, >76% overall (RAD) | Dual attention, sentence embedding |
| LaPA (Gu et al., 19 Apr 2024) | VQA-RAD, SLAKE | Up to 84.73%; 1.8% improvement vs. ARL | Latent prompt fusion, GNN module |
| Tri-VQA (Fan et al., 21 Jun 2024) | SLAKE, EUS | 0.831 (SLAKE), AUC 0.935 (EUS) | Causal triangle, multi-attribute |
| OMniBAN (Zhang et al., 28 Oct 2024) | VQA-RAD | Up to 80.9% closed Q, 75.1% overall | Bilinear attention, low FLOPs |
| MedCFVQA (Ye et al., 22 May 2025) | SLAKE-CP, RadVQA-CP | 0.892 (SLAKE), 0.430 (SLAKE-CP) | Counterfactual debiasing |
| MOTOR (Shaaban et al., 28 Jun 2025) | MIMIC-CXR-VQA | +6.45% accuracy over baselines | OT re-ranking for retrieval |
| Fusion of Domain-Adapted VLMs (Ha et al., 24 Apr 2024) | SLAKE, VQA-RAD | 87.5% (SLAKE), 73.2% (VQA-RAD) | LoRA, Med-CLIP, RadBloomz-7b |
| MedThink (Gai et al., 18 Apr 2024) | R-RAD, R-SLAKE | 83.5% (R-RAD), 86.3% (R-SLAKE) | Rationale generation, explanation |
Empirical studies overwhelmingly support the benefit of integrating domain-specific pre-trained encoders (e.g., BiomedCLIP, BioBERT), hierarchical question segmentation, and causal intervention frameworks (Ha et al., 24 Apr 2024, Xu et al., 5 May 2025). Efficient fusion architectures such as OMniBAN enable performance on par with deep transformers but with substantially reduced resource requirements (Zhang et al., 28 Oct 2024).
5. Clinical Barriers and User Perspectives
Despite technical progress, multiple studies and clinician surveys reveal critical translational barriers for MedVQA in radiology:
- Up to 60% of QA pairs in public datasets are non-diagnostic or lack clinical relevance (Mishra et al., 9 Jul 2025).
- Integrated support for multi-view, multi-resolution input is absent from nearly all current systems, even though multi-view reading is standard clinical practice for CT/MRI and radiographs.
- Only about 20% of reviewed datasets integrate domain/external knowledge (e.g., EHRs, ontology triplets), while clinical practice relies heavily on such patient context.
- Domain adaptation is often limited; approximately 87% of models rely on pre-training from general-domain data, diminishing clinical validity.
- Less than one-third of clinicians surveyed (29.8%) rated MedVQA systems as highly useful in practice, with widespread concerns about lack of patient-specific context, dataset curation, and the need for dialogue-based rather than single-turn Q&A (Mishra et al., 9 Jul 2025).
- Evaluation metrics currently used (accuracy, BLEU) are poorly aligned with clinician-perceived clinical correctness, safety, and interpretability.
Table: Selected Clinical Concerns from Survey (Mishra et al., 9 Jul 2025)
| Concern | % Respondents |
|---|---|
| Lack of domain knowledge or EHR | 87.2% |
| Preference for expert-curated datasets | 51.1% |
| Need for multi-view support | 78.7% |
| Preference for region/anatomy focus | 66% |
| Preference for dialogue-style QA | 89.4% |
6. Future Directions, Open Issues, and Research Prospects
Several consensus directions emerge across the recent literature:
- Task Realism and Dataset Diversity: Expand beyond static, downsampled images and template QA pairs. Develop benchmarks with multi-view, multi-resolution, and real patient-EHR linkage. Encourage inclusion of clinical dialog and patient-centric reasoning (Lin et al., 2021, Mishra et al., 9 Jul 2025).
- Domain Adaptation and Knowledge Integration: Emphasize medical pre-training, multimodal EHR fusion, and ontology-aware learning frameworks.
- Causal and Counterfactual Reasoning: Broaden the adoption of explicit causal graph models, counterfactual inference, and debiasing of spurious language-image shortcuts (Xu et al., 5 May 2025, Ye et al., 22 May 2025, Fan et al., 21 Jun 2024).
- Interpretability and Dialogue: Develop systems generating rationales, supporting stepwise explanations and multi-turn dialogue, not just single-answer outputs (Gai et al., 18 Apr 2024). Integrate grounded visual evidence, attention heatmaps, and confidence measures.
- Workflow Integration and Metrics: Move toward clinical PACS/EHR integration, efficient and scalable models (e.g., OMniBAN), and design new evaluation metrics focusing on diagnostic validity, entity/relation correctness (similar to CheXbert/RadGraph; a toy metric sketch follows this list), and interpretability (Mishra et al., 9 Jul 2025).
- Hallucination and Reliability: Systematically measure factual hallucinations, non-answering ability (e.g., “None of the above”), and the effects of role-based or “don’t hallucinate” prompting. Benchmarks and ablation studies to reduce spurious clinical claims are critical (Wu et al., 11 Jan 2024).
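As one concrete shape such an entity/relation metric could take, the sketch below computes F1 over (entity, relation, entity) tuples, assuming an upstream RadGraph- or CheXbert-style extractor has already produced the tuples; the tuple format and the example findings are illustrative assumptions.

```python
from typing import Set, Tuple

Triplet = Tuple[str, str, str]  # (entity, relation, entity), e.g. ("cardiomegaly", "present", "heart")

def entity_relation_f1(pred: Set[Triplet], gold: Set[Triplet]) -> float:
    """F1 over extracted tuples; the extraction step itself is assumed, not implemented."""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

# Toy example: the prediction recovers one of two gold findings.
gold = {("cardiomegaly", "present", "heart"), ("effusion", "present", "pleura")}
pred = {("cardiomegaly", "present", "heart")}
print(round(entity_relation_f1(pred, gold), 3))  # 0.667
```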
7. Summary
MedVQA has rapidly evolved from early CNN-LSTM classification-style models to highly specialized, hierarchy-aware, interpretable, and causally robust multimodal systems. While state-of-the-art methods surpass previous baselines in accuracy, generalizability, and efficiency, substantial technical and translational hurdles remain before routine clinical integration. Future research must address clinical context awareness, interpretability, and the development of evaluation paradigms that align with diagnostic safety and workflow demands for effective adoption in medical practice.