
MEDIQA-WV 2025: Multimodal Wound Care VQA

Updated 20 November 2025
  • MEDIQA-WV 2025 Shared Task is a benchmark challenge in medical NLP that tasks AI with generating clinically coherent text and structured metadata from wound care queries with images.
  • The task leverages a curated dataset of 477 cases with 748 images and comprehensive annotations across seven wound attributes to support remote care applications.
  • Evaluation integrates traditional metrics and advanced LLM-based scores, using retrieval-augmented and metadata-guided methods to enhance clinical reliability and reduce hallucinations.

The MEDIQA-WV 2025 Shared Task is a benchmark challenge in medical multimodal natural language processing, specifically targeting the generation of free-text and structured responses to wound care queries with paired clinical images. The task catalyzes the development of AI systems capable of multimodal reasoning in asynchronous remote care, with an emphasis on reliability, clinical precision, and schema conformity.

1. Task Definition and Dataset

The MEDIQA-WV 2025 Shared Task is centered on wound-care Visual Question Answering (VQA), requiring systems to generate clinically coherent free-text responses and extract wound metadata from patient queries accompanied by one or more wound images. The task addresses the need for scalable AI assistants in remote wound management, where provider workload is intensified by asynchronous care models.

The official dataset, WoundcareVQA, comprises 477 cases with 748 wound images and 768 expert responses. Each case includes a natural language patient query and corresponding expert response, with an average query length of 44–52 words and response length of 29–47 words. Rich metadata annotation is supplied for seven wound attributes: Anatomic Location, Wound Type, Wound Thickness, Tissue Color, Drainage Type, Drainage Amount, and Signs of Infection. Inter-annotator agreement (IAA) for these attributes ranges from 0.81 to 1.0, with higher reliability for wound type and tissue color (Durgapraveen et al., 13 Nov 2025, Karim et al., 12 Oct 2025).
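The seven-attribute annotation schema lends itself to a simple structured representation. The following Python sketch is purely illustrative: the field names mirror the attributes listed above, but the example values and the case container are assumptions rather than the official WoundcareVQA format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Illustrative container for one WoundcareVQA case; field names mirror the
# seven annotated attributes, but the value sets are only sketched here.
@dataclass
class WoundMetadata:
    anatomic_location: Optional[str] = None   # e.g. "lower leg" (hypothetical)
    wound_type: Optional[str] = None          # e.g. "surgical"
    wound_thickness: Optional[str] = None     # e.g. "partial thickness"
    tissue_color: Optional[str] = None        # e.g. "red"
    drainage_type: Optional[str] = None       # e.g. "serous"
    drainage_amount: Optional[str] = None     # e.g. "minimal"
    signs_of_infection: Optional[str] = None  # e.g. "none observed"

@dataclass
class WoundCase:
    query: str                      # patient's natural-language question
    image_paths: List[str]          # one or more wound images per case
    expert_response: str            # reference free-text answer
    metadata: WoundMetadata = field(default_factory=WoundMetadata)
```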

2. Evaluation Metrics and Leaderboard Structure

To comprehensively evaluate system outputs, the organizers employ a composite average of ten metrics: deltaBLEU (dBLEU), ROUGE-1/2/L/Lsum, BERTScore (mean and max reference), and three independent LLM-based plausibility scores (DeepSeek-V3-0324, Gemini-1.5-pro-002, and GPT-4o). The average score for each submission is calculated as the arithmetic mean of the ten metrics:

$$\text{Avg} = \frac{\mathrm{dBLEU} + \mathrm{R1} + \mathrm{R2} + \mathrm{RL} + \mathrm{RLsum} + \mathrm{BERT\text{-}mn} + \mathrm{BERT\text{-}mx} + \mathrm{DeepSeek} + \mathrm{Gemini} + \mathrm{GPT4o}}{10}$$
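As a concrete illustration, the composite score is a plain arithmetic mean once the ten component metrics are available. The snippet below is a minimal sketch assuming the scores are already expressed on a common scale (e.g., percentages); the key names are descriptive placeholders, not official identifiers.

```python
def composite_average(scores: dict) -> float:
    """Arithmetic mean of the ten leaderboard metrics.

    `scores` maps metric names to values on a common scale; the keys
    below mirror the terms in the formula above.
    """
    keys = [
        "dBLEU", "ROUGE-1", "ROUGE-2", "ROUGE-L", "ROUGE-Lsum",
        "BERTScore-mean", "BERTScore-max",
        "DeepSeek-V3", "Gemini-1.5-pro", "GPT-4o",
    ]
    return sum(scores[k] for k in keys) / len(keys)
```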

The leaderboard reflects both surface-form overlap (deltaBLEU, ROUGE, BERTScore) and clinical appropriateness as judged by high-fidelity LLM evaluators. For reference, the MasonNLP system ranked 3rd with an average score of 41.37% across 19 teams and 51 submissions (Karim et al., 12 Oct 2025); the EXL Health AI Lab system achieved a top deltaBLEU of 13.04 on mined prompting configurations (Durgapraveen et al., 13 Nov 2025).

| System | deltaBLEU | ROUGE-L | BERTScore | Avg |
|---|---|---|---|---|
| EXL Services–Health | 9.92 | 45.61 | 62.18 | 47.30 |
| MasonNLP (RAG) | 8.89 | 42.19 | 59.01 | 41.37 |
| LLaMA-4 (zero-shot baseline) | 1.73 | 14.00 | 29.00 | 14.10 |

3. Mined Prompting and Retrieval-Augmented Generation

Mined prompting leverages semantic retrieval of training examples to enhance multimodal LLM outputs. The approach involves embedding all training queries (with associated image prompts) using a sentence transformer, such as all-mpnet-base-v2 (768-dimensional), and retrieving the top-k most similar examples via cosine similarity:

$$\mathrm{sim}(x_i, x_j) = \frac{e_i \cdot e_j}{\|e_i\|\,\|e_j\|}$$

The optimal number of retrieved shots ($k$) is model-dependent, with ablations indicating $k=25$ for InternVL3-38B and $k=5$ for MedGemma-27B (Durgapraveen et al., 13 Nov 2025). Prompt construction concatenates a system instruction, $k$ demonstration blocks (query, image description, expert response), and the new patient query. This retrieval-augmented in-context learning setup yields significant metric improvements: for example, MedGemma-27B 5-shot prompting attained a 191% improvement in deltaBLEU over the full-metadata baseline (13.04 vs. 4.48).
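The retrieval and prompt-assembly steps can be sketched with the sentence-transformers library, using the all-mpnet-base-v2 encoder named above. The template wording and the demonstration field names below are illustrative assumptions, not the teams' exact prompts.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")  # 768-dim sentence embeddings

def mine_examples(new_query: str, train_queries: list[str], k: int = 5) -> list[int]:
    """Return indices of the k training queries most similar to the new query
    (cosine similarity over sentence-transformer embeddings)."""
    train_emb = encoder.encode(train_queries, convert_to_tensor=True, normalize_embeddings=True)
    query_emb = encoder.encode(new_query, convert_to_tensor=True, normalize_embeddings=True)
    sims = util.cos_sim(query_emb, train_emb)[0]   # shape: (len(train_queries),)
    return sims.topk(k).indices.tolist()

def build_prompt(system_instruction: str, demos: list[dict], new_query: str) -> str:
    """Concatenate a system instruction, k demonstration blocks, and the new query.
    Demo dicts are assumed to hold 'query', 'image_description', 'response' keys."""
    blocks = [system_instruction]
    for d in demos:
        blocks.append(
            f"Patient query: {d['query']}\n"
            f"Image findings: {d['image_description']}\n"
            f"Expert response: {d['response']}"
        )
    blocks.append(f"Patient query: {new_query}\nExpert response:")
    return "\n\n".join(blocks)
```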

An alternative formulation, as explored by MasonNLP, employs multimodal retrieval-augmented generation (RAG) with both text (all-MiniLM-L6-v2) and image (CLIP ViT-B/32) embeddings. Similarity is computed as a weighted sum of text and image similarities ($\alpha = 0.5$), and the top two exemplars are fused into the prompt for generation with LLaMA-4. This framework notably reduces hallucinations, out-of-vocabulary label rates, and template-like responses (Karim et al., 12 Oct 2025).
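A minimal sketch of the fused similarity is given below, using the all-MiniLM-L6-v2 text encoder and the sentence-transformers CLIP ViT-B/32 wrapper. The weighting and top-2 selection follow the description above; the function signature and loading details are assumptions.

```python
from PIL import Image
from sentence_transformers import SentenceTransformer, util

text_encoder = SentenceTransformer("all-MiniLM-L6-v2")
image_encoder = SentenceTransformer("clip-ViT-B-32")   # CLIP ViT-B/32 wrapper

def fused_similarity(query_text, query_image_path, cand_texts, cand_image_paths, alpha=0.5):
    """Weighted sum of text and image cosine similarities per candidate:
    sim = alpha * sim_text + (1 - alpha) * sim_image."""
    q_t = text_encoder.encode(query_text, convert_to_tensor=True)
    c_t = text_encoder.encode(cand_texts, convert_to_tensor=True)
    q_i = image_encoder.encode(Image.open(query_image_path), convert_to_tensor=True)
    c_i = image_encoder.encode([Image.open(p) for p in cand_image_paths], convert_to_tensor=True)
    sim_text = util.cos_sim(q_t, c_t)[0]
    sim_image = util.cos_sim(q_i, c_i)[0]
    return alpha * sim_text + (1.0 - alpha) * sim_image

# The top two exemplars are then fused into the generation prompt, e.g.:
# top2 = fused_similarity(q, img, texts, imgs).topk(2).indices.tolist()
```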

4. Metadata-Guided Generation

Metadata-guided generation augments prompts with structured wound attributes predicted automatically from the input. A metadata ablation identified four high-impact attributes: Anatomic Location, Wound Type, Drainage Type, and Tissue Color.

Attribute prediction is formulated as few-shot classification: MedGemma-27B, a 27B-parameter multimodal LLM with a MedSigLIP image encoder, is prompted with five domain-expert demonstrations per class. Each predicted attribute is accompanied by a calibrated confidence $c_j \in [0, 1]$. For each attribute, if $c_j \geq \tau$ (with $\tau = 0.7$), its value is included as a factual prompt statement; otherwise, a hedged statement is given ("Findings suggest possible $A_j = \hat{y}_j$; confirm clinically.").
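The confidence gating can be expressed compactly, as in the sketch below; the threshold and hedged wording follow the description above, while the function name and input format are illustrative assumptions.

```python
TAU = 0.7  # confidence threshold from the metadata-guided configuration

def metadata_statements(predictions: dict[str, tuple[str, float]], tau: float = TAU) -> list[str]:
    """Turn per-attribute (value, confidence) predictions into prompt lines.

    High-confidence attributes become factual statements; low-confidence ones
    are hedged so the downstream model avoids overconfident assertions.
    """
    lines = []
    for attr, (value, conf) in predictions.items():
        if conf >= tau:
            lines.append(f"{attr}: {value}.")
        else:
            lines.append(f"Findings suggest possible {attr} = {value}; confirm clinically.")
    return lines

# Example (hypothetical confidences):
# metadata_statements({"Wound Type": ("surgical", 0.91), "Drainage Type": ("serous", 0.55)})
# -> ["Wound Type: surgical.",
#     "Findings suggest possible Drainage Type = serous; confirm clinically."]
```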

Inclusion of metadata enhances clinical coherence, particularly in attributes with high IAA, though overall lexical overlap remains modest (metadata-guided deltaBLEU = 5.70) (Durgapraveen et al., 13 Nov 2025). Dynamic integration of high-confidence attributes positions the approach for safety-critical deployment, where overconfident factual assertions are avoided in ambiguous cases.

5. Model Architectures and Prompt Engineering

The principal model backbones explored in the challenge were MedGemma-27B and InternVL3-38B, both multimodal LLMs with ViT-based vision modules, and, in the case of MasonNLP, meta-llama/Llama-4-Scout-17B-16E-Instruct with integrated vision processing. Systems used prompt-based adaptation exclusively; no additional fine-tuning was conducted on shared-task data.

Prompt engineering played a critical role. For mined prompting, inclusion of multiple in-domain exemplars increased performance, with surface metrics peaking at a model-specific $k$. For metadata-guided approaches, prompt templates conditionally inserted high-confidence factual statements or hedged text regarding wound attributes, directly affecting downstream informativeness and plausibility as assessed by LLM judges.

6. Error Modes, Limitations, and Future Directions

Despite substantial gains over baselines, challenges persist. Zero-shot models frequently produce invalid JSON structures, schema violations, or hallucinated findings such as unwarranted infection diagnoses (reduced from 33% to 6.5% by RAG) (Karim et al., 12 Oct 2025). Retrieval-augmented prompting and metadata guidance reduce these errors but overall deltaBLEU remains below 14, highlighting inherent task complexity (Durgapraveen et al., 13 Nov 2025).
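How such schema violations might be detected is illustrated by the sketch below; the allowed-value sets are placeholders, not the official task schema.

```python
import json

# Placeholder label sets; the official schema defines the actual vocabularies.
ALLOWED = {
    "Anatomic Location": None,           # free text, no closed vocabulary assumed
    "Wound Type": {"surgical", "pressure ulcer", "laceration", "abrasion"},
    "Drainage Type": {"serous", "serosanguineous", "purulent", "none"},
}

def validate_output(raw: str) -> list[str]:
    """Return a list of schema problems found in a model's structured output."""
    problems = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        return [f"invalid JSON: {e}"]
    for attr, allowed in ALLOWED.items():
        if attr not in data:
            problems.append(f"missing attribute: {attr}")
        elif allowed is not None and data[attr] not in allowed:
            problems.append(f"out-of-vocabulary value for {attr}: {data[attr]!r}")
    return problems
```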

Limitations include moderate IAA for certain metadata (location, drainage) leading to less reliable attribute prediction, and lack of full benchmarks for combined mined prompting with metadata-guided generation. Further improvement is plausible with specialized wound image embeddings, diversity-aware retrieval techniques, formal classifier fine-tuning, and richer clinical correctness/safety evaluation.

Future work includes hybrid architectures integrating both semantically matched exemplars and high-confidence metadata, broader retrieval from clinical wound-care corpora, and advanced prompt re-ranking (Durgapraveen et al., 13 Nov 2025, Karim et al., 12 Oct 2025).

7. Significance and Broader Impact

The MEDIQA-WV 2025 Shared Task serves as a reference benchmark for multimodal clinical reasoning and patient communication, promoting the development of data-efficient, prompt-centric strategies for high-stakes medical NLP. Approaches pioneered—mined few-shot prompting, metadata-guided prompt assembly, lightweight multimodal RAG—demonstrate clear, reproducible gains and serve as strong baselines for real-world asynchronous telemedicine applications. These paradigms are extensible to other domains requiring multimodal VQA with stringent schema and clinical fidelity requirements (Durgapraveen et al., 13 Nov 2025, Karim et al., 12 Oct 2025).
