EuropeMedQA Italian: Clinical VQA Dataset
- EuropeMedQA Italian is a clinical VQA dataset containing 60 curated, exam-derived multiple-choice questions paired with real medical images.
- The dataset requires integration of visual and textual inputs, using case vignettes and diagnostic images to prevent reliance on text-only shortcuts.
- Evaluation involves per-item accuracy and image ablation techniques, highlighting notable performance drops in models like GPT-4o when visual data is removed.
The EuropeMedQA Italian Dataset is a clinical visual question answering (VQA) resource designed to evaluate the multimodal medical understanding of large language models in Italian. Constructed from real-world questions from the Italian State Exam for medical licensing (“SSM”), the dataset specifically targets scenarios requiring genuine interpretation of medical images, aiming to provide a robust benchmark for vision-language models (VLMs) in the context of clinical reasoning. It is distinguished by its exclusive focus on questions that cannot be solved from text alone, necessitating integration of both textual and visual modalities (Felizzi et al., 24 Nov 2025).
1. Dataset Composition and Scope
The EuropeMedQA Italian SSM subset comprises 60 five-choice multiple-choice questions, each paired with a distinct medical image and sourced directly from official licensing exam materials. Each item consists of:
- A clinically realistic vignette in Italian
- One associated medical image (e.g., radiographs, CT/MRI, dermatological photos, ECGs, endoscopies)
- Five answer options labeled A–E
- A single correct answer, as specified by the exam’s official key
All questions were manually curated by experts from a broader EuropeMedQA SSM pool, under the explicit inclusion criterion that the correct answer requires visual analysis not attainable from textual context alone. No public repository or download access to the image data is provided; images are used “as-is”, and no details are specified regarding preprocessing, resolution, or format (Felizzi et al., 24 Nov 2025).
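No schema is published with the dataset, but the item structure described above can be summarized in a minimal Python sketch; all field names below are illustrative assumptions rather than an official format.

```python
from dataclasses import dataclass

# Hypothetical record layout for one EuropeMedQA-IT item; the paper does not
# publish a schema, so every field name here is an illustrative assumption.
@dataclass
class SSMItem:
    vignette_it: str          # clinically realistic vignette in Italian
    image_path: str           # path to the associated medical image
    options: dict[str, str]   # five answer options keyed "A".."E"
    answer_key: str           # single correct option per the official exam key
    specialty: str            # e.g., "Cardiology", "Dermatology"

    def __post_init__(self) -> None:
        # Structural checks mirroring the dataset description:
        # exactly five options labeled A-E, and a valid answer key.
        assert set(self.options) == {"A", "B", "C", "D", "E"}
        assert self.answer_key in self.options
```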
2. Specialty Coverage and Distribution
The 60-case set captures a cross-section of clinical specialties, reflecting real-world diagnostic diversity as seen in the Italian medical licensing exam. The coverage is as follows:
| Specialty | % of Questions | Approx. Count |
|---|---|---|
| Cardiology | 27% | ≈16 |
| Dermatology | 13% | ≈8 |
| Orthopedics | 12% | ≈7 |
| Neurology | 10% | ≈6 |
| Gastroenterology | 8% | ≈5 |
| Pulmonology | 8% | ≈5 |
| Preventive medicine/Epidemiology | 5% | ≈3 |
| Oncology | 3% | ≈2 |
| Hematology, Ophthalmology, Trauma Surgery | 2% each | ≈1–2 |
This distribution provides representation across both imaging-intensive and less image-dependent clinical fields. No further breakdowns (e.g., per-diagnosis frequency) or train/validation/test splits are provided; the collection functions solely as a single evaluation set (Felizzi et al., 24 Nov 2025).
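As a quick arithmetic check, the approximate counts in the table follow from rounding each reported percentage of the 60-item total, as the short sketch below shows; note that the published percentages sum to 92%, so the counts cannot reconstruct all 60 items exactly.

```python
# Reproduce the approximate per-specialty counts from the reported percentages
# over the 60-item set. Percentages are taken from the table above.
TOTAL_ITEMS = 60
specialty_pct = {
    "Cardiology": 27, "Dermatology": 13, "Orthopedics": 12,
    "Neurology": 10, "Gastroenterology": 8, "Pulmonology": 8,
    "Preventive medicine/Epidemiology": 5, "Oncology": 3,
    "Hematology": 2, "Ophthalmology": 2, "Trauma Surgery": 2,
}

for name, pct in specialty_pct.items():
    approx = round(pct / 100 * TOTAL_ITEMS)
    print(f"{name}: ~{approx} items ({pct}%)")

# The published percentages sum to 92%, so the derived counts are
# approximations and do not account for all 60 items.
```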
3. Question Structure and Prompting
Questions are formatted in a standardized five-way multiple-choice style. For each item, models are prompted with the Italian-language clinical prompt—including contextual patient/clinical history—together with the raw image. Chain-of-thought (CoT) prompting is used to elicit step-by-step reasoning, requiring not only an answer selection but also a detailed diagnostic justification. For example:
> Osservando l’ECG mostrato, quale diagnosi è più probabile?
> A) Infarto miocardico posteriore
> B) Ischemia subendocardica
> C) Infarto miocardico anteriore con elevazione del tratto ST
> D) Blocco atrioventricolare completo
> E) Pericardite acuta

(English: “Observing the ECG shown, which diagnosis is most likely? A) Posterior myocardial infarction; B) Subendocardial ischemia; C) Anterior myocardial infarction with ST-segment elevation; D) Complete atrioventricular block; E) Acute pericarditis.”)

Correct answer: C (anterior ST-elevation myocardial infarction)
This structure enforces multimodal reasoning constraints and supports granular analysis of model-generated clinical rationales (Felizzi et al., 24 Nov 2025).
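The exact prompt template used in the paper is not published; the following Python sketch illustrates one plausible way to assemble the Italian CoT prompt from an item's vignette and options, with the instruction wording being an assumption.

```python
# Illustrative CoT prompt construction for one item; the paper does not publish
# its exact prompt template, so the instruction wording below is an assumption.
def build_cot_prompt(vignette_it: str, options: dict[str, str]) -> str:
    option_lines = "\n".join(f"{key}) {text}" for key, text in sorted(options.items()))
    return (
        f"{vignette_it}\n\n"
        f"{option_lines}\n\n"
        # Italian instruction requesting step-by-step reasoning plus a final
        # single-letter answer (English: "Reason step by step about the image
        # and the clinical picture, then state the correct option A-E.")
        "Ragiona passo dopo passo sull'immagine e sul quadro clinico, "
        "poi indica l'opzione corretta (A-E)."
    )
```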
4. Annotation, Quality Control, and Limitations
Ground truth labels derive directly from official exam answer keys. The selection of items for the dataset was performed manually by the authors to maximize the need for visual reasoning; however, inter-annotator agreement statistics, detailed quality control procedures, and additional annotation protocol information are not provided. Quality is implicitly controlled by adherence to the official key.
Notable limitations as acknowledged by the authors:
- Small scale (60 items), monolingual (Italian), and single-assessment-source coverage
- No splits by difficulty or consensus-based annotation
- Image-handling standards (e.g., file formats, preprocessing) are unspecified
- The evaluation of visual grounding is coarse, relying solely on blank image substitution as a negative control
- No public dataset release or case-level membership-inference tests to rule out pre-training leakage (Felizzi et al., 24 Nov 2025)
5. Evaluation Protocol and Metrics
Model performance on the EuropeMedQA Italian Dataset is assessed via per-question accuracy, with each item evaluated over independent model runs:
- Per-item accuracy: $\mathrm{acc}_i = c_i / R$, where $c_i$ is the number of correct responses for item $i$ over $R$ independent runs
- Overall accuracy: $\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{acc}_i$ over the $N = 60$ items
- 95% confidence intervals computed via Student’s t-distribution: $\bar{x} \pm t_{0.975,\,n-1}\, s/\sqrt{n}$, where $\bar{x}$ and $s$ are the mean and sample standard deviation of the accuracy estimates over $n$ repetitions
- Visual dependency, operationalized as the drop in accuracy when the image is substituted by a blank (non-informative) placeholder: $\Delta_{\text{vis}} = \mathrm{Acc}_{\text{image}} - \mathrm{Acc}_{\text{blank}}$
Additional metrics such as precision, recall, or F1 are not reported (Felizzi et al., 24 Nov 2025).
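As a concrete illustration of this protocol, the following Python sketch computes per-item accuracy over repeated runs, a t-based 95% confidence interval, and the image-ablation drop; the function names, the choice to compute the interval over per-run accuracies, and the synthetic example counts are all assumptions for illustration.

```python
import numpy as np
from scipy import stats

def summarize(correct: np.ndarray) -> dict:
    """correct: boolean array of shape (n_items, n_runs); correct[i, r] is True
    if the model answered item i correctly on run r."""
    per_item = correct.mean(axis=1)   # acc_i = c_i / R
    per_run = correct.mean(axis=0)    # overall accuracy of each run
    overall = correct.mean()          # overall accuracy
    # 95% CI via Student's t over the per-run overall accuracies (an assumed
    # reading of the protocol; the paper does not spell out the CI unit).
    n = per_run.size
    half = stats.t.ppf(0.975, df=n - 1) * per_run.std(ddof=1) / np.sqrt(n)
    return {"per_item": per_item, "overall": overall,
            "ci95": (overall - half, overall + half)}

def visual_dependency(overall_image: float, overall_blank: float) -> float:
    # Accuracy drop when real images are replaced by blank placeholders.
    return overall_image - overall_blank

# Illustrative usage with synthetic results: 60 items, 5 runs each.
rng = np.random.default_rng(0)
results = summarize(rng.random((60, 5)) < 0.7)
print(f"overall = {results['overall']:.3f}, 95% CI = "
      f"({results['ci95'][0]:.3f}, {results['ci95'][1]:.3f})")
```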
6. Comparison with Other Italian Medical QA Benchmarks
The EuropeMedQA Italian Dataset is distinct both in content and usage from broader Italian medical QA datasets such as MedBench-IT (Lazzaroni et al., 8 Sep 2025) and IMB (Romano et al., 21 Oct 2025):
- MedBench-IT covers 17,410 multiple-choice questions from preparatory (not licensing) materials, across six non-clinical subjects (Biology, Chemistry, Logic, etc.), but it excludes image-based QA entirely.
- IMB offers both extensive open-ended patient-provider dialogue (IMB-QA) and 25,862 multiple-choice specialty exam questions (IMB-MCQA). While IMB-MCQA covers a broader span of medical fields, it is not specifically designed for clinical visual question answering and does not systematically require image-based reasoning.
- EuropeMedQA-IT’s unique contribution is the controlled evaluation of image-text integration in medical VQA, with all problems curated to require visual input (Felizzi et al., 24 Nov 2025). A plausible implication is that compared to IMB or MedBench-IT, EuropeMedQA-IT addresses a narrower but more stringent multimodal diagnostic challenge for model evaluation.
7. Applications, Impact, and Prospects
The EuropeMedQA Italian Dataset is specifically tailored for fine-grained analysis of visual grounding in VLMs within the medical domain. It enables the systematic detection of unimodal textual shortcutting via image ablation (a minimal harness sketch follows the list below). Empirical results using this benchmark demonstrate wide variance in the ability of leading VLMs to genuinely integrate image context: only some models (e.g., GPT-4o) exhibit substantial accuracy drops under image removal (a drop of 27.9 percentage points), while others display minimal sensitivity, suggesting persistent reliance on textual priors. The dataset thus functions as a crucial tool for:
- Diagnosing shortcut risks in medical VLMs intended for clinical environments
- Quantitative benchmarking of multimodal grounding, beyond text-only medical QA
- Exposing structural weaknesses in current evaluation protocols for clinical decision-support models
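A minimal sketch of such a blank-image ablation harness is given below; `ask_model` is a hypothetical callable standing in for any VLM query function, and the item attributes (`prompt`, `image_path`, `answer_key`) are illustrative assumptions, not fields published with the dataset.

```python
from PIL import Image

def blank_placeholder(size: tuple[int, int] = (512, 512)) -> Image.Image:
    # Non-informative substitute image used as the negative control.
    return Image.new("RGB", size, color="white")

def ablation_delta(items, ask_model, runs: int = 5) -> float:
    """Return the accuracy drop (in percentage points) when real images are
    replaced by blank placeholders. `ask_model(prompt, image)` is a
    hypothetical stand-in for any VLM call returning an option letter."""
    def accuracy(use_blank: bool) -> float:
        correct = 0
        for item in items:
            image = blank_placeholder() if use_blank else Image.open(item.image_path)
            for _ in range(runs):
                if ask_model(item.prompt, image) == item.answer_key:
                    correct += 1
        return correct / (len(items) * runs)

    return 100 * (accuracy(use_blank=False) - accuracy(use_blank=True))
```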
Broader adoption or expansion would require larger-scale, multilingual, and more systematically annotated collections, incorporating granular difficulty controls, standardized preprocessing, and public image access. Nevertheless, the EuropeMedQA Italian Dataset establishes an essential, if limited, reference point for clinical VQA research in richly inflected non-English medical settings (Felizzi et al., 24 Nov 2025).