RadImageNet-VQA Dataset
- RadImageNet-VQA is a large-scale, expert-curated dataset for radiologic VQA, integrating CT and MRI images with diverse question-answer pairs.
- The dataset comprises 750,000 CT/MRI slices and 7.5 million samples (question–answer pairs plus captions), with high linguistic diversity and multi-task benchmarks for robust evaluation.
- It supports anatomy recognition, abnormality detection, and pathology identification with rigorous metrics demonstrating significant gains from multimodal fine-tuning.
RadImageNet-VQA is a large-scale dataset specifically designed for advancing research in radiologic visual question answering (VQA) using computed tomography (CT) and magnetic resonance imaging (MRI) data. Addressing the limitations of prior medical VQA resources (typically dominated by X-ray imagery, small in scale, and susceptible to linguistic shortcuts), RadImageNet-VQA offers expert-curated, high-diversity question–answer pairs grounded in image content. The dataset supports three principal VQA tasks across a wide array of anatomic regions and pathologies, providing a robust benchmark for training and evaluating vision–language models (VLMs) capable of fine-grained radiologic reasoning (Butsanets et al., 19 Dec 2025).
1. Dataset Composition
RadImageNet-VQA contains 750,000 two-dimensional CT and MRI slices, comprising the CT/MRI subset of the original RadImageNet corpus (the precise breakdown between CT and MRI is not specified). Each image is paired with an average of nine VQA pairs and an aligned radiology caption. In total, the dataset provides 7.5 million samples: 6.75 million question–answer (QA) pairs for VQA development and 750,000 captions for image–text alignment.
Three key VQA tasks are featured:
- Anatomy recognition: Identification of the imaged body region, with four question formats per image.
- Abnormality detection: Binary decision (yes/no) on the presence of any abnormal finding.
- Pathology identification: Fine-grained labeling of disease or lesion within the contextually correct anatomic region, with four question formats per image.
The dataset covers eight anatomical regions (abdomen/pelvis, ankle/foot, brain, hip, knee, chest, shoulder, spine) and 97 distinct pathology categories detailed in the source corpus.
| Data Modality | Images | QA Pairs | Captions |
|---|---|---|---|
| CT/MRI (2D slices) | 750,000 | 6,750,000 | 750,000 |
The training split covers all 750,000 images and 7.5 million samples. The benchmark test set comprises 1,000 images, stratified by anatomy and pathology, yielding 9,000 QA pairs uniformly distributed across tasks and question formats.
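These totals follow directly from the per-image annotation budget (four anatomy QAs, one abnormality QA, four pathology QAs, and one caption per image), as the quick check below shows:

```python
qa_per_image = 4 + 1 + 4                  # anatomy + abnormality + pathology = 9 QA pairs per image (on average)

train_images = 750_000
train_qa = train_images * qa_per_image    # 6,750,000 QA pairs
train_samples = train_qa + train_images   # + 750,000 captions = 7,500,000 samples

test_images = 1_000
test_qa = test_images * qa_per_image      # 9,000 benchmark QA pairs
```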
2. Annotation Process and Question Generation
Annotations originate from expert radiologist labels within RadImageNet for modality, anatomical region, and pathology. Each image receives a structured caption generated via templates that verbalize these dimensions (e.g., “A CT scan of the abdomen showing a pancreatic lesion.”). VQA pairs are constructed using expert-designed templates, yielding high linguistic diversity and discouraging exploitation of text-based artifacts.
For anatomy- and pathology-related questions, each format (open-ended, closed-yes, closed-no, multiple-choice) receives 2–7 linguistic variants. Abnormality detection is restricted to yes/no templates. Multiple-choice distractors are sampled from either the same region (for pathology MC) or other regions (for anatomy MC), and all pathology MC questions include a “no pathology seen” option to reduce bias toward abnormal selections.
The QA pairs are generated by scripted pipelines over template banks, obviating the need for manual re-annotation. The design explicitly targets robustness against linguistic shortcuts, ensuring that correct answers require genuine image–text alignment.
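The generation scripts themselves are not reproduced here, but the described pipeline amounts to filling templated question formats from the expert labels. The sketch below illustrates this under assumed template strings, field names, and a hypothetical `region_pathologies` lookup; it is not the authors' code:

```python
import random

# Illustrative template bank; the real bank provides 2-7 phrasings per format,
# and these strings are assumptions, not the released templates.
ANATOMY_OPEN_TEMPLATES = [
    "Which part of the body is shown in this {modality} scan?",
    "What anatomical region does this {modality} image depict?",
]
PATHOLOGY_MC_TEMPLATES = [
    "Which of the following best describes the lesion in this {modality} of the {region}?",
]


def make_anatomy_open(region, modality):
    """Open-ended anatomy question filled from the expert region label."""
    question = random.choice(ANATOMY_OPEN_TEMPLATES).format(modality=modality)
    return {"task": "anatomy", "format": "open", "question": question, "answer": region}


def make_pathology_mc(pathology, region, modality, region_pathologies, n_options=4):
    """Multiple-choice pathology question.

    Distractors come from pathologies of the *same* region, and every
    pathology MC question includes a 'no pathology seen' option to
    counter bias toward abnormal answers.
    """
    distractors = random.sample(
        [p for p in region_pathologies[region] if p != pathology], n_options - 2
    )
    options = distractors + [pathology, "no pathology seen"]
    random.shuffle(options)
    letters = "ABCD"
    stem = random.choice(PATHOLOGY_MC_TEMPLATES).format(modality=modality, region=region)
    question = stem + " " + " ".join(f"{letters[i]}) {o}" for i, o in enumerate(options))
    return {
        "task": "pathology",
        "format": "mc",
        "question": question,
        "answer": letters[options.index(pathology)],
    }
```

For instance, `make_pathology_mc("soft tissue mass", "abdomen/pelvis", "CT", region_pathologies)` would yield an item analogous to the abdominal MC example in Section 3.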
3. Task Types, Pathology Coverage, and Example Interactions
RadImageNet-VQA’s task structure supports comprehensive evaluation of VLM image-grounding and clinical reasoning capabilities. Each image is paired with:
- Anatomy recognition (4 QAs): e.g., open-ended (“Which part of the body is shown in this MRI scan?” → “Knee”), closed-yes, closed-no, multiple choice.
- Abnormality detection (1 QA): closed-ended (“Does this image show any abnormal finding?” → “Yes”).
- Pathology identification (4 QAs): open-ended, closed-yes, closed-no (e.g., “Is there a meniscal tear present?” → “No”), and multiple choice (“Which of the following best describes the lesion in this CT of the abdomen? A) gallstone B) soft tissue mass C) bowel inflammation D) no pathology seen” → “B”).
Pathology coverage spans 97 categories, distributed across the eight anatomical regions. For example, the abdomen/pelvis region includes granular categories such as adrenal pathology, biliary dilation, gallstones, soft tissue mass, and urolithiasis, among others, supporting both general and highly specific VQA use cases.
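Put together, the annotation bundle for a single image can be pictured as one caption plus nine QA entries. The record below is a hand-written illustration built around the abdominal CT example above; the field names, file name, and exact question phrasings are assumptions, not the dataset's actual schema:

```python
sample = {
    "image": "ct_abdomen_000123.png",   # hypothetical file name
    "modality": "CT",
    "region": "abdomen/pelvis",
    "caption": "A CT scan of the abdomen showing a soft tissue mass.",
    "qa_pairs": [
        # Anatomy recognition (4 formats)
        {"task": "anatomy", "format": "open",       "question": "Which part of the body is shown in this CT scan?", "answer": "Abdomen/pelvis"},
        {"task": "anatomy", "format": "closed_yes", "question": "Is this a CT scan of the abdomen?",                "answer": "Yes"},
        {"task": "anatomy", "format": "closed_no",  "question": "Is this a CT scan of the brain?",                  "answer": "No"},
        {"task": "anatomy", "format": "mc",         "question": "Which region is imaged? A) abdomen/pelvis B) chest C) knee D) spine", "answer": "A"},
        # Abnormality detection (1 question)
        {"task": "abnormality", "format": "closed", "question": "Does this image show any abnormal finding?",       "answer": "Yes"},
        # Pathology identification (4 formats)
        {"task": "pathology", "format": "open",       "question": "What pathology is seen in this CT of the abdomen?", "answer": "Soft tissue mass"},
        {"task": "pathology", "format": "closed_yes", "question": "Is there a soft tissue mass present?",              "answer": "Yes"},
        {"task": "pathology", "format": "closed_no",  "question": "Is there a gallstone present?",                     "answer": "No"},
        {"task": "pathology", "format": "mc",         "question": "Which of the following best describes the lesion in this CT of the abdomen? A) gallstone B) soft tissue mass C) bowel inflammation D) no pathology seen", "answer": "B"},
    ],
}
```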
4. Evaluation Metrics and Baseline Performance
Benchmarking employs exact-match accuracy for closed-ended and multiple-choice tasks. Open-ended questions are evaluated with an LLM-as-a-judge (Mistral-Large 2.1), applying a strict correct/incorrect rubric; mean accuracy per task and format is reported. The standard accuracy formula,

$$\text{Accuracy} = \frac{\text{number of correctly answered questions}}{\text{total number of questions}}$$

is referenced, with the focus on overall and per-task mean accuracy.
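Scoring therefore reduces to normalized exact matching for closed and multiple-choice items plus a binary judgment for open-ended answers. The sketch below illustrates both steps; the helper names and judge prompt wording are assumptions (the paper specifies only that Mistral-Large 2.1 serves as the judge with a strict correct/incorrect rubric):

```python
def exact_match_accuracy(predictions, references):
    """Accuracy for closed-ended and multiple-choice items via normalized exact match."""
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Strict binary rubric for open-ended answers, scored by an external LLM judge.
JUDGE_PROMPT = (
    "You are grading a radiology VQA answer.\n"
    "Question: {question}\nReference answer: {reference}\nModel answer: {prediction}\n"
    "Reply with exactly one word: 'correct' or 'incorrect'."
)


def judge_open_ended(llm_judge, question, reference, prediction):
    """Return True if the judge model deems the open-ended answer correct."""
    verdict = llm_judge(JUDGE_PROMPT.format(
        question=question, reference=reference, prediction=prediction
    ))
    return verdict.strip().lower().startswith("correct")
```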
Results on the 9,000-pair test benchmark reveal:
- General-purpose VLMs (e.g., InternVL 3.5-14B): overall ~63.6% accuracy; anatomy MC 93.3–98.2%; abnormality (yes/no) 74.4%; pathology open-ended ~11.7%; pathology MC ~47.1%.
- Medical-oriented VLMs (e.g., Lingshu-7B): overall ~60.4%; pathology open-ended ~15.7%; pathology MC ~29.6%.
Text-only analysis indicates that on earlier datasets (VQA-RAD, SLAKE), models answering from the question text alone still reach 11–33% on open-ended questions and score well above random on MC, evidencing susceptibility to linguistic shortcuts. On RadImageNet-VQA, text-only performance collapses to near-random (2–10% open-ended; ~25% MC), empirically demonstrating resistance to non-image-based reasoning shortcuts.
Fine-tuning on the multimodal corpus (the full training split) yields substantial gains of +19–22 points in overall accuracy; for example, LLaVA-OneVision improves from ~57.5% to ~79.9% average accuracy, pathology open-ended from ~16% to ~42%, and anatomy MC from ~88.7% to ~99.4%.
5. Data Access, Licensing, and Research Implications
RadImageNet-VQA is available for research use at https://huggingface.co/datasets/raidium/RadImageNet-VQA. The dataset is publicly released under terms specified by the authors, with researchers directed to consult the HuggingFace dataset card for licensing and permissible usage.
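As a starting point, the corpus can be pulled with the Hugging Face `datasets` library; the split name and printed columns below are assumptions to be verified against the dataset card:

```python
from datasets import load_dataset

# Download RadImageNet-VQA from the Hugging Face Hub
# (split and column names are assumptions; see the dataset card for the real schema).
ds = load_dataset("raidium/RadImageNet-VQA", split="test")

example = ds[0]
print(ds.column_names)   # inspect the actual fields (image, question, answer, ...)
print(example)
```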
By its scale, expert annotation, multi-task structure, and empirical resistance to text-only shortcuts, RadImageNet-VQA constitutes a rigorous resource for developing and benchmarking radiology vision–language models. Its design directly addresses core VQA challenges in medical imaging, especially for CT and MRI, allowing precise assessment of model image-grounding and fine-grained pathology reasoning (Butsanets et al., 19 Dec 2025).
6. Position Within the Medical VQA Landscape
RadImageNet-VQA represents a substantial advance over prior medical VQA datasets, which have been limited by small scale, exclusive focus on X-ray or illustration-derived data, and vulnerability to exploitation of dataset artifacts. Its text-only ablations empirically show that vision–language models require genuine image understanding to achieve high performance, setting a new standard for rigorous VQA evaluation in radiology.
A plausible implication is that future developments in radiology VLMs will increasingly rely on such large-scale, expertly annotated, linguistically robust resources to achieve clinically relevant image–text integration and robust fine-grained understanding, particularly in open-ended radiologic question answering.