Kvasir-VQA-x1 Medical VQA Dataset
- Kvasir-VQA-x1 is a large-scale multimodal dataset combining 6,500 GI endoscopic images with nearly 220,000 QA pairs to advance medical VQA.
- The dataset introduces stratified QA complexity (levels 1 to 3) and leverages LLM-based prompt engineering for naturalized clinical queries.
- Incorporating robust image augmentations and following FAIR principles, Kvasir-VQA-x1 supports rigorous training and evaluation of clinical AI models.
Kvasir-VQA-x1 is a large-scale multimodal dataset designed to advance medical visual question answering (MedVQA) and reasoning applied to gastrointestinal (GI) endoscopy. By building upon and extending the original Kvasir-VQA resource, Kvasir-VQA-x1 introduces a substantial increase in question-answer (QA) pair complexity, diversity, and volume, enabling rigorous evaluation and robust training of multimodal AI systems for clinical decision support.
1. Dataset Expansion and Composition
Kvasir-VQA-x1 inherits the original 6,500 GI endoscopic images from HyperKvasir and Kvasir-Instrument along with the associated ~58,849 QA pairs, then augments this base with 159,549 new, more complex QA pairs. Every data entry contains an image identifier (img_id), a natural language question probing a clinical aspect, a validated and naturalized answer, JSON metadata tracking the original atomic QA pairs it was built from, a complexity score (1, 2, or 3), and categorical tags for specific clinical content (e.g., polyp type, abnormality color, instrument presence).
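A minimal sketch of what one record might look like, assuming illustrative field names and values based on the description above (the released files should be consulted for the exact schema):

```python
# Hypothetical Kvasir-VQA-x1 entry; field names and values are illustrative,
# not a verbatim record from the release.
example_entry = {
    "img_id": "example_image_0001",
    "question": "What type of polyp is present and where is it located?",
    "answer": "A sessile polyp is visible in the sigmoid colon.",   # naturalized answer (invented example)
    "original": [                                                   # source atomic QA pairs, stored as JSON
        {"q": "What type of polyp is present?", "a": "sessile"},
        {"q": "Where in the image is the polyp located?", "a": "sigmoid colon"},
    ],
    "complexity": 2,                                                # number of atomic pairs merged (1-3)
    "question_class": ["polyp type", "location"],                   # categorical clinical tags
}
```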
The structure of Kvasir-VQA-x1 is illustrated in the following table:
| Component | Quantity | Description |
|---|---|---|
| Images | 6,500 | Endoscopic GI visuals with clinical diversity |
| Original QA pairs | ~58,849 | Annotated atomic question-answer pairs |
| New QA pairs | 159,549 | Stratified by complexity; merged, naturalized |
| Complexity levels | 1, 2, or 3 | Level of reasoning required per pair |
This systematic extension addresses the need for deeper clinical reasoning and richer multimodal supervision compared to prior datasets.
2. Question Types and Complexity Stratification
The QA pairs exhibit a comprehensive range of medical inquiry:
- Direct questions: yes/no, single-choice, multiple-choice
- Attribute probing: color, location, numerical count
- Compositional/multi-hop questions: Merged queries demand inference across several atomic facts, such as “What type of polyp is present and where is it located?” requiring integration of diagnosis and spatial understanding.
Complexity is stratified as follows:
- Level 1: Direct factual recall; a single atomic QA pair
- Level 2: Moderate reasoning; two atomic QA pairs merged
- Level 3: High-order reasoning; three atomic QA pairs merged
This layered approach facilitates curriculum-based model training and benchmarks inference capabilities across a spectrum of reasoning depths.
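As a concrete illustration of curriculum-style use, the sketch below simply orders training entries by their complexity score; it assumes the hypothetical record layout shown in Section 1 and is not part of any official tooling.

```python
def curriculum_order(entries):
    """Order dataset entries from level-1 (single-fact) to level-3 (multi-hop) questions.

    Assumes each entry is a dict with a `complexity` field in {1, 2, 3}, as
    described above; sorting by complexity is one illustrative curriculum choice.
    """
    return sorted(entries, key=lambda e: e["complexity"])

# Usage sketch: feed level-1 questions first, then progressively harder ones.
# ordered_entries = curriculum_order(dataset_entries)
```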
3. Automated Generation Methodology
To expand the QA content, a structured pipeline was devised:
- QA grouping and filtering: Trivial questions are excluded and the remaining atomic QA pairs are grouped per image.
- Combinatorial sampling: Random selection of 1–3 distinct atomic QA pairs for merging.
- Prompt engineering with LLMs: Qwen3-30B-A3B synthesizes coherent, clinically relevant questions from the sampled atomic pairs; answers are naturalized for fluency and clinical appropriateness.
- Formatting enforcement: JSON-encodable output, categorical labeling, and consistency checks.
This methodology enables stratified complexity, natural language diversity, and scalable annotation applicable across the large image set.
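A minimal sketch of the sampling-and-merging step, assuming a generic chat-completion helper (`generate_with_llm`) standing in for the Qwen3-30B-A3B call; the prompt wording and output schema are illustrative, not the authors' exact pipeline:

```python
import json
import random

def merge_atomic_qas(atomic_qas, generate_with_llm, max_merge=3):
    """Sample 1-3 atomic QA pairs for one image and ask an LLM to merge them.

    `atomic_qas`: list of {"q": ..., "a": ...} dicts for a single image.
    `generate_with_llm`: callable sending a prompt to the LLM (e.g. Qwen3-30B-A3B)
    and returning its text output. Prompt and JSON schema here are illustrative.
    """
    k = random.randint(1, min(max_merge, len(atomic_qas)))   # number merged = complexity level
    sampled = random.sample(atomic_qas, k)

    prompt = (
        "Combine the following question-answer pairs about one endoscopic image "
        "into a single coherent, clinically phrased question and a naturalized answer. "
        "Return JSON with keys 'question' and 'answer'.\n"
        + json.dumps(sampled, indent=2)
    )
    merged = json.loads(generate_with_llm(prompt))            # formatting enforcement / consistency check
    return {
        "question": merged["question"],
        "answer": merged["answer"],
        "original": sampled,
        "complexity": k,
    }
```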
4. Robustness via Visual Augmentation
Recognizing the variability and artifacts inherent in clinical imaging, Kvasir-VQA-x1 incorporates weak image augmentations:
- RandomResizedCrop: scaling between 0.9–1.0
- RandomRotation: small random rotations
- RandomAffine: translation up to 10%
- ColorJitter: randomization of brightness and contrast
QA pairs are probabilistically paired with images: 77% use an augmented view and 23% use the original image. This dual track ("Original" and "Transformed") enables explicit evaluation of model robustness to realistic perturbations found in clinical settings.
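A sketch of how such a weak-augmentation pipeline could be expressed with torchvision; only the crop scale (0.9–1.0), the 10% translation bound, and the 77/23 pairing ratio are specified above, so the rotation angle, jitter magnitudes, and image size below are assumptions:

```python
import random
from torchvision import transforms

# Weak augmentations mirroring the list above; unspecified magnitudes are illustrative guesses.
weak_augment = transforms.Compose([
    transforms.RandomResizedCrop(size=448, scale=(0.9, 1.0)),   # image size is an assumption
    transforms.RandomRotation(degrees=10),                       # assumed small angle
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),    # translation up to 10%
    transforms.ColorJitter(brightness=0.1, contrast=0.1),        # assumed mild jitter
])

def pair_image(image, p_augment=0.77):
    """Return a transformed view with probability 0.77, otherwise the original image."""
    return weak_augment(image) if random.random() < p_augment else image
```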
5. FAIR Data Principles and Accessibility
Kvasir-VQA-x1 adheres strictly to the FAIR data framework:
- Findable: Hosted on Hugging Face Datasets Hub.
- Accessible: Downloadable via web, Python API, and CLI.
- Interoperable: Standard JSON format compatible with ML pipelines.
- Reusable: Clear documentation, open licensing (CC BY-NC 4.0), supports broad research use.
All code and dataset artifacts are publicly available, promoting transparency, reproducibility, and community engagement.
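For instance, a minimal loading sketch using the Hugging Face `datasets` library; the repository identifier below is a placeholder, so check the dataset card on the Hub for the exact name:

```python
from datasets import load_dataset

# "SimulaMet-HOST/Kvasir-VQA-x1" is an assumed repository id; replace it with
# the identifier given on the dataset's Hugging Face card.
ds = load_dataset("SimulaMet-HOST/Kvasir-VQA-x1")

print(ds)             # splits and features as published on the Hub
print(ds["train"][0]) # one QA entry (image reference, question, answer, complexity, ...)
```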
6. Applications and Implications in Clinical AI
Kvasir-VQA-x1 supports:
- Training advanced vision-LLMs: MedGemma, Qwen2.5-VL, and similar architectures for clinical decision support.
- Evaluating diagnostic robustness: Stratified benchmarks to test reasoning complexity and handling of visual artifacts.
- Curriculum learning: Fine-grained evaluation of inference across complexity levels.
- Clinical interpretability: Targeting both straightforward and high-order reasoning tasks.
A plausible implication is that integrated MedVQA models, when trained on Kvasir-VQA-x1, can deliver improved diagnostic accuracy and reliability, reduce clinician workload, and facilitate remote or telemedicine consultation. The explicit robustness evaluation may further reduce error modes in automated endoscopy analysis.
7. Significance and Research Catalysis
Kvasir-VQA-x1 establishes a challenging, clinically realistic benchmark. By merging sophisticated LLM-based QA generation with complexity stratification and visual perturbation, it compels the development and assessment of reliable, trustworthy, explainable AI systems for gastroenterology. Its comprehensive documentation and public accessibility make it a catalytic resource for research advancing multimodal medical reasoning, robust clinical VQA, and ultimately safer and more effective AI deployment in medicine (Gautam et al., 11 Jun 2025).