Kvasir-VQA-x1 Medical VQA Dataset
- Kvasir-VQA-x1 is a large-scale multimodal dataset combining 6,500 GI endoscopic images with nearly 220,000 QA pairs to advance medical VQA.
- The dataset introduces stratified QA complexity (levels 1 to 3) and leverages LLM-based prompt engineering for naturalized clinical queries.
- Incorporating robust image augmentations and following FAIR principles, Kvasir-VQA-x1 supports rigorous training and evaluation of clinical AI models.
Kvasir-VQA-x1 is a large-scale multimodal dataset designed to advance medical visual question answering (MedVQA) and reasoning applied to gastrointestinal (GI) endoscopy. By building upon and extending the original Kvasir-VQA resource, Kvasir-VQA-x1 introduces a substantial increase in question-answer (QA) pair complexity, diversity, and volume, enabling rigorous evaluation and robust training of multimodal AI systems for clinical decision support.
1. Dataset Expansion and Composition
Kvasir-VQA-x1 inherits the original 6,500 GI endoscopic images from HyperKvasir and Kvasir-Instrument along with the associated ~58,849 QA pairs, then augments this base with 159,549 new, more complex QA pairs. Every data entry contains an image identifier (img_id), a natural language question probing a clinical aspect, a validated and naturalized answer, JSON metadata tracking the original atomic QA pairs it was built from, a complexity score (1, 2, or 3), and categorical tags for specific clinical content (e.g., polyp type, abnormality color, instrument presence).
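A minimal sketch of what one record might look like, assuming illustrative field names and values based on the description above (the released files should be consulted for the exact schema):

```python
# Hypothetical Kvasir-VQA-x1 entry; field names and values are illustrative,
# not a verbatim record from the release.
example_entry = {
    "img_id": "example_image_0001",
    "question": "What type of polyp is present and where is it located?",
    "answer": "A sessile polyp is visible in the sigmoid colon.",   # naturalized answer (invented example)
    "original": [                                                   # source atomic QA pairs, stored as JSON
        {"q": "What type of polyp is present?", "a": "sessile"},
        {"q": "Where in the image is the polyp located?", "a": "sigmoid colon"},
    ],
    "complexity": 2,                                                # number of atomic pairs merged (1-3)
    "question_class": ["polyp type", "location"],                   # categorical clinical tags
}
```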
The structure of Kvasir-VQA-x1 is illustrated in the following table:
| Component | Quantity | Description |
|---|---|---|
| Images | 6,500 | Endoscopic GI visuals with clinical diversity |
| Original QA pairs | ~58,849 | Annotated atomic question-answer pairs |
| New QA pairs | 159,549 | Stratified by complexity; merged, naturalized |
| Complexity levels | 1, 2, or 3 | Level of reasoning required per pair |
This systematic extension addresses the need for deeper clinical reasoning and richer multimodal supervision compared to prior datasets.
2. Question Types and Complexity Stratification
The QA pairs exhibit a comprehensive range of medical inquiry:
- Direct questions: yes/no, single-choice, multiple-choice
- Attribute probing: color, location, numerical count
- Compositional/multi-hop questions: Merged queries demand inference across several atomic facts, such as “What type of polyp is present and where is it located?” requiring integration of diagnosis and spatial understanding.
Complexity is stratified as follows:
- Level 1: Direct factual recall; a single atomic QA pair
- Level 2: Moderate reasoning; two atomic QA pairs merged
- Level 3: High-order reasoning; three atomic QA pairs merged
This layered approach facilitates curriculum-based model training and benchmarks inference capabilities across a spectrum of reasoning depths.
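As a concrete illustration of curriculum-style use, the sketch below simply orders training entries by their complexity score; it assumes the hypothetical record layout shown in Section 1 and is not part of any official tooling.

```python
def curriculum_order(entries):
    """Order dataset entries from level-1 (single-fact) to level-3 (multi-hop) questions.

    Assumes each entry is a dict with a `complexity` field in {1, 2, 3}, as
    described above; sorting by complexity is one illustrative curriculum choice.
    """
    return sorted(entries, key=lambda e: e["complexity"])

# Usage sketch: feed level-1 questions first, then progressively harder ones.
# ordered_entries = curriculum_order(dataset_entries)
```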
3. Automated Generation Methodology
To expand the QA content, a structured pipeline was devised:
- QA grouping and filtering: Trivial questions are excluded and the remaining atomic QA pairs are grouped per image.
- Combinatorial sampling: Random selection of 1–3 distinct atomic QA pairs for merging.
- Prompt engineering with LLMs: Qwen3-30B-A3B synthesizes coherent, clinically relevant questions from the sampled atomic pairs; answers are naturalized for fluency and clinical appropriateness.
- Formatting enforcement: JSON-encodable output, categorical labeling, and consistency checks.
This methodology enables stratified complexity, natural language diversity, and scalable annotation applicable across the large image set.
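A minimal sketch of the sampling-and-merging step, assuming a generic chat-completion helper (`generate_with_llm`) standing in for the Qwen3-30B-A3B call; the prompt wording and output schema are illustrative, not the authors' exact pipeline:

```python
import json
import random

def merge_atomic_qas(atomic_qas, generate_with_llm, max_merge=3):
    """Sample 1-3 atomic QA pairs for one image and ask an LLM to merge them.

    `atomic_qas`: list of {"q": ..., "a": ...} dicts for a single image.
    `generate_with_llm`: callable sending a prompt to the LLM (e.g. Qwen3-30B-A3B)
    and returning its text output. Prompt and JSON schema here are illustrative.
    """
    k = random.randint(1, min(max_merge, len(atomic_qas)))   # number merged = complexity level
    sampled = random.sample(atomic_qas, k)

    prompt = (
        "Combine the following question-answer pairs about one endoscopic image "
        "into a single coherent, clinically phrased question and a naturalized answer. "
        "Return JSON with keys 'question' and 'answer'.\n"
        + json.dumps(sampled, indent=2)
    )
    merged = json.loads(generate_with_llm(prompt))            # formatting enforcement / consistency check
    return {
        "question": merged["question"],
        "answer": merged["answer"],
        "original": sampled,
        "complexity": k,
    }
```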
4. Robustness via Visual Augmentation
Recognizing the variability and artifacts inherent in clinical imaging, Kvasir-VQA-x1 incorporates weak image augmentations:
- RandomResizedCrop: scaling between 0.9–1.0
- RandomRotation: small random rotations
- RandomAffine: translation up to 10%
- ColorJitter: randomization of brightness and contrast
QA pairs are probabilistically paired with images: 77% use an augmented view and 23% use the original image. This dual track ("Original" and "Transformed") enables explicit evaluation of model robustness to realistic perturbations found in clinical settings.
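A sketch of how such a weak-augmentation pipeline could be expressed with torchvision; only the crop scale (0.9–1.0), the 10% translation bound, and the 77/23 pairing ratio are specified above, so the rotation angle, jitter magnitudes, and image size below are assumptions:

```python
import random
from torchvision import transforms

# Weak augmentations mirroring the list above; unspecified magnitudes are illustrative guesses.
weak_augment = transforms.Compose([
    transforms.RandomResizedCrop(size=448, scale=(0.9, 1.0)),   # image size is an assumption
    transforms.RandomRotation(degrees=10),                       # assumed small angle
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),    # translation up to 10%
    transforms.ColorJitter(brightness=0.1, contrast=0.1),        # assumed mild jitter
])

def pair_image(image, p_augment=0.77):
    """Return a transformed view with probability 0.77, otherwise the original image."""
    return weak_augment(image) if random.random() < p_augment else image
```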
5. FAIR Data Principles and Accessibility
Kvasir-VQA-x1 adheres strictly to the FAIR data framework:
- Findable: Hosted on Hugging Face Datasets Hub.
- Accessible: Downloadable via web, Python API, and CLI.
- Interoperable: Standard JSON format compatible with ML pipelines.
- Reusable: Clear documentation, open licensing (CC BY-NC 4.0), supports broad research use.
All code and dataset artifacts are publicly available, promoting transparency, reproducibility, and community engagement.
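For instance, a minimal loading sketch using the Hugging Face `datasets` library; the repository identifier below is a placeholder, so check the dataset card on the Hub for the exact name:

```python
from datasets import load_dataset

# "SimulaMet-HOST/Kvasir-VQA-x1" is an assumed repository id; replace it with
# the identifier given on the dataset's Hugging Face card.
ds = load_dataset("SimulaMet-HOST/Kvasir-VQA-x1")

print(ds)             # splits and features as published on the Hub
print(ds["train"][0]) # one QA entry (image reference, question, answer, complexity, ...)
```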
6. Applications and Implications in Clinical AI
Kvasir-VQA-x1 supports:
- Training advanced vision-LLMs: MedGemma, Qwen2.5-VL, and similar architectures for clinical decision support.
- Evaluating diagnostic robustness: Stratified benchmarks to test reasoning complexity and handling of visual artifacts.
- Curriculum learning: Fine-grained evaluation of inference across complexity levels.
- Clinical interpretability: Targeting both straightforward and high-order reasoning tasks.
A plausible implication is that integrated MedVQA models, when trained on Kvasir-VQA-x1, can deliver improved diagnostic accuracy and reliability, reduce clinician workload, and facilitate remote or telemedicine consultation. The explicit robustness evaluation may further reduce error modes in automated endoscopy analysis.
7. Significance and Research Catalysis
Kvasir-VQA-x1 establishes a challenging, clinically realistic benchmark. By merging sophisticated LLM-based QA generation with complexity stratification and visual perturbation, it compels the development and assessment of reliable, trustworthy, explainable AI systems for gastroenterology. Its comprehensive documentation and public accessibility make it a catalytic resource for research advancing multimodal medical reasoning, robust clinical VQA, and ultimately safer and more effective AI deployment in medicine (Gautam et al., 11 Jun 2025).