Kvasir-VQA-x1 Medical VQA Dataset

Updated 11 August 2025
  • Kvasir-VQA-x1 is a large-scale multimodal dataset combining 6,500 GI endoscopic images with roughly 218,000 QA pairs (~58,849 original plus 159,549 new) to advance medical VQA.
  • The dataset introduces stratified QA complexity (levels 1 to 3) and leverages LLM-based prompt engineering to produce naturalized clinical queries.
  • Incorporating image augmentations for robustness evaluation and following FAIR principles, Kvasir-VQA-x1 supports rigorous training and evaluation of clinical AI models.

Kvasir-VQA-x1 is a large-scale multimodal dataset designed to advance medical visual question answering (MedVQA) and reasoning applied to gastrointestinal (GI) endoscopy. By building upon and extending the original Kvasir-VQA resource, Kvasir-VQA-x1 introduces a substantial increase in question-answer (QA) pair complexity, diversity, and volume, enabling rigorous evaluation and robust training of multimodal AI systems for clinical decision support.

1. Dataset Expansion and Composition

Kvasir-VQA-x1 inherits the original 6,500 GI endoscopic images from HyperKvasir and Kvasir-Instrument and the associated ~58,849 QA pairs, then augments this base with 159,549 new, complex QA pairs. Every data entry contains an image identifier (img_id), a natural language question probing a clinical aspect, a validated and naturalized answer, metadata tracking the original atomic QA pairs (in JSON), a complexity score (c ∈ {1, 2, 3}), and categorical tags for specific clinical content (e.g., polyp type, abnormality color, instrument presence).
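
As a concrete illustration of this schema, a single record might look like the following sketch; the img_id value and any field names beyond those stated above are assumptions:

```python
# Illustrative record; field names follow the description above, values are hypothetical.
record = {
    "img_id": "cla8k2u2w0000",  # image identifier (hypothetical value)
    "question": "What type of polyp is present and where is it located?",
    "answer": "A sessile polyp is visible in the sigmoid colon.",
    "original": [  # metadata tracking the atomic QA pairs (stored as JSON)
        {"question": "What type of polyp is present?", "answer": "Sessile"},
        {"question": "Where in the image is the polyp?", "answer": "Sigmoid colon"},
    ],
    "complexity": 2,  # c ∈ {1, 2, 3}
    "categories": ["polyp type", "location"],  # categorical clinical tags (key name assumed)
}
```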

The structure of Kvasir-VQA-x1 is illustrated in the following table:

| Component | Quantity | Description |
|-----------|----------|-------------|
| Images | 6,500 | Endoscopic GI visuals with clinical diversity |
| Original QA pairs | ~58,849 | Annotated atomic question-answer pairs |
| New QA pairs | 159,549 | Stratified by complexity; merged, naturalized |
| Complexity levels | 1, 2, or 3 | Level of reasoning required per pair |

This systematic extension addresses the need for deeper clinical reasoning and richer multimodal supervision compared to prior datasets.

2. Question Types and Complexity Stratification

The QA pairs exhibit a comprehensive range of medical inquiry:

  • Direct questions: yes/no, single-choice, multiple-choice
  • Attribute probing: color, location, numerical count
  • Compositional/multi-hop questions: merged queries that demand inference across several atomic facts, such as “What type of polyp is present and where is it located?”, which requires integrating diagnosis and spatial understanding.

Complexity is stratified as follows:

  • Level 1 (c = 1): Direct factual recall; a single atomic QA pair
  • Level 2 (c = 2): Moderate reasoning; merges two atomic QA pairs
  • Level 3 (c = 3): High-order reasoning; merges three atomic QA pairs

This layered approach facilitates curriculum-based model training and benchmarks inference capabilities across a spectrum of reasoning depths.
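
To make the curriculum idea concrete, the sketch below builds training stages of increasing reasoning depth by filtering on the complexity field; it assumes records shaped like the example in Section 1:

```python
def curriculum_stages(records):
    """Yield training subsets of increasing reasoning depth: c = 1, then c <= 2, then c <= 3."""
    for max_c in (1, 2, 3):
        yield max_c, [r for r in records if r["complexity"] <= max_c]

# Usage sketch: train on simpler QA pairs first, then progressively include harder ones.
# for max_c, subset in curriculum_stages(dataset):
#     train_one_epoch(model, subset)  # train_one_epoch is a placeholder, not part of the dataset
```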

3. Automated Generation Methodology

To expand the QA content, a structured pipeline was devised:

  1. QA grouping and filtering: Trivial questions excluded, atomic QA pairs grouped per image.
  2. Combinatorial sampling: Random selection of 1–3 distinct atomic QA pairs for merging.
  3. Prompt engineering with LLMs: Utilization of Qwen3-30B-A3B to synthesize coherent, clinically relevant questions from the sampled atomics. Answers are naturalized for fluency and appropriateness.
  4. Formatting enforcement: JSON-encodable output, categorical labeling, and consistency checks.

This methodology enables stratified complexity, natural language diversity, and scalable annotation applicable across the large image set.
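
The following sketch mirrors the four pipeline steps; the prompt wording and the llm_generate callable (standing in for Qwen3-30B-A3B) are assumptions, not the authors' exact implementation:

```python
import json
import random
from collections import defaultdict

def generate_complex_qa(atomic_qa, llm_generate, samples_per_image=5):
    """atomic_qa: dicts with img_id/question/answer (trivial questions already filtered).
    llm_generate: callable sending a prompt to an LLM and returning its text output."""
    # Step 1: group atomic QA pairs per image.
    per_image = defaultdict(list)
    for qa in atomic_qa:
        per_image[qa["img_id"]].append(qa)

    new_pairs = []
    for img_id, pairs in per_image.items():
        for _ in range(samples_per_image):
            # Step 2: combinatorial sampling of 1-3 distinct atomic pairs.
            c = random.randint(1, min(3, len(pairs)))
            sampled = random.sample(pairs, c)
            # Step 3: prompt the LLM to merge them into one naturalized question/answer.
            prompt = (
                "Merge these question-answer pairs about one endoscopic image into a "
                "single clinically phrased question and fluent answer. Return JSON "
                'with keys "question" and "answer".\n' + json.dumps(sampled)
            )
            merged = json.loads(llm_generate(prompt))  # assumes valid JSON; real pipelines validate/retry
            # Step 4: formatting enforcement - keep provenance and complexity for consistency checks.
            new_pairs.append({
                "img_id": img_id,
                "question": merged["question"],
                "answer": merged["answer"],
                "original": sampled,
                "complexity": c,
            })
    return new_pairs
```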

4. Robustness via Visual Augmentation

Recognizing the variability and artifacts inherent in clinical imaging, Kvasir-VQA-x1 incorporates weak image augmentations:

  • RandomResizedCrop: scaling between 0.9–1.0
  • RandomRotation: ±10°
  • RandomAffine: translation up to 10%
  • ColorJitter: randomization of brightness and contrast

Each QA pair is probabilistically matched with an image: 77% with an augmented version, 23% with the original. This dual-track design (“Original” and “Transformed”) enables explicit evaluation of model robustness to realistic perturbations found in clinical settings.
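
A minimal torchvision sketch of these augmentations and the 77/23 pairing; the output size and jitter strengths are assumptions, since only the parameters listed above are stated:

```python
import random
from torchvision import transforms

# Weak augmentations as listed: crop scale 0.9-1.0, rotation ±10°, translation ≤10%, color jitter.
augment = transforms.Compose([
    transforms.RandomResizedCrop(size=448, scale=(0.9, 1.0)),  # output size is an assumption
    transforms.RandomRotation(degrees=10),                     # i.e., ±10°
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),  # up to 10% translation
    transforms.ColorJitter(brightness=0.2, contrast=0.2),      # jitter strength is an assumption
])

def pair_image(image):
    """Return ("Transformed", augmented image) with probability 0.77, else ("Original", image)."""
    if random.random() < 0.77:
        return "Transformed", augment(image)
    return "Original", image
```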

5. FAIR Data Principles and Accessibility

Kvasir-VQA-x1 adheres strictly to the FAIR data framework:

  • Findable: Hosted on Hugging Face Datasets Hub.
  • Accessible: Downloadable via web, Python API, and CLI.
  • Interoperable: Standard JSON format compatible with ML pipelines.
  • Reusable: Clear documentation, open licensing (CC BY-NC 4.0), supports broad research use.

All code and dataset artifacts are publicly available, promoting transparency, reproducibility, and community engagement.
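
As an example of the Python access path, the snippet below uses the Hugging Face datasets library; the repository identifier and split name are assumptions, so check the Hub page for the exact values:

```python
from datasets import load_dataset

# Repository ID is assumed; substitute the actual Hugging Face dataset name.
ds = load_dataset("SimulaMet/Kvasir-VQA-x1")
print(ds)              # available splits, feature schema, and row counts
print(ds["train"][0])  # one QA record ("train" split name is an assumption)
```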

6. Applications and Implications in Clinical AI

Kvasir-VQA-x1 supports:

  • Training advanced vision-language models: MedGemma, Qwen2.5-VL, and similar architectures for clinical decision support.
  • Evaluating diagnostic robustness: stratified benchmarks to test reasoning complexity and handling of visual artifacts (a scoring sketch follows this list).
  • Curriculum learning: Fine-grained evaluation of inference across complexity levels.
  • Clinical interpretability: Targeting both straightforward and high-order reasoning tasks.
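
A minimal sketch of complexity-stratified scoring, assuming records shaped as in Section 1 and a predict callable for the model under test; exact-match comparison is illustrative, and clinical benchmarks would use stronger answer metrics:

```python
from collections import defaultdict

def stratified_accuracy(examples, predict):
    """examples: dicts with "answer" and "complexity"; predict: model callable returning a string."""
    hits, totals = defaultdict(int), defaultdict(int)
    for ex in examples:
        c = ex["complexity"]
        totals[c] += 1
        if predict(ex).strip().lower() == ex["answer"].strip().lower():  # exact match, illustrative
            hits[c] += 1
    return {c: hits[c] / totals[c] for c in sorted(totals)}  # accuracy per complexity level
```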

A plausible implication is that integrated MedVQA models, when trained on Kvasir-VQA-x1, can deliver improved diagnostic accuracy and reliability, reduce clinician workload, and facilitate remote or telemedicine consultation. The explicit robustness evaluation may further reduce error modes in automated endoscopy analysis.

7. Significance and Research Catalysis

Kvasir-VQA-x1 establishes a challenging, clinically realistic benchmark. By merging sophisticated LLM-based QA generation with complexity stratification and visual perturbation, it compels the development and assessment of reliable, trustworthy, explainable AI systems for gastroenterology. Its comprehensive documentation and public accessibility make it a catalytic resource for research advancing multimodal medical reasoning, robust clinical VQA, and ultimately safer and more effective AI deployment in medicine (Gautam et al., 11 Jun 2025).
