Kvasir Dataset: GI Endoscopy Imaging
- The Kvasir dataset is a curated collection of annotated gastrointestinal endoscopic images covering key anatomical landmarks and pathologies for tasks like classification, segmentation, and VQA.
- It includes specialized extensions such as Kvasir-SEG, Kvasir-VQA, and Kvasir-Capsule, offering precise annotations like segmentation masks and question–answer pairs for robust model benchmarking.
- The dataset advances AI research by providing high-quality, expert-validated images that support innovative methodologies in diagnostic accuracy, anomaly detection, and multimodal reasoning.
The Kvasir dataset is a collection of annotated gastrointestinal (GI) endoscopic images, designed to support a broad range of research efforts in computer vision, machine learning, and medical image analysis. Developed and continuously extended since its initial release, the Kvasir family of datasets has become a cornerstone resource for benchmarking models in classification, segmentation, anomaly detection, and multimodal reasoning within GI endoscopy.
1. Dataset Composition and Structure
The original Kvasir dataset consists of 4,000 color endoscopic images, balanced across 8 classes that represent anatomical landmarks (the Z-line, pylorus, and cecum), pathological findings (esophagitis, polyps, and ulcerative colitis), and procedure-related views (dyed lifted polyps and dyed resection margins). Each class contains 500 images, with resolutions ranging from 720×576 up to 1920×1072 pixels. All images were labeled and validated by expert endoscopists to ensure diagnostic precision and consistency (1712.03689).
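The balanced eight-class layout described above can be sanity-checked with a short script. This is a hedged sketch: the class folder names below are assumptions about a local copy's directory layout (one folder of JPEGs per class), not guaranteed to match any particular release verbatim.

```python
# Sketch: sanity-check a local copy of the original Kvasir dataset.
# Assumed (hypothetical) layout: <root>/<class-name>/*.jpg, 500 files per class.
from pathlib import Path

KVASIR_CLASSES = [
    # anatomical landmarks
    "z-line", "pylorus", "cecum",
    # pathological findings
    "esophagitis", "polyps", "ulcerative-colitis",
    # procedure-related views
    "dyed-lifted-polyps", "dyed-resection-margins",
]

def class_counts(root: str) -> dict:
    """Count images per class folder under `root` (assumed layout above)."""
    return {c: len(list(Path(root, c).glob("*.jpg"))) for c in KVASIR_CLASSES}

def is_balanced(counts: dict, expected: int = 500) -> bool:
    """True if every class has exactly `expected` images (500 in Kvasir v1)."""
    return all(counts.get(c) == expected for c in KVASIR_CLASSES)
```

A balanced copy would satisfy `is_balanced(class_counts("/path/to/kvasir"))`; any missing or extra files per class fail the check.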
Subsequent expansions resulted in larger datasets:
- Kvasir-SEG: An extension including 1,000 images of polyps with binary pixel-wise ground truth segmentation masks and bounding boxes. Masks were manually annotated and verified by a gastroenterologist (1911.07069).
- Kvasir-Instrument: Comprising 590 endoscopic frames annotated with segmentation masks and bounding boxes for various GI instruments (snares, forceps, balloons, etc.), verified by expert GI endoscopists (2011.08065).
- Kvasir-Capsule: A collection of 47,238 labeled frames (plus millions of unlabeled frames) covering a range of capsule endoscopy findings, including several rare and challenging classes (2504.06039).
- Kvasir-VQA: Integrates 6,500 GI images, each paired with multi-type question–answer annotations for Visual Question Answering (VQA), image captioning, and object detection (2409.01437).
- Kvasir-VQA-x1: Further augments the VQA corpus with 159,549 stratified, complexity-graded question–answer pairs alongside systematic visual perturbations, facilitating multimodal and robust reasoning model development (2506.09958).
The datasets are distributed via public repositories with clear metadata, making them accessible for academic research.
2. Annotation Protocols and Ground Truth Creation
Annotation protocols for Kvasir datasets emphasize clinical accuracy and reproducibility. For class labels (classification tasks), each image is reviewed and assigned to a diagnostic or anatomical category by medical experts. For segmentation, manual pixel-wise annotations are created using tools like Labelbox by engineers and medical doctors, followed by review from experienced gastroenterologists. Segmentation masks are binary images (1-bit depth) indicating polyp or instrument regions, and bounding boxes are provided in structured JSON formats (1911.07069, 2011.08065).
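The two annotation formats mentioned above can be handled with a few lines of code. This is a minimal sketch: mask files are conceptually 1-bit but are often stored as 8-bit grayscale, so a threshold restores the binary ground truth, and the JSON schema shown is a hypothetical illustration rather than the exact released format.

```python
import json

def binarize_mask(gray, threshold=128):
    """Convert an 8-bit grayscale mask (nested lists of pixel values)
    to a 1-bit binary mask, recovering the intended polyp/background labels."""
    return [[1 if px >= threshold else 0 for px in row] for px_row in [None] for row in gray]

def parse_bboxes(json_text):
    """Parse bounding boxes from annotation JSON.
    Hypothetical schema assumed here:
    {image_id: {"bbox": [{"xmin": .., "ymin": .., "xmax": .., "ymax": ..}]}}"""
    data = json.loads(json_text)
    return {
        image_id: [(b["xmin"], b["ymin"], b["xmax"], b["ymax"]) for b in meta["bbox"]]
        for image_id, meta in data.items()
    }
```

In practice one would load the grayscale values with an imaging library (e.g., Pillow) before binarizing; the thresholding logic itself is library-agnostic.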
The Kvasir-VQA and Kvasir-VQA-x1 datasets extend annotation to include natural language question–answer pairs, generated and validated by a combination of LLMs and clinical domain experts. Question complexity is stratified, and answer naturalization ensures human-likeness and contextual correctness (2409.01437, 2506.09958).
This rigorous approach to annotation supports detailed benchmarking of both classical and modern AI models and guarantees the reliability of ground truth in clinical research.
3. Representative Research Uses and Methodologies
The Kvasir datasets have been leveraged for various methodological advances:
- Classification: Early works applied transfer learning with Inception v3, fine-tuning on Kvasir using data augmentation (random rotation, shifting, shearing, zoom, and flips), with precision, recall, F1-score, and accuracy all reaching 0.915 in the 8-class setting (1712.03689).
- Segmentation: Advanced models such as ResUNet++, DPE-Net, Meta-Polyp, and Med-2D SegNet have been evaluated on Kvasir-SEG, showing increasing Dice and mIoU scores (up to 0.959 Dice and 0.921 mIoU for Meta-Polyp; 0.9578 Dice for Med-2D SegNet), with architectural innovations including dual-parallel encoders, multi-scale upsampling, residual connections, and contrastive adaptors (1911.07067, 2412.00888, 2305.07848, 2504.14715).
- Anomaly Detection: Ensemble learning strategies, particularly in capsule endoscopy (Kvasir-Capsule), have employed combinations of autoencoders and supervised models, using losses such as cross-entropy and MSE, yielding an AUC of 76.86% with reduced parameter counts (2504.06039).
- Open Set Recognition (OSR): Techniques like OpenMax have been applied to Kvasir to benchmark models’ capabilities to reject unknown classes in clinical scenarios, with ResNet-50+OpenMax achieving 86.3% OSR accuracy and 94.7% AUROC (2506.18284).
- Self-Supervised and Foundation-Model Approaches: Methods involving Barlow Twins contrastive learning, self-supervised pretext tasks (e.g., inpainting for U-Net), and adaptation of foundation models (e.g., SAM with contrastive adaptors) are applied to overcome the labeled data scarcity and improve domain adaptation in Kvasir and its derivatives (2303.01672, 2110.08776, 2403.10820, 2408.05936).
- Multimodal and VQA Tasks: Kvasir-VQA/-x1 enables evaluation of image captioning (e.g., a BLEU score of 0.0823), VQA, and object detection using standardized metrics (BLEU, ROUGE, CIDEr, IS, FID), supporting clinical reasoning and robustness validation under visual artefacts (2409.01437, 2506.09958).
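The geometric augmentations cited for classification (flips, rotations) can be illustrated in pure Python on a pixel grid; real pipelines would of course use an imaging or augmentation library, so treat this as a didactic sketch only.

```python
def hflip(img):
    """Horizontal flip: reverse each row of pixels."""
    return [row[::-1] for row in img]

def rot90(img):
    """Rotate the pixel grid 90 degrees clockwise:
    reverse the row order, then transpose."""
    return [list(row) for row in zip(*img[::-1])]
```

Applied at training time with random choice and random angles, such transforms multiply the effective size of a 4,000-image dataset without new annotation effort.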
Many studies adopt common evaluation protocols, such as train/val/test splits (typically 70/10/20 or 80/10/10), and use data augmentation to improve robustness and generalization.
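The common 70/10/20 protocol above can be sketched as a seeded shuffle-and-slice; note this simple version does not stratify by class, which many studies additionally do.

```python
import random

def split_dataset(items, ratios=(0.7, 0.1, 0.2), seed=42):
    """Shuffle `items` deterministically and split into train/val/test
    according to `ratios` (default: the common 70/10/20 protocol)."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n = len(items)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return items[:n_train], items[n_train:n_train + n_val], items[n_train + n_val:]
```

Fixing the seed makes splits reproducible across runs, which matters when comparing published results on the same dataset.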
4. Benchmarking and Performance Metrics
Performance is commonly assessed using:
- Classification: Precision, recall, F1-score, specificity, accuracy, MCC, and ROC-AUC (2301.02390, 2304.11529, 2402.02274).
- Segmentation: Dice coefficient (DSC), mean Intersection over Union (mIoU), and for ensemble models, AUC (1911.07067, 2101.04001, 2110.08776, 2305.07848, 2412.00888, 2504.14715).
- VQA and Captioning: BLEU, ROUGE, METEOR, CIDEr, FID, and IS (2409.01437, 2506.09958).
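The two segmentation metrics above have simple set-overlap definitions; a minimal sketch on flat binary masks (real evaluations operate on arrays and average over the test set):

```python
def dice_coefficient(pred, gt):
    """Dice score (DSC) between two flat binary masks (lists of 0/1):
    2*|P ∩ G| / (|P| + |G|). Both masks empty counts as a perfect match."""
    inter = sum(p & g for p, g in zip(pred, gt))
    total = sum(pred) + sum(gt)
    return 1.0 if total == 0 else 2.0 * inter / total

def iou(pred, gt):
    """Intersection over Union (Jaccard index): |P ∩ G| / |P ∪ G|."""
    inter = sum(p & g for p, g in zip(pred, gt))
    union = sum(p | g for p, g in zip(pred, gt))
    return 1.0 if union == 0 else inter / union
```

Dice weights the intersection twice, so for the same prediction it is never lower than IoU; both are reported because small lesions penalize boundary errors differently under each.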
Empirical studies document substantial improvements as models evolve. For example, ResUNet++ raised the Dice score on Kvasir-SEG from U-Net's 71.47% to 81.33% (1911.07067), and Med-2D SegNet further advanced to 95.78% DSC with just 2 million parameters (2504.14715). Real-time models (e.g., HarDNet-MSEG at 86.7 FPS) demonstrate feasibility for clinical integration (2101.07172).
5. Extensions and Multimodal Datasets
The Kvasir collection’s growth reflects expanding research needs:
- Kvasir-SEG and Kvasir-Instrument address the lack of pixel-wise segmentation data for automated polyp and instrument detection (1911.07069, 2011.08065).
- Kvasir-Capsule introduces challenges of scale, severe class imbalance, and the need for anomaly detection in capsule endoscopy (2504.06039).
- Kvasir-VQA and VQA-x1 foster development of multimodal vision–LLMs for complex medical reasoning and robust VQA, with layered question complexity and visual perturbations simulating clinical artefacts (2409.01437, 2506.09958).
These additions enable comprehensive training and robust evaluation of AI systems for interdisciplinary applications, from real-time clinical assistance to research in medical reasoning and natural language generation.
6. Impact and Clinical Significance
The Kvasir datasets have directly advanced the state-of-the-art in:
- Automated Detection and Diagnosis: Enabling robust AI models for detection, localization, and classification of GI pathologies, supporting reduction in polyp miss rates and enhancing early cancer detection (1712.03689, 1911.07067, 2412.00888).
- Formative Datasets for Model Development: Supplying realistic, expertly annotated image collections that reflect clinical diversity, facilitating reproducibility and meaningful benchmarking (1911.07069, 2011.08065, 2304.11529).
- Advancing Multimodal Reasoning: Providing structured VQA datasets for the development of AI systems capable of complex, context-aware medical question answering and diagnostic support (2409.01437, 2506.09958).
- Clinical Safety and Robustness: Supporting research in open set recognition, anomaly detection, and label correction to mitigate risks of model overconfidence in unseen or ambiguous cases (2506.18284, 2403.10820).
The datasets’ applicability to zero-shot learning, cross-dataset generalization, and active label correction has further broadened their practical utility for both academic and clinical settings (2504.14715, 2403.10820).
7. Ongoing Challenges and Future Directions
Despite their widespread adoption, several challenges persist:
- Class Imbalance and Data Scarcity: Capsule and rare pathology classes remain underrepresented; methods such as advanced augmentation, focal loss, and self-supervised learning are important for mitigation (2303.01672, 2504.06039).
- Segmentation of Small Lesions and Fine Boundaries: Developing architectures capable of precise delineation of subtle and small polyps remains a key research focus (2305.07848, 2408.05936).
- Robustness to Artefacts and Open World Settings: New datasets (e.g., Kvasir-VQA-x1) take steps to address real-world artefacts and out-of-distribution samples, setting benchmarks for model robustness (2506.09958, 2506.18284).
- Clinical Integration and Interpretability: Although explainability techniques like Grad-CAM and SHAP have been applied (2301.02390), further efforts are necessary to ensure trustworthiness and regulatory compliance in deployment.
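Of the imbalance mitigations listed above, focal loss is the easiest to show concretely. A hedged single-prediction sketch of the standard binary form, FL(p_t) = -α(1 - p_t)^γ log(p_t), which down-weights well-classified examples so rare classes dominate the gradient:

```python
import math

def focal_loss(p, target, alpha=0.25, gamma=2.0):
    """Binary focal loss for one predicted probability `p` of the positive
    class, with true label `target` in {0, 1}. With gamma=0 and alpha=1 this
    reduces to plain cross-entropy; larger gamma suppresses easy examples."""
    p_t = p if target == 1 else 1.0 - p
    return -alpha * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))
```

In a training loop this would be averaged over a batch; frameworks provide vectorized equivalents, so this scalar form is illustrative only.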
Planned expansions—such as additional disease classes, further VQA question complexity, and multi-institutional data collection—are anticipated. This ongoing development, along with the public and FAIR-compliant release model, positions Kvasir datasets as essential infrastructure for future advances in medical imaging AI.