
Kvasir-SEG Dataset

Updated 25 December 2025
  • Kvasir-SEG is a publicly available dataset featuring 1,000 endoscopic colonoscopy images with expert-annotated binary segmentation masks and bounding boxes.
  • The dataset supports reproducible research by providing detailed, pixel-level annotations and standardized data splits for training, validation, and testing.
  • Baseline methods like ResUNet, HarDNet-MSEG, and PolypSeg-GradCAM demonstrate high segmentation performance with improved explainability via Grad-CAM overlays.

Kvasir-SEG is a publicly available, expert-annotated dataset for pixel-level gastrointestinal polyp segmentation in endoscopic colonoscopy images. As an extension of the original Kvasir dataset, which provided only frame-level diagnostic labels, Kvasir-SEG adds manually delineated binary segmentation masks and bounding-box metadata, creating a reproducible resource for both classical and state-of-the-art data-driven algorithm development in computer vision for medical image analysis (Jha et al., 2019).

1. Dataset Composition and Annotation Protocol

Kvasir-SEG comprises 1,000 real clinical colonoscopy frames, each paired with a high-fidelity, single-channel (1-bit) binary segmentation mask that distinguishes polyp tissue (foreground, white) from non-polyp mucosa (background, black). The images are stored as JPEGs at their original acquisition resolutions, with no imposed resizing at the dataset level; resolutions span from 576×720 to 1920×1072 pixels, accommodating the heterogeneity of endoscopic imaging setups (Asare et al., 17 Sep 2025).

Polyp outlines were manually traced by a team consisting of a medical doctor and an engineer using the Labelbox platform, then reviewed and verified by an experienced gastroenterologist for clinical rigor. An annotation export containing region-of-interest polygon coordinates was processed by a script to generate the definitive binary masks. For each mask, a minimum-sized axis-aligned bounding box was computed and collected in a JSON file, using the format (“filename”, [x_min, y_min, x_max, y_max]). The directory structure separates images (/images/), masks (/masks/), and bounding box metadata (bboxes.json) (Jha et al., 2019).

Ground-truth masks are supplied as single-channel PNGs; background pixels have value 0, and polyp pixels have value 255 to align with widespread conventions in biomedical segmentation pipelines (Asare et al., 17 Sep 2025). The dataset captures diverse polyp morphologies, including pedunculated, sessile, flat, and varying size ranges (<5 mm to >10 mm), as well as challenging imaging conditions (e.g., strong reflections, mucosal folds, low contrast) (Asare et al., 17 Sep 2025).
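Given the 0/255 mask convention, the bboxes.json entries can be reproduced from a mask with a few lines of NumPy. The helper below is an illustrative sketch, not part of the official tooling:

```python
import numpy as np

def mask_to_bbox(mask: np.ndarray):
    """Minimum axis-aligned bounding box of a binary mask.

    `mask` is a single-channel array with background 0 and polyp 255,
    matching the Kvasir-SEG mask convention. Returns
    [x_min, y_min, x_max, y_max], the format used in bboxes.json.
    """
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None  # frame contains no polyp pixels
    return [int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())]

# Tiny synthetic mask: a 3x4 polyp region inside a 10x10 frame.
mask = np.zeros((10, 10), dtype=np.uint8)
mask[2:5, 3:7] = 255
bbox = mask_to_bbox(mask)  # [3, 2, 6, 4]
```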

2. Standardized Data Splits and Preprocessing Protocols

To facilitate reproducibility, the canonical split divides the 1,000 images into 800 for training, 100 for validation, and 100 for testing (80/10/10 stratification) based on variation in polyp size and appearance, though no patient-level de-duplication is guaranteed (Jha et al., 2019). Some subsequent works follow alternative splits, such as 880/120 or union with other datasets for expanded training (Huang et al., 2021).
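The official 800/100/100 file lists ship with the dataset; the sketch below merely reproduces the 80/10/10 proportions with a seeded shuffle, as an illustrative stand-in rather than the canonical split:

```python
import random

def split_filenames(filenames, seed=42):
    """Split 1,000 filenames 800/100/100 with a seeded shuffle.

    Illustrative stand-in only: the canonical Kvasir-SEG split is
    published with the dataset and should be used for benchmarking.
    """
    names = sorted(filenames)          # deterministic starting order
    rng = random.Random(seed)
    rng.shuffle(names)
    return names[:800], names[800:900], names[900:1000]
```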

For baseline CNN experiments, images and masks are typically resized to 320×320 or 256×256 using bicubic interpolation (images) and nearest-neighbor (masks), with normalization to the [0,1] interval. Augmentation strategies include random rotation (±15°), scaling, horizontal/vertical flipping, random cropping, brightness jitter, and occlusion-based schemes (cutout, random erasing) to simulate visual variance and mitigate overfitting (Jha et al., 2019, Asare et al., 17 Sep 2025).
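Images are usually resized with a library call (e.g., bicubic interpolation in Pillow or OpenCV); the mask path and the normalization, however, fit in a few lines of plain NumPy. A minimal sketch (helper names are ours):

```python
import numpy as np

def nearest_resize(mask, out_h, out_w):
    # Nearest-neighbour resampling, appropriate for label masks
    # because it never invents intermediate label values.
    in_h, in_w = mask.shape
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return mask[rows[:, None], cols]

def normalize(image):
    # Map 8-bit intensities onto the [0, 1] interval.
    return image.astype(np.float32) / 255.0

mask = np.array([[0, 255], [255, 0]], dtype=np.uint8)
big = nearest_resize(mask, 4, 4)   # still contains only 0 and 255
```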

3. Baseline Polyp Segmentation Approaches

Kvasir-SEG’s benchmarking initiative evaluates both classical and deep learning segmentation strategies:

3.1. Fuzzy C-Means (FCM) Clustering

A classical computer vision pipeline transforms each image to grayscale, applies median-based Otsu thresholding and edge enhancement, followed by morphological dilation. Pixels are clustered (C=2) by FCM for polyp vs. background assignment, operating image-wise without any learned parameters (Jha et al., 2019).
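The clustering step of this pipeline can be sketched from scratch in NumPy; the function below implements standard fuzzy C-means on 1-D grayscale intensities with C=2 (the thresholding, edge-enhancement, and dilation stages are omitted, and the function name is ours):

```python
import numpy as np

def fuzzy_cmeans_1d(values, c=2, m=2.0, iters=50, seed=0):
    """Plain fuzzy C-means on a 1-D array of grayscale intensities.

    Returns the cluster centers and the fuzzy membership matrix U
    of shape (n_pixels, c).
    """
    rng = np.random.default_rng(seed)
    u = rng.random((values.shape[0], c))
    u /= u.sum(axis=1, keepdims=True)          # memberships sum to 1
    for _ in range(iters):
        um = u ** m                            # fuzzified memberships
        centers = (um * values[:, None]).sum(axis=0) / um.sum(axis=0)
        dist = np.abs(values[:, None] - centers[None, :]) + 1e-9
        inv = dist ** (-2.0 / (m - 1.0))       # standard FCM update
        u = inv / inv.sum(axis=1, keepdims=True)
    return centers, u

# Toy intensities: dark mucosa vs. a bright polyp highlight.
pixels = np.array([10.0, 12.0, 14.0, 200.0, 205.0, 210.0])
centers, u = fuzzy_cmeans_1d(pixels)
labels = u.argmax(axis=1)   # hard polyp/background assignment
```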

3.2. ResUNet

A deep U-shaped CNN encoder-decoder architecture comprising five downsampling and five upsampling blocks with residual connections; each block doubles the feature depth on the way down (64→128→256→512→1024) and reverses the pattern on the way up. Nonlinearities are ReLU, with a final sigmoid activation for binary mask prediction. The network is trained using Dice loss and the Nadam optimizer (learning rate 1e-4, β₁=0.9, β₂=0.999), with thresholding at t=0.5 applied to the probability map for binarization (Jha et al., 2019).
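The Dice-loss objective and the t=0.5 binarization step can be written framework-agnostically. A NumPy sketch (in practice the loss lives inside the training framework's autograd graph):

```python
import numpy as np

def soft_dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss on sigmoid probabilities.

    `pred` holds probabilities in [0, 1]; `target` is the binary
    ground-truth mask. The loss is 1 - Dice, so a perfect
    prediction scores 0.
    """
    intersection = (pred * target).sum()
    return 1.0 - (2.0 * intersection + eps) / (pred.sum() + target.sum() + eps)

def binarize(prob_map, t=0.5):
    # Threshold the probability map at t for the final binary mask.
    return (prob_map >= t).astype(np.uint8)
```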

3.3. PolypSeg-GradCAM U-Net

PolypSeg-GradCAM implements a four-stage U-Net (64→128→256→512 feature widths) with skip connections, trained using Adam (lr 1e-4) and a soft Dice loss; masks and images are resized to 256×256 and normalized. The architecture’s capability for explainability is enhanced via Grad-CAM overlays derived from the final decoder layer, spatially aligning activation heatmaps with regions influencing network prediction, thus supporting clinical verification (Asare et al., 17 Sep 2025).
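The activations and gradients themselves are captured with framework hooks; the Grad-CAM reduction applied to them, however, is a short computation. A NumPy sketch of just that step (function name is ours):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heat-map from one layer's activations and gradients.

    `activations` and `gradients` have shape (C, H, W). Channel
    weights are the spatially averaged gradients; the map is the
    ReLU of the weighted activation sum, scaled to [0, 1].
    """
    weights = gradients.mean(axis=(1, 2))                    # (C,)
    cam = (weights[:, None, None] * activations).sum(axis=0)
    cam = np.maximum(cam, 0.0)                               # ReLU
    if cam.max() > 0:
        cam /= cam.max()                                     # scale to [0, 1]
    return cam

# Toy example: one positively and one negatively weighted channel.
acts = np.ones((2, 3, 3))
grads = np.stack([np.full((3, 3), 1.0), np.full((3, 3), -0.5)])
cam = grad_cam(acts, grads)
```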

3.4. HarDNet-MSEG

HarDNet-MSEG employs a HarDNet68 backbone (featuring the memory-efficient HarDBlock) and a cascaded partial decoder. The encoder achieves ∼30% faster inference compared to DenseNet/ResNet by minimizing memory traffic. Decoder design partially discards high-resolution shallow features, instead aggregating information from deeper layers via element-wise fusion. Receptive Field Blocks (RFB) on skip connections provide multi-scale context, benefiting boundary localization and small polyp detection. Input images are resized to 312×312 or 512×512, and extensive augmentation is used (Huang et al., 2021).

4. Evaluation Metrics and Quantitative Performance

Segmentation results in Kvasir-SEG studies are reported using the following metrics, with $P$ and $G$ denoting the predicted and ground-truth masks:

  • Intersection over Union (IoU): $\mathrm{IoU} = \frac{|P \cap G|}{|P \cup G|}$
  • Dice coefficient (F-score): $\mathrm{Dice} = \frac{2|P \cap G|}{|P| + |G|}$
  • Pixel accuracy: $\mathrm{Accuracy} = \frac{\sum_i \mathbf{1}(P_i = G_i)}{\text{total pixels}}$
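All three metrics are direct set computations on binarized masks and can be evaluated in a few lines of NumPy:

```python
import numpy as np

def iou(p, g):
    # |P ∩ G| / |P ∪ G| on binarized masks.
    union = np.logical_or(p, g).sum()
    return np.logical_and(p, g).sum() / union if union else 1.0

def dice(p, g):
    # 2|P ∩ G| / (|P| + |G|).
    denom = p.sum() + g.sum()
    return 2.0 * np.logical_and(p, g).sum() / denom if denom else 1.0

def pixel_accuracy(p, g):
    # Fraction of pixels where prediction and ground truth agree.
    return (p == g).mean()
```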

Additional metrics reported in recent studies include precision, recall, F₂‐score, and overall accuracy (all measured on binarized predictions) (Huang et al., 2021, Asare et al., 17 Sep 2025):

Method              Dice   IoU    Precision  Recall  Acc    FPS
FCM                 0.239  0.314
ResUNet             0.788  0.778                            15
U-Net [ResNet34]    0.876  0.810  0.944      0.860   0.968  35
PraNet              0.898  0.840                            66
HarDNet-MSEG        0.904  0.848  0.907      0.923   0.969  86.7
PolypSeg-GradCAM    0.961  0.926

The original ResUNet baseline (Dice 0.788, IoU 0.778) is substantially outperformed by subsequent models such as HarDNet-MSEG and PolypSeg-GradCAM. PolypSeg-GradCAM attains Dice ≈ 0.961 and IoU ≈ 0.926 on the test split, demonstrating robust performance on challenging polyp morphologies (Asare et al., 17 Sep 2025). HarDNet-MSEG combines a high Dice score (0.904) with real-time inference (86.7 FPS) (Huang et al., 2021).

5. Explainability, Failure Modes, and Clinical Relevance

Recent frameworks like PolypSeg-GradCAM introduce explainability for AI-based segmentation with Grad-CAM overlays, providing insight into the network’s attention during prediction. In straightforward cases, activation heatmaps align with annotated polyp boundaries; for ambiguous or flat polyps, attention occasionally diffuses into adjacent mucosal structures, highlighting scenarios requiring further clinical review (Asare et al., 17 Sep 2025). The prevalence of artifacts (specular reflections), class imbalance (polyps occupy <10% of pixels), and small object detection present persistent challenges, addressed via skip connections, soft Dice loss, and extensive data augmentation in these models.

6. Applications and Prospective Extensions

Kvasir-SEG underpins research in:

  • Benchmarking automated polyp segmentation: Establishing directly comparable evaluation of classical and deep learning methods.
  • Localization, morphological analysis, and measurement: Facilitating algorithms for polyp localization, size estimation, and shape context extraction.
  • Transfer, semi-supervised, and multimodal learning: Supporting use cases such as domain adaptation and learning leveraging both annotated and unannotated data.
  • Explainable AI in clinical contexts: Deployment of explainable segmentation (e.g., via Grad-CAM) aids in the clinical trust and acceptance of automated decision-support tools (Asare et al., 17 Sep 2025).

Proposed future directions include integration of additional polyp pathologies (ulcers, bleeding), multi-center and video-based expansions for temporal context, transformer-based and federated architectures for improved generalization, and advanced attribution mechanisms for fine-grained interpretability (Asare et al., 17 Sep 2025).

7. Impact and Availability

Kvasir-SEG is recognized as a carefully curated reference dataset for polyp segmentation, enabling reproducible research and accelerating the development of clinically viable, real-time endoscopy analysis tools. The dataset, code, and benchmarks are hosted at https://datasets.simula.no/kvasir-seg/, and users are asked to cite the founding publication (Jha et al., MMM 2020, LNCS) (Jha et al., 2019).

In summary, Kvasir-SEG provides a platform for robust, explainable, and clinically relevant algorithmic development in gastrointestinal image analysis, catalyzing advances in automated colorectal cancer screening technologies.
