Polyp Detection & Segmentation

Updated 5 May 2026

Polyp detection and segmentation is the process of identifying and delineating polyp regions in colonoscopy images using advanced deep learning techniques.
State-of-the-art approaches integrate CNNs, transformers, and hybrid architectures to enhance boundary precision, real-time performance, and resilience against imaging artifacts.
Robust datasets and standardized metrics like Dice and IoU drive benchmarking and clinical translation, addressing challenges such as domain shifts and small/flat polyp detection.

Automatic polyp detection and segmentation are critical components in the prevention and early diagnosis of colorectal cancer, enabling real-time assistance during colonoscopy and supporting clinical decision-making by providing precise region delineations. Polyp segmentation, defined as the pixel-wise identification of polyp regions within colonoscopic images or videos, has evolved rapidly with the advent of deep learning, shifting from hand-crafted feature systems to hybrid architectures capable of robust, real-time performance across diverse clinical settings. This article synthesizes significant developments, methodologies, datasets, architectural paradigms, benchmarks, and remaining challenges in polyp detection and segmentation, with an emphasis on state-of-the-art (SOTA) research.

1. Problem Definition and Clinical Motivation

Automatic polyp detection refers to the localization of candidate polyp regions (often via bounding boxes), whereas segmentation provides dense binary masks corresponding to polyp tissue in image or video frames. Their synergistic integration supports both diagnostic (e.g., polyp size/shape analysis) and interventional workflows (e.g., margin delineation for resection). The clinical imperative is underscored by high miss rates, especially for small, flat, or serrated polyps, which remain a significant risk factor for interval cancers. Automated systems must contend with substantial heterogeneity—polyps vary widely in size, texture, and morphology, and colonoscopy images are fraught with artifacts (motion blur, fluid occlusion, specular highlights), variable illumination, and inter-patient/device domain shifts [2311.18373].

2. Datasets, Annotation Protocols, and Evaluation

Robust evaluation depends on large-scale, representative datasets with quality-assured annotations.

2.1 Key Datasets

Kvasir-SEG: 1,000 images with expert-delineated masks; stratified by polyp size.
CVC-ClinicDB, CVC-ColonDB, ETIS: Widely used for benchmarking; contain 612, 380, and 196 images, respectively.
PolypGen [2106.04463]: 8,037 frames from six centers; includes single images and short video sequences with 3,762 labeled polyps; designed for generalizability studies and domain shift assessment.
PolypDB [2409.00045]: 3,934 images from three continents, spanning five imaging modalities (WLI, NBI, LCI, BLI, FICE), and enabling federated learning experiments.
SUN-SEG, PICCOLO, PolypSegm-ASH: >150,000 annotated video frames and large static image sets for foundation model training and cross-domain evaluation [2503.24138].

Annotation is generally performed with dedicated tools (e.g., the Labelbox platform, Wacom pen tablets), with multi-expert consensus verification and explicit exclusion of ambiguous cases or frames with inadequate bowel preparation.

2.2 Metrics

Segmentation: Dice coefficient, IoU (Jaccard index), precision, recall, Fβ-score, mean absolute error (MAE), boundary-specific metrics (Hausdorff distance, normalized surface Dice).
Detection: Mean average precision (mAP) across multiple IoU thresholds, recall, precision.
Structure-aware measures: S_α (structure-measure), E_ξ (enhanced-alignment) for capturing spatial and region-level accuracy [2309.05987].
Video: Temporal stability and tracking quality, the latter measured with multi-object tracking metrics (e.g., HOTA, MOTA, IDF1) [2503.24108].
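As a minimal sketch of the overlap metrics listed above, the following pure-Python functions compute Dice, IoU, precision, recall, Fβ, and MAE from two binary masks. The function names, the nested-list mask representation, and the eps smoothing constant are illustrative assumptions rather than part of any cited benchmark toolkit; published protocols differ in details such as thresholding and per-image versus per-dataset averaging.

```python
def confusion_counts(pred, target):
    """Count TP, FP, FN, TN over two binary masks given as nested lists of 0/1."""
    tp = fp = fn = tn = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                tp += 1
            elif p:
                fp += 1
            elif t:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def segmentation_metrics(pred, target, beta=1.0, eps=1e-8):
    """Return common overlap metrics for one predicted/ground-truth mask pair.

    eps (an assumption of this sketch) avoids division by zero; with two empty
    masks every overlap metric evaluates to 0, although conventions differ.
    """
    tp, fp, fn, tn = confusion_counts(pred, target)
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    b2 = beta * beta
    return {
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
        "iou": tp / (tp + fp + fn + eps),
        "precision": precision,
        "recall": recall,
        "f_beta": (1 + b2) * precision * recall / (b2 * precision + recall + eps),
        # For binary masks, MAE reduces to the fraction of mismatched pixels.
        "mae": (fp + fn) / (tp + fp + fn + tn),
    }
```

For a single mask pair the two headline metrics are tied by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU; averaged values such as mDice and mIoU need not satisfy this identity exactly.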

3. Methodological Landscape

3.1 Classical Approaches

Prior to deep learning, segmentation relied on thresholding, color/texture descriptors, edge detection, curvature-based shape analysis, and SVM/MLP classifiers using engineered features. These struggled with robustness, especially under challenging imaging conditions or for unconventional polyp presentations [1609.01915].

3.2 Deep Learning Advances

3.2.1 Convolutional Encoder–Decoder Networks

U-Net and Variants: The U-Net topology, an encoder–decoder architecture with skip connections, has served as a workhorse for medical image segmentation. Extensions include residual connections (ResUNet, ResUNet++), squeeze-and-excitation (SE) blocks, and atrous spatial pyramid pooling (ASPP) [2311.18373].
Attention Augmentation: Channel and spatial attention modules (e.g., ACSNet, SANet) and reverse attention mechanisms (e.g., PraNet) selectively enhance discriminative regions [2309.05987].
Boundary and Refinement Modules: Boundary-aware attention and explicit boundary loss terms (e.g., in SFA) target accurate margin localization, addressing ambiguity at polyp–mucosa interfaces [2508.09189].
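The reverse attention idea mentioned above can be illustrated with a small sketch: features are re-weighted by one minus the sigmoid of a coarse prediction, so that regions the coarse map already marks confidently as polyp are suppressed and the remaining detail (typically near boundaries) is emphasized. This is a single-channel, pure-Python illustration with hypothetical function names; it is not the PraNet implementation, which applies the same principle to multi-channel feature maps at several decoder stages.

```python
import math


def sigmoid(x):
    """Logistic function mapping a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def reverse_attention(features, coarse_logits):
    """Scale each feature by (1 - sigmoid(coarse logit)) at the same location.

    features and coarse_logits are equally sized nested lists of floats.
    Locations with a confident foreground prediction get a weight near 0;
    uncertain or background locations keep most of their response.
    """
    return [
        [f * (1.0 - sigmoid(z)) for f, z in zip(f_row, z_row)]
        for f_row, z_row in zip(features, coarse_logits)
    ]
```

In a full network the re-weighted features would usually feed a small convolutional head whose output is added back to the coarse map as a residual correction.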

3.2.2 Transformer and Hybrid Architectures

Pure Vision Transformers (ViTs): Encoders such as Pyramid Vision Transformer (PVT) and Swin Transformer, along with transformer-based frameworks such as MaskDINO, model long-range semantic dependencies and outperform CNNs in global context modeling [2309.05987]; an explicit multi-scale design via pyramid stages is essential for scale robustness.
CNN–Transformer Hybrids: These architectures run local (CNN) and global (Transformer) branches in parallel and fuse their outputs via gated or attention mechanisms, yielding both accurate region prediction and sharp boundaries. Shifted-window self-attention, adaptive feature fusion, and learnable gating weights improve performance under imaging artifacts [2508.09189].
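A gated fusion of a local and a global branch, as described above, can be sketched as a per-location convex combination whose mixing weight is itself computed from the two inputs. The scalar parameters w_local, w_global, and bias are hypothetical stand-ins for what would be learned convolutional weights in a real model; nothing here reproduces the specific fusion module of any cited paper.

```python
import math


def gated_fusion(local_feat, global_feat, w_local=1.0, w_global=1.0, bias=0.0):
    """Fuse two equally sized single-channel feature maps with a sigmoid gate.

    gate  = sigmoid(w_local * l + w_global * g + bias)
    fused = gate * l + (1 - gate) * g
    """
    fused = []
    for l_row, g_row in zip(local_feat, global_feat):
        row = []
        for l, g in zip(l_row, g_row):
            gate = 1.0 / (1.0 + math.exp(-(w_local * l + w_global * g + bias)))
            row.append(gate * l + (1.0 - gate) * g)
        fused.append(row)
    return fused
```

With all gate parameters at zero the gate is 0.5 everywhere and the fusion reduces to a plain average; a large positive bias pushes the output toward the local (CNN) branch.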

3.2.3 Multi-Stage and Refinement Systems

Cascaded Encoder–Decoder (DoubleU-Net, DoubleEDN): Two-stage (or multi-stage) networks leverage initial coarse masks to guide subsequent refinement, shown to consistently improve Dice/IoU metrics, especially in out-of-distribution tests [2110.01939].
Collaborative Refinement & Integrated Segmentation (CRIS): Tight coupling of backbone prediction and mask refinement under alternating dual-loss regimes minimizes speckle noise and sharpens regions (CRIS achieves 92.62% Dice on CVC-ClinicDB with a U-Net backbone) [2405.19672].
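The coarse-to-fine principle behind cascaded systems can be sketched as follows: a first stage produces a coarse mask, the input is gated by that mask, and a second stage refines the gated input. The stage callables are hypothetical placeholders for trained networks, and multiplying the image by the coarse mask is only one common way of passing guidance between stages; this sketch does not reproduce the exact wiring of DoubleU-Net, DoubleEDN, or CRIS.

```python
def cascade_segment(image, coarse_stage, refine_stage):
    """Run a two-stage cascade on a single-channel image (nested lists of floats).

    coarse_stage(image)  -> coarse mask with values in [0, 1]
    refine_stage(gated)  -> refined mask computed from the mask-gated image
    Both callables are assumptions of this sketch.
    """
    coarse = coarse_stage(image)
    # Suppress pixels the first stage considers background before refinement.
    gated = [
        [pixel * weight for pixel, weight in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, coarse)
    ]
    refined = refine_stage(gated)
    return coarse, refined
```

Returning both masks makes it possible to supervise each stage separately during training or to compare coarse and refined outputs at evaluation time.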

3.2.4 Wavelet and Frequency Domain Models

Wavelet Cross-Band Integration: Dual-encoder architectures separately process grayscale and RGB signals, integrating via band-specific attention in the wavelet domain, yielding superior boundary precision (Dice up to 0.926 on CVC-ClinicDB) [2603.03682].
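To make the notion of wavelet sub-bands concrete, the sketch below performs a single-level 2D Haar decomposition of a grayscale image into one low-frequency band and three detail bands. It illustrates only the decomposition on which band-specific processing could operate; the attention and dual-encoder components of the cited model are not shown, and the band labels used here follow one of several naming conventions in use.

```python
def haar_dwt2(image):
    """Single-level orthonormal 2D Haar transform.

    image: nested list of floats with even height and width.
    Returns a dict of four half-resolution bands:
      LL: scaled local average        LH: top row minus bottom row
      HL: left column minus right     HH: diagonal difference
    """
    height, width = len(image), len(image[0])
    if height % 2 or width % 2:
        raise ValueError("height and width must be even")
    bands = {"LL": [], "LH": [], "HL": [], "HH": []}
    for y in range(0, height, 2):
        rows = {name: [] for name in bands}
        for x in range(0, width, 2):
            a, b = image[y][x], image[y][x + 1]
            c, d = image[y + 1][x], image[y + 1][x + 1]
            rows["LL"].append((a + b + c + d) / 2)
            rows["LH"].append((a + b - c - d) / 2)
            rows["HL"].append((a - b + c - d) / 2)
            rows["HH"].append((a - b - c + d) / 2)
        for name in bands:
            bands[name].append(rows[name])
    return bands
```

A flat region produces energy only in LL, whereas an edge produces a non-zero response in one of the detail bands, which is why frequency-domain features are attractive for boundary localization.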

3.2.5 Foundation Models and Large-Scale Pretraining

SAM, MedSAM, GroundingDINO, DINOv2: Foundation models pretrained on multi-domain or medical-only corpora enable powerful zero- and few-shot generalization; domain adaptation via fine-tuning is generally required for optimal polyp segmentation [2503.24138]. MedSAM, a medical adaptation of SAM, is reported to surpass generic SAM in both boundary accuracy and IoU on the benchmarks evaluated.

3.2.6 Sequence and Temporal Reasoning

Video-based Segmentation and Tracking: Models that explicitly leverage temporal continuity (memory bank attention, temporal context transformers, and query-based unsupervised tracking) significantly improve segmentation in realistic video workflows (e.g., PolypSegTrack achieves Dice=91.4% on ETIS and HOTA=53.2 on tracking) [2503.24108, 2603.04288].
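Memory bank attention can be sketched generically as scaled dot-product attention from the current frame's query vector over key/value pairs stored from earlier frames. The MemoryBank class below, its fixed capacity, and its first-in-first-out eviction policy are illustrative assumptions; the cited video models operate on dense feature maps and may select or compress memory entries quite differently.

```python
import math
from collections import deque


class MemoryBank:
    """Fixed-size FIFO store of (key, value) vectors from previous frames."""

    def __init__(self, capacity=5):
        self.keys = deque(maxlen=capacity)
        self.values = deque(maxlen=capacity)

    def write(self, key, value):
        """Store features of a processed frame; the oldest entry is evicted when full."""
        self.keys.append(key)
        self.values.append(value)

    def read(self, query):
        """Return the softmax-weighted average of stored values for a query vector."""
        if not self.keys:
            raise ValueError("memory bank is empty")
        scale = math.sqrt(len(query))
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale for key in self.keys
        ]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        dim = len(self.values[0])
        return [
            sum(w * value[i] for w, value in zip(weights, self.values))
            for i in range(dim)
        ]
```

In use, each new frame would first call read with its own features to retrieve temporal context and then call write so that later frames can attend to it.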

3.2.7 Synthetic and Weakly-Supervised Paradigms

Synthetic Data Augmentation: Stable Diffusion-based generators and cut-based image translation mitigate annotation scarcity; pseudo-labeling and semi-/self-supervised learning via synthetic–real domain alignment achieve competitive results with minimal manual labels [2307.12033, 2508.06170].
Bounding Box-Only Supervision: Self-prompting architectures (YOLOv8→SAM2) use detection outputs as prompts for a promptable mask generator, achieving high segmentation accuracy with a >10× reduction in annotation time [2409.09484].
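The self-prompting pattern described above, in which a detector's boxes are passed as prompts to a promptable segmenter, can be sketched with two placeholder callables. The names detect and segment_with_box, the (x0, y0, x1, y1) box convention, and the min_score threshold are hypothetical; they do not correspond to the actual YOLOv8 or SAM2 programming interfaces, which should be consulted directly.

```python
def self_prompt_segment(image, detect, segment_with_box, min_score=0.5):
    """Combine a box detector and a box-prompted segmenter into one mask.

    detect(image)                -> iterable of ((x0, y0, x1, y1), score)
    segment_with_box(image, box) -> binary mask the same size as image
    Both callables are stand-ins for trained models (assumptions of this sketch).
    """
    height, width = len(image), len(image[0])
    combined = [[0] * width for _ in range(height)]
    for box, score in detect(image):
        if score < min_score:
            continue  # skip low-confidence detections instead of prompting with them
        mask = segment_with_box(image, box)
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    combined[y][x] = 1  # union of all per-box masks
    return combined
```

Because only the detector needs box-level labels, the resulting masks can also serve as pseudo-labels for training a lighter segmentation network.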

4. Benchmarks and Quantitative Comparisons

4.1 Static Image Segmentation

FLDNet: On CVC-ClinicDB, achieves mDice=0.905, mIoU=0.848, outperforming the CNN and earlier ViT methods it is compared against [2309.05987].
Hybrid Transformer+CNN: Maintains >0.90 DSC and >0.88 mIoU on multiple datasets, demonstrating superior recall on Kvasir-SEG (0.9555) [2508.09189].
Wavelet-Integrated Model: Achieves Dice=0.926, IoU=0.862 on CVC-ClinicDB, outperforming the non-frequency-based baselines it is compared against [2603.03682].
DDANet: Dice=0.7874 on challenge hold-out; real-time inference at >69 FPS [2012.15245].
Foundation Models: MedSAM yields mIoU=0.835 on PolypSegm-ASH, outperforming standard Mask R-CNN; the joint GroundingDINO+MedSAM pipeline achieves the best performance (mIoU up to 0.885) [2503.24138].

4.2 Video and Temporal Segmentation

Yolo-SAM2 Self-Prompting: mDice=0.808, mIoU=0.678 on PolypGen videos, with gains of +29.2% and +20.7% over the best prior method [2409.09484].
PolypSegTrack: Unified detection/segmentation/tracking with Dice=94.7% (Kvasir-SEG) and 91.4% (ETIS), and HOTA=53.2 on tracking [2503.24108].
EndoCV2022 Challenge: Temporal methods yield the highest mean segmentation Dice (0.787), outpacing static models by >10% [2603.04288].

4.3 Multi-Architecture Evaluations

Synthetic Data-Driven Pipelines: Integration of Faster R-CNN detection and SAM refinement yields F1=90.98%, mask IoU=64.20% (LinkNet), Dice=77.53% [2508.06170].
CRIS (Refinement): Consistently improves Dice by ≥6.5% over base backbones across Kvasir-SEG and CVC-ClinicDB [2405.19672].

4.4 Large-Scale and Multi-Center Generalization

PolypDB Benchmarks: SSFormer-L achieves mIoU=0.8821 (WLI), Dice=0.9294; detection mAP_50 up to 0.925 (YOLOv6) [2409.00045].
PolypGen: DeepLabV3+ reaches DSC=0.82 on out-of-sample center; cross-center generalizability remains a bottleneck for single-site-trained models [2106.04463].
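Detection figures such as mAP_50 rest on box IoU and a precision-recall curve. The sketch below computes average precision at a single IoU threshold for one image and one class, using greedy score-ordered matching and all-point interpolation. It is a simplified illustration under those stated assumptions; benchmark toolkits aggregate over a whole dataset and differ in matching and interpolation details, so values from this sketch are not directly comparable to published numbers.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, gt_boxes, iou_thr=0.5):
    """AP for one image: detections is a list of (box, score), gt_boxes a list of boxes."""
    if not gt_boxes:
        return 0.0
    matched, hits = set(), []
    # Visit detections from most to least confident; each ground truth matches once.
    for box, _score in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(gt_boxes):
            if idx in matched:
                continue
            iou = box_iou(box, gt)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx is not None and best_iou >= iou_thr:
            matched.add(best_idx)
            hits.append(1)
        else:
            hits.append(0)
    precisions, recalls, true_pos = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        true_pos += hit
        precisions.append(true_pos / rank)
        recalls.append(true_pos / len(gt_boxes))
    # Make precision non-increasing in recall (all-point interpolation).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

Averaging this quantity over several IoU thresholds gives the multi-threshold mAP mentioned in the metrics section.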

5. Training Protocols, Loss Functions, and Efficiency

Losses: Compound loss formulations are standard, combining cross-entropy (possibly weighted), Dice loss (pixel or boundary-aware), and custom boundary/region losses. Edge-aware weighting is often applied at boundary pixels [2309.05987, 2508.09189].
Optimizers: Adam/AdamW or SGD variants; early stopping or learning rate decay based on validation Dice.
Data Augmentation: Rotations, flips, color/brightness variation, scale/affine transforms, coarse dropout, and mosaic/mixup for detection.
Inference Speed: Modern models deliver real-time rates (≥45 FPS, MKDCNet; >69 FPS, DDANet; up to 182 FPS, ColonSegNet) on high-end GPUs [2206.06264, 2012.15245, 2011.07631].
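The compound loss described under "Losses" above can be sketched as a boundary-weighted binary cross-entropy term plus a soft Dice term. The weighting scheme (a constant boundary_weight on pixels whose 4-neighbourhood contains the opposite label), the dice_weight factor, and the eps constant are illustrative assumptions; the cited papers use their own weighting maps and term balances.

```python
import math


def compound_loss(probs, target, boundary_weight=2.0, dice_weight=1.0, eps=1e-7):
    """Boundary-weighted BCE plus soft Dice loss for one image.

    probs:  nested list of predicted foreground probabilities in [0, 1]
    target: nested list of ground-truth labels (0 or 1), same shape
    """
    height, width = len(target), len(target[0])
    bce_sum = weight_sum = 0.0
    intersection = prob_sum = target_sum = 0.0
    for y in range(height):
        for x in range(width):
            p = min(max(probs[y][x], eps), 1.0 - eps)  # clamp to keep log finite
            t = target[y][x]
            # A pixel is on the boundary if any 4-neighbour has a different label.
            on_boundary = any(
                0 <= ny < height and 0 <= nx < width and target[ny][nx] != t
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
            )
            weight = boundary_weight if on_boundary else 1.0
            bce_sum += -weight * (t * math.log(p) + (1 - t) * math.log(1.0 - p))
            weight_sum += weight
            intersection += p * t
            prob_sum += p
            target_sum += t
    bce = bce_sum / weight_sum
    dice = 1.0 - (2.0 * intersection + eps) / (prob_sum + target_sum + eps)
    return bce + dice_weight * dice
```

The cross-entropy term gives well-behaved per-pixel gradients, while the Dice term counteracts the foreground-background imbalance typical of frames with small polyps.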

6. Remaining Challenges and Future Directions

Small/Flat Polyp Detection: Performance declines sharply for polyps whose polyp-to-image area ratio is below 0.025; transformer-based and hybrid architectures show improved robustness, but mean Dice for small polyps can be as low as 0.50 [2311.18373].
Boundary Localization: Even recent models can struggle with ill-defined or occluded boundaries; new architectures incorporating explicit frequency domain integration, boundary-aware attention, or mask refinement show measurable improvements [2603.03682, 2508.09189].
Domain Shift and Center Heterogeneity: Generalization across centers, imaging devices, and patient populations remains unsolved; federated learning and synthetic augmentation are active areas [2409.00045].
Temporal Stability: Incorporation of mid/long-range temporal context in video processing is essential for clinical deployment, with memory bank and query-based tracking models offering SOTA performance [2503.24108, 2603.04288].
Annotation Efficiency: Weakly supervised, semi-supervised, and synthetic data-driven approaches are enabling rapid progress, but fully automated annotation remains dependent on robust transfer learning and cross-domain adaptation [2307.12033, 2508.06170].
Clinical Integration: Real-time operation, interpretability, uncertainty estimation, and combined detection-classification-tracking systems are active areas for translational research.
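Size-stratified reporting, which exposes the small-polyp gap noted above, can be sketched by computing each case's polyp-to-image area ratio from its ground-truth mask and averaging Dice separately for small and other polyps. The 0.025 threshold comes from the text; the two-bucket scheme, the function names, and the interpretation of the area ratio as foreground pixels divided by all pixels are assumptions of this sketch.

```python
def polyp_area_ratio(mask):
    """Fraction of pixels labelled as polyp in a binary ground-truth mask."""
    total = sum(len(row) for row in mask)
    foreground = sum(1 for row in mask for value in row if value)
    return foreground / total


def mean_dice_by_size(cases, small_thr=0.025):
    """Average Dice separately for small and other polyps.

    cases: iterable of (ground_truth_mask, dice_score) pairs.
    Returns a dict with keys "small" and "other"; a bucket with no cases maps to None.
    """
    buckets = {"small": [], "other": []}
    for mask, dice in cases:
        key = "small" if polyp_area_ratio(mask) < small_thr else "other"
        buckets[key].append(dice)
    return {
        name: (sum(scores) / len(scores) if scores else None)
        for name, scores in buckets.items()
    }
```

Reporting both buckets side by side prevents strong performance on large polyps from masking failures on the small lesions that matter most clinically.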

7. Resources and Benchmarking

Model Zoo and Data Links: Centralized repositories for trained models, code, and datasets are available (e.g., https://github.com/taozh2017/Awesome-Polyp-Segmentation, https://github.com/DebeshJha/PolypDB).
Benchmark Leaderboards: PolypGen, PolypDB, SUN-SEG, CVC-ClinicDB/ColonDB, ETIS, and Kvasir-SEG offer established reference splits and leaderboards.
Evaluation Protocols: Standardization of metrics, train–validation–test splits, and reporting is critical for reproducible comparison and clinical translation.
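One way to keep train, validation, and test splits reproducible is to assign each case deterministically from a hash of a grouping identifier, so that the split does not depend on file order or random seeds. The sketch below assumes a hypothetical group_id string (for example a patient or video identifier, so that frames from one examination never straddle splits) and illustrative split fractions; datasets that publish official splits should be used with those splits instead.

```python
import hashlib


def assign_split(group_id, val_fraction=0.1, test_fraction=0.2):
    """Map a group identifier to "train", "val", or "test" deterministically.

    The SHA-256 digest of group_id is converted to a number in [0, 1) and
    compared against the cumulative split fractions.
    """
    digest = hashlib.sha256(group_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # first 32 bits scaled to [0, 1)
    if bucket < test_fraction:
        return "test"
    if bucket < test_fraction + val_fraction:
        return "val"
    return "train"
```

Because the assignment depends only on the identifier, adding new cases later never moves existing cases between splits.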

In summary, polyp detection and segmentation have progressed from classical feature engineering to advanced hybrid deep learning architectures with explicit mechanisms for handling multiscale, boundary, and temporal complexities. Recent advances—transformer-powered networks, promptable foundation models, real-time video-centric tracking, and annotation-efficient training—are converging toward robust, deployable systems for clinical colonoscopy. Challenges persist in generalization, polyps at the edge of visibility, and data/annotation scarcity, but the pace and breadth of research are rapidly closing the gap to clinical-grade, globally robust deployments [2309.05987, 2508.09189, 2311.18373, 2603.03682, 2503.24108, 2409.09484, 2405.19672, 2106.04463, 2409.00045, 2503.24138].