EndoVis Challenge: Surgical Vision Benchmarks
- The EndoVis Challenge is a community-driven benchmarking initiative that defines standard datasets and tasks for surgical scene understanding, including instrument segmentation, depth estimation, and workflow recognition.
- It leverages multi-center annotated datasets and advanced architectures, such as CNNs and transformers, to address the challenges of achieving real-time, robust AI performance under surgical conditions.
- The initiative drives clinical translation by emphasizing reproducible evaluations, innovative multi-task learning, and robustness to intraoperative artifacts and domain shifts.
The Endoscopic Vision (EndoVis) Challenge is a multi-track, community-driven benchmarking initiative committed to advancing computer vision methodologies—particularly in the context of robot-assisted and minimally invasive surgery. Conceived to catalyze progress in surgical scene understanding, tool localization, workflow analysis, and robust perception, EndoVis now encompasses a spectrum of surgical tasks: instrument segmentation, depth estimation, domain adaptation, activity modeling, phase recognition, and 3D reconstruction, all on rigorously annotated multicenter datasets. EndoVis benchmarks are foundational for reproducible algorithmic assessment and for the development of reliable, clinically translatable AI assistance in endoscopy.
1. Historical Evolution and Motivation
The EndoVis Challenge originated at MICCAI in 2015 with a focus on ex vivo tissue; subsequent years saw the scope expand to realistic da Vinci instrument segmentation (2017), scene segmentation (2018), and dense depth estimation with structured light (SCARED, 2019) (Allan et al., 2020). A recurring theme is the use of well-calibrated datasets annotated by a combination of robot kinematics, instrument CAD meshes, and surgical expert correction. Progressively, EndoVis diversified to include semantic segmentation of anatomy and instruments (CaDIS, cataract surgery, 2020) (Luengo et al., 2021), workflow and phase recognition (PitVis, pituitary surgery, 2023) (Das et al., 2 Sep 2024), action triplet modeling (CholecTriplet, 2021–22) (Nwoye et al., 2023), and context-aware benchmarks integrating surgical phase, keypoint, and instance segmentation (PhaKIR, cholecystectomy, 2024) (Rueckert et al., 22 Jul 2025). This structured expansion reflects the community’s recognition that robust surgical AI must solve multi-faceted, temporally consistent, and contextually rich tasks, not isolated vision problems.
2. Core Tasks and Dataset Design
EndoVis tasks span the following principal categories:
- Instrument Segmentation: Binary or multi-class pixel-wise identification of instruments, complemented by anatomical and device classes in later years. Datasets include the da Vinci Xi robotic system (2017/2018) (Allan et al., 2020), Sinus-Surgery live/cadaver videos (Qin et al., 2020), and multicenter cholecystectomy sets (Rueckert et al., 22 Jul 2025). Annotation combines kinematics-based automatic masks (earlier years), manual polygon labeling (anatomy, devices), and more recently, multi-class, multi-instance maps (instance segmentation).
- Depth Estimation: SCARED (2019) (Allan et al., 2021) provides rectified stereo pairs and structured-light ground-truth point clouds for porcine subjects, with dense metric accuracy computed in millimeters. EndoDepth (2024) extends this with systematic robustness assessment under 16 corruption types (Reyes-Amezcua et al., 30 Sep 2024).
- Workflow, Phase, and Action Triplet Recognition: Challenges such as PitVis (pituitary workflow) (Das et al., 2 Sep 2024) and CholecTriplet (laparoscopic action triplets) (Nwoye et al., 2023) leverage longitudinal video annotation, including per-second surgical step labels, tool usage, and fine-grained ⟨Instrument, Verb, Target⟩ triplet detection (a minimal triplet-record sketch follows this list).
- Keypoint and Instance Localization: The PhaKIR sub-challenge (Rueckert et al., 22 Jul 2025) defines simultaneous frame-wise phase classification, instrument keypoint estimation (variable keypoints and visibilities per instance), and multi-class instance segmentation on annotated human cholecystectomy videos.
- Domain Adaptation and Robustness: SurgVisDom (2020) (Zia et al., 2021) explores transfer from synthetic VR simulation clips to clinical-like porcine videos, with explicit domain adaptation tasks. SegSTRONG-C (2024) probes segmentation robustness to non-adversarial corruptions (bleeding, smoke, low illumination) (Ding et al., 16 Jul 2024).
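Returning to the action-triplet format referenced above, the following is a minimal sketch of what one triplet detection record might look like in Python. The field names and the vocabulary sizes shown (6 instruments, 10 verbs, 15 targets, in the style of CholecT50) are illustrative assumptions, not the challenge's official schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Illustrative vocabulary sizes in the style of CholecT50
# (6 instruments, 10 verbs, 15 targets); not the official schema.
NUM_INSTRUMENTS, NUM_VERBS, NUM_TARGETS = 6, 10, 15

@dataclass
class TripletDetection:
    """One <Instrument, Verb, Target> prediction for a single frame."""
    instrument_id: int  # index into the instrument vocabulary
    verb_id: int        # index into the verb vocabulary
    target_id: int      # index into the target vocabulary
    confidence: float   # detection score in [0, 1]
    # Normalized (x, y, w, h) box if the instrument is localized.
    box: Optional[Tuple[float, float, float, float]] = None

# Hypothetical example: a grasper-retracts-gallbladder style triplet
# with a localized instrument box (all indices are made up).
det = TripletDetection(instrument_id=0, verb_id=2, target_id=5,
                       confidence=0.87, box=(0.41, 0.33, 0.18, 0.22))
```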
Each dataset is released with carefully defined splits, annotation standards, and, in the case of newer tracks, bootstrapped confidence intervals and ranking stability plots.
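As a concrete illustration of this newer evaluation practice, below is a minimal sketch of a percentile-bootstrap confidence interval over per-video scores. The function name and the example Dice values are hypothetical, and the organizers' exact resampling protocols may differ.

```python
import numpy as np

def bootstrap_ci(per_case_scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for a submission's mean score.

    per_case_scores: 1-D array of per-video (or per-frame) metric
    values, e.g. Dice. Resampling test cases with replacement
    approximates the sampling variability that ranking stability
    plots report.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_case_scores, dtype=float)
    means = np.empty(n_boot)
    for b in range(n_boot):
        sample = rng.choice(scores, size=scores.size, replace=True)
        means[b] = sample.mean()
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), (lo, hi)

# Hypothetical per-video Dice scores for one submission.
dice = [0.81, 0.74, 0.88, 0.69, 0.90, 0.77]
mean, (lo, hi) = bootstrap_ci(dice)
print(f"mean Dice {mean:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```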
3. Methodological Advances and Evaluation Protocols
EndoVis has served as the proving ground for a range of architectures:
- Encoder-Decoder CNNs and Transformers: Early successes (U-Net, DeepLab) (Allan et al., 2020, Luengo et al., 2021) have given way to transformer-based methods for both spatial and temporal encoding, e.g., Mask2Former (instance segmentation) (Rueckert et al., 22 Jul 2025), Swin-based spatio-temporal backbones (workflow) (Das et al., 2 Sep 2024), and vision transformers for robust matching (EndoMatcher) (Yang et al., 7 Aug 2025).
- Augmentation and Robustness: Techniques include elastic/projective transforms, AutoAugment policies (Ding et al., 16 Jul 2024), simulation of surgical artifacts, and exposure/color shift modeling (EndoDepth) (Reyes-Amezcua et al., 30 Sep 2024).
- Weak Supervision and Multi-round Bootstrapping: Instrument localization under weak supervision via video-level tool presence events (WS-YOLO) (Wei et al., 2023) and weakly-supervised action triplet detection using CAMs, multi-task heads, and pre-trained detectors (CholecTriplet2022) (Nwoye et al., 2023).
- Multi-task and Context-driven Modeling: Simultaneous modeling of phase, instrument localization, and activity context (PhaKIR (Rueckert et al., 22 Jul 2025), PitVis (Das et al., 2 Sep 2024), CholecTriplet2022 (Nwoye et al., 2023)), cross-task feature sharing, and joint post-processing algorithms.
- Temporal Consistency Enforcement: EMA-based smoothing, harmonic filtering, and transformer sequence decoders are key in workflow and activity recognition (Das et al., 2 Sep 2024, Rueckert et al., 22 Jul 2025); a minimal EMA sketch follows below.
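To illustrate the simplest of these mechanisms, here is a sketch of exponential-moving-average smoothing applied to per-frame phase logits. It assumes a (T, C) logit array and a hand-picked momentum, and is a toy stand-in for the filtering actually used by challenge teams.

```python
import numpy as np

def ema_smooth_phases(frame_logits, momentum=0.9):
    """EMA smoothing of per-frame phase logits before argmax decoding.

    frame_logits: (T, C) array for T frames and C surgical phases.
    Higher momentum damps single-frame flicker, trading latency for
    temporal consistency of the decoded phase sequence.
    """
    logits = np.asarray(frame_logits, dtype=float)
    smoothed = np.empty_like(logits)
    state = logits[0].copy()
    for t in range(logits.shape[0]):
        state = momentum * state + (1.0 - momentum) * logits[t]
        smoothed[t] = state
    return smoothed.argmax(axis=1)  # per-frame phase indices
```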
Metrics are chosen to match task goals: Intersection over Union (IoU), Dice Similarity Coefficient (DSC), Mean Average Precision (mAP at varying IoU thresholds), Hausdorff Distance for mask boundaries, Object Keypoint Similarity (OKS), surgical phase Balanced Accuracy, and, for robustness, composite mDERS (mean Depth Estimation Robustness Score) (Reyes-Amezcua et al., 30 Sep 2024).
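For reference, minimal implementations of three of these metrics are sketched below. The IoU and Dice computations are standard; the OKS follows the COCO formulation (a per-keypoint Gaussian similarity averaged over visible keypoints), which the PhaKIR variant may parameterize differently.

```python
import numpy as np

def iou_dice(pred_mask, gt_mask):
    """IoU and Dice for a pair of binary masks of equal shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    inter = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()
    union = total - inter
    iou = inter / union if union else 1.0
    dice = 2 * inter / total if total else 1.0
    return iou, dice

def oks(pred_kpts, gt_kpts, visibility, scale, kappa):
    """COCO-style Object Keypoint Similarity.

    pred_kpts, gt_kpts: (K, 2) coordinates; visibility: (K,) flags;
    scale: object scale (e.g. sqrt of instance area); kappa: (K,)
    per-keypoint tolerance constants. Only visible keypoints count.
    """
    pred = np.asarray(pred_kpts, dtype=float)
    gt = np.asarray(gt_kpts, dtype=float)
    kappa = np.asarray(kappa, dtype=float)
    d2 = ((pred - gt) ** 2).sum(axis=1)
    sim = np.exp(-d2 / (2.0 * scale**2 * kappa**2))
    vis = np.asarray(visibility) > 0
    return sim[vis].mean() if vis.any() else 0.0
```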
4. Landmark Results, Benchmarks, and Limitations
EndoVis challenges consistently report state-of-the-art performance figures and analyze failure modes:
- Segmentation: Top binary segmentation on da Vinci Xi (2017) exceeds 90% DSC (Qin et al., 2020); multi-class scene segmentation sees a drop to ≈62% mIoU as anatomical classes and devices are introduced (Allan et al., 2020).
- Semantic Segmentation (CaDIS): Anatomy-only segmentation achieves mIoU 0.94; instrument-only 0.53; combined, 0.56 (Luengo et al., 2021). Explicitly modeled boundaries improve performance on thin, complex anatomy.
- Depth Estimation (SCARED/EndoDepth): Modern pipelines approach 2.95–3.60 mm MAE (Allan et al., 2021); under corruption, robustness varies dramatically with mDERS ranging 0.23–0.31 for leading monocular methods (Reyes-Amezcua et al., 30 Sep 2024).
- Workflow Recognition (PitVis): Spatio-temporal multi-task models raise step Macro-F1 from <15% to >60%, instrument Macro-F1 from ~34% to ~42% (Das et al., 2 Sep 2024).
- Multi-task Contextual Analysis (PhaKIR): Transformer-powered instance segmentation achieves DSC 35.5%, mAP 36.05% on challenging cholecystectomy data; phase recognition F1 69.1%, Balanced Accuracy 84.2% (Rueckert et al., 22 Jul 2025).
- Action Triplet Detection (CholecTriplet2022): Top triplet detection mAP ≈35% and joint detection+localization mAP 4.5%; weak supervision via CAMs remains limited for high-precision box detection (Nwoye et al., 2023).
- Robustness to Corruption (SegSTRONG-C): U-Net + AutoAugment achieves DSC/NSD ≈0.79/0.67 under smoke, bleeding, low brightness (Ding et al., 16 Jul 2024); domain-specific drops highlight the need for truly representative corruptions.
Limitations and failure cases are recurrent themes: class imbalance, anatomy/instrument boundary ambiguity, poor generalization to new centers/institutions, lack of temporal modeling in instance segmentation, and susceptibility of DNNs to subtle scene corruptions.
5. Current Directions: Robustness, Generalization, and Real-time Constraints
Recent EndoVis tracks (EndoDepth, SegSTRONG-C, EndoMatcher) explicitly target universal robustness—how models degrade under plausible intraoperative artifacts, domain shifts, and multi-institutional data (Reyes-Amezcua et al., 30 Sep 2024, Ding et al., 16 Jul 2024, Yang et al., 7 Aug 2025). Recommendations include:
- Training on Corrupted and Synthetic Data: Systematic corruption simulation, photorealistic artifact rendering, and domain-randomized synthetic data are key for closing the lab-to-clinic gap (a toy corruption sketch appears after this list).
- Multi-domain and Multi-objective Pre-training: EndoMatcher leverages a massive multi-domain dataset (Endo-Mix6, ~1.2M pairs across six domains) and a progressive multi-objective training scheme to balance feature representation and avoid negative transfer.
- Real-time Pose, Tracking, and Reconstruction: Efficient architectures (47 FPS/1K matches for EndoMatcher (Yang et al., 7 Aug 2025), 86 FPS for EndoWave (Wu et al., 27 Oct 2025)) enable deployment in intraoperative assistance loops.
- Explicit Temporal Modeling: Future EndoVis benchmarks advocate for joint spatial-temporal modeling in segmentation, keypoints, and workflow—potentially integrating robot kinematics, stereo signals, and multi-modal cues.
Across tasks, there is consensus that robust, adaptive, context-driven and temporally aware networks will underpin next-generation surgical AI.
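As a rough illustration of corruption-style augmentation, the sketch below applies two toy corruptions (underexposure and a smoke-like haze) to an RGB frame. Both functions are hypothetical simplifications, far cruder than the photorealistic renderings used in SegSTRONG-C or EndoDepth.

```python
import numpy as np

def corrupt_low_brightness(img, gain=0.4):
    """Scale intensities to mimic an underexposed endoscopic frame.
    img: float array in [0, 1], shape (H, W, 3)."""
    return np.clip(img * gain, 0.0, 1.0)

def corrupt_smoke(img, density=0.5, haze_color=0.8, seed=0):
    """Alpha-blend a smooth gray haze field over the frame to mimic
    electrocautery smoke (a crude stand-in for rendered artifacts)."""
    rng = np.random.default_rng(seed)
    h, w = img.shape[:2]
    # Low-frequency alpha map: coarse noise upsampled by repetition.
    coarse = rng.random((h // 32 + 1, w // 32 + 1))
    alpha = np.kron(coarse, np.ones((32, 32)))[:h, :w, None] * density
    return np.clip((1.0 - alpha) * img + alpha * haze_color, 0.0, 1.0)

# Hypothetical training-time usage: corrupt a frame on the fly.
frame = np.random.rand(256, 320, 3)
aug = corrupt_smoke(corrupt_low_brightness(frame), density=0.6)
```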
6. Impact, Standards, and Community Practices
EndoVis is not only a technical contest but serves as the field standard for dataset design, annotation practice, unbiased benchmarking, and transparent algorithmic reporting. Adherence to guidelines such as Metrics Reloaded and BIAS (Rueckert et al., 22 Jul 2025) ensures replicable, fair, and robust comparative analyses. All major datasets (e.g., CholecT50, EndoNeRF, SCARED, PitVis, PhaKIR) are released under open licensing protocols with documented annotation, ethical approval, and, increasingly, multi-institutional provenance.
EndoVis Tools and Evaluation Protocols

| Component | Description | Notable Implementation |
|---|---|---|
| Dataset standardization | Multi-center, multi-class annotation, bootstrapping | PhaKIR (Rueckert et al., 22 Jul 2025), SCARED (Allan et al., 2021) |
| Metrics protocols | Composite scores (DSC, mAP, mDERS, OKS, BA, HD95) | EndoDepth (Reyes-Amezcua et al., 30 Sep 2024), PhaKIR (Rueckert et al., 22 Jul 2025) |
| Open-source code, Docker templates | Facilitate community engagement, reproducibility | CholecTriplet2022 (Nwoye et al., 2023) |
| Multi-track, temporally linked benchmarking | Enables cross-task, downstream integration | PitVis (Das et al., 2 Sep 2024), SegSTRONG-C (Ding et al., 16 Jul 2024) |
The EndoVis Challenge defines and catalyzes the technical frontier for scene understanding in surgical vision, encouraging convergence on standardized datasets, objective evaluation, robust multi-task algorithms, and clinically meaningful, reproducible results.
7. Future Directions and Open Problems
Several avenues merit further exploration:
- Temporal, Spatio-Contextual Integration: Unified spatial-temporal and cross-task architectures (e.g., tool tracking and workflow with mutual priors).
- Universal Robustness: Simulation-in-the-loop, uncertainty-driven correction, and adaptive normalization for all major artifact types.
- Multi-modal Fusion: Incorporating kinematic signals, stereo/depth inputs, and audio for comprehensive real-time guidance.
- Human-centric Annotation and Usage: Expansion to human datasets, risk structure maps, explainable assistance, and federated multicenter data acquisition.
- Benchmark Expansion and Standards: Regular updates to datasets, inclusion of new surgical procedures, richer annotation targets (scene graphs, biophysical signals).
EndoVis remains the definitive community reference for benchmarking, algorithmic innovation, and deployment standards in computer vision for endoscopy, catalyzing progress from lab to operating theater.