PhaKIR Challenge: Multi-Task Surgical Vision
- The PhaKIR Challenge is a benchmark integrating phase recognition, keypoint estimation, and segmentation in RAMIS via a multi-center, temporally annotated dataset.
- It enables the development of temporally aware models through clear annotation protocols and real-world surgical variability, supporting the assessment of model generalization.
- The challenge supports joint multi-task learning and context-driven methodologies to advance automation and surgical workflow understanding.
The PhaKIR Challenge is a benchmark competition accompanying the Endoscopic Vision (EndoVis) challenge at MICCAI 2024, aimed at validating automated computer vision systems for surgical scene understanding in robot- and computer-assisted minimally invasive surgery (RAMIS). It addresses the need for robust, temporally aware, and context-driven methods for phase recognition, instrument keypoint estimation, and instrument instance segmentation in endoscopic video, under real-world multi-center conditions. The PhaKIR Challenge is supported by the first unified, multi-institutional dataset jointly annotated for all three tasks, enabling integrative studies of surgical context and instrument localization.
1. Dataset Foundation and Collection Protocol
The PhaKIR Challenge leverages a novel dataset of laparoscopic cholecystectomy videos, extending beyond prior resources by providing multi-task, multi-center, and temporally comprehensive annotations. The dataset consists of thirteen full-length (23–60 min) monocular laparoscopy videos at 1920×1080 px, 25 FPS. These videos, sourced from three medical institutions—TUM University Hospital Rechts der Isar (MRI), Heidelberg University Hospital (UKHD), and Weilheim-Schongau Hospital (KWS)—are characterized by clinical variability and diversity in surgeon technique and endoscopic appearance (Rueckert et al., 22 Jul 2025, Rueckert et al., 9 Nov 2025).
Annotation protocol:
- Surgical phase recognition: Every frame (485,875 frames) carries a label corresponding to one of the seven Cholec80 phases or an “undefined” transition phase.
- Instrument keypoint estimation: For each instrument instance (annotated at 1 fps, 19,435 images), precise pose keypoints are labeled.
- Instrument instance segmentation: At the same time points as the keypoint frames (1 fps), instance-level, pixel-accurate masks are provided.
This annotation strategy enables both fine-grained spatial assessment (segmentation, keypoints) and high-resolution temporal analysis (phase labels), allowing for the exploitation of temporal context across complete surgical procedures.
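A minimal sketch of how one such multi-task sample might be represented is shown below. All field names and shapes are illustrative assumptions, not the official release format:

```python
from dataclasses import dataclass
from typing import Dict, Optional

import numpy as np


@dataclass
class PhaKIRFrameSample:
    """Hypothetical container for one annotated frame (names are illustrative)."""
    video_id: str                 # one of the thirteen procedure videos
    frame_index: int              # frame position in the 25 FPS stream
    phase_label: int              # 0-6 for the seven Cholec80 phases, 7 for "undefined"
    # Spatial annotations exist only on the 1 fps subset of frames:
    keypoints: Optional[Dict[int, np.ndarray]] = None       # instance id -> (K, 2) pixel coords
    instance_masks: Optional[Dict[int, np.ndarray]] = None  # instance id -> (1080, 1920) bool mask

    @property
    def has_spatial_annotations(self) -> bool:
        """True for the 1 fps frames that carry keypoints and masks."""
        return self.keypoints is not None and self.instance_masks is not None
```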
2. Task Formulations and Benchmark Objectives
The Challenge centers on three interrelated tasks:
- Surgical phase recognition: A frame-wise classification problem, assigning each frame one of the seven Cholec80 phases or the undefined transition class. Evaluation employs multi-class accuracy together with macro-averaged precision, recall, and F1, with temporal consistency as a secondary concern due to the contiguous procedural nature of the data.
- Instrument keypoint estimation: Structured regression/localization task, with the goal of predicting the 2D spatial positions of standard semantic keypoints for each visible instrument instance. Metrics of interest include mean Euclidean distance error, normalized localization error, and keypoint detection rates.
- Instrument instance segmentation: Pixel-level segmentation challenge, requiring the separate delineation of each instrument, including overlapping, occluded, or partially visible tools. Main metrics are mean Intersection over Union (mIoU), mean Average Precision (mAP), and Dice coefficient.
The design strongly emphasizes the joint modeling of procedural context and instrument location, motivating the integration of phase information to improve the robustness and interpretability of spatial predictions.
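As a rough illustration of the metrics named above, the following sketch computes macro-F1, mean keypoint error, and per-mask IoU/Dice with NumPy and scikit-learn; the official scoring scripts may differ in matching, normalization, and aggregation details:

```python
from typing import Tuple

import numpy as np
from sklearn.metrics import f1_score


def phase_macro_f1(true_phases: np.ndarray, pred_phases: np.ndarray) -> float:
    """Macro-averaged F1 over all phase classes, computed frame-wise."""
    return float(f1_score(true_phases, pred_phases, average="macro"))


def mean_keypoint_error(true_kpts: np.ndarray, pred_kpts: np.ndarray) -> float:
    """Mean Euclidean distance (pixels) between matched keypoints, arrays of shape (N, 2)."""
    return float(np.linalg.norm(true_kpts - pred_kpts, axis=1).mean())


def mask_iou_and_dice(true_mask: np.ndarray, pred_mask: np.ndarray) -> Tuple[float, float]:
    """IoU and Dice coefficient for one pair of binary instance masks."""
    inter = np.logical_and(true_mask, pred_mask).sum()
    union = np.logical_or(true_mask, pred_mask).sum()
    total = true_mask.sum() + pred_mask.sum()
    iou = inter / union if union else 1.0
    dice = 2 * inter / total if total else 1.0
    return float(iou), float(dice)
```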
3. Context-Driven and Temporally Aware Methodologies
A key innovation of the PhaKIR Challenge is the explicit encouragement and technical enablement of context-aware and temporally integrated models. Unlike prior datasets, the full procedural timeline is preserved for all cases, supporting the use of temporal architectures (e.g., LSTM, GRU, TCN, or transformer variants) and causal or bidirectional models for phase recognition. For the spatial tasks, leveraging phase information as an input or conditioning variable is both supported and anticipated as a performance driver.
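For instance, a per-frame feature extractor followed by a bidirectional recurrent layer is one common temporal design for phase recognition. The sketch below is a generic PyTorch example under that assumption, not the architecture of any particular challenge entry:

```python
import torch
import torch.nn as nn
import torchvision.models as models


class TemporalPhaseRecognizer(nn.Module):
    """Generic CNN + bidirectional GRU phase classifier (illustrative only)."""

    def __init__(self, num_phases: int = 8, hidden_dim: int = 256):
        super().__init__()
        backbone = models.resnet18(weights=None)   # randomly initialized per-frame encoder
        backbone.fc = nn.Identity()                # expose 512-d frame features
        self.backbone = backbone
        self.gru = nn.GRU(512, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_phases)  # 7 phases + "undefined"

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, time, 3, H, W) -> per-frame phase logits (batch, time, num_phases)
        b, t, c, h, w = clip.shape
        feats = self.backbone(clip.reshape(b * t, c, h, w)).reshape(b, t, -1)
        temporal, _ = self.gru(feats)
        return self.classifier(temporal)
```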
By unifying phase, keypoint, and segmentation labels on a single multi-center dataset, the challenge enables architectures that propagate contextual knowledge across both time and task, supporting multi-task learning setups and cross-task supervision strategies. This joint structure is intended to mimic clinical reasoning, where knowledge of current progress, tool usage, and anatomical exposure are interdependent.
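One way to propagate phase context into the spatial tasks is to condition shared decoders on a phase embedding. The following is a speculative sketch of such a multi-task head, with all module names and dimensions chosen purely for illustration:

```python
import torch
import torch.nn as nn


class PhaseConditionedSpatialHead(nn.Module):
    """Illustrative head that fuses a phase embedding into segmentation and keypoint decoders."""

    def __init__(self, feat_channels: int = 256, num_phases: int = 8,
                 num_instrument_classes: int = 7, num_keypoints: int = 5):
        super().__init__()
        self.phase_embed = nn.Embedding(num_phases, feat_channels)
        self.seg_head = nn.Conv2d(feat_channels, num_instrument_classes, kernel_size=1)
        self.kpt_head = nn.Conv2d(feat_channels, num_keypoints, kernel_size=1)  # keypoint heatmaps

    def forward(self, feats: torch.Tensor, phase_ids: torch.Tensor):
        # feats: (batch, C, H, W) backbone features; phase_ids: (batch,) predicted or GT phase
        phase = self.phase_embed(phase_ids)[:, :, None, None]  # broadcast over spatial dims
        conditioned = feats + phase
        return self.seg_head(conditioned), self.kpt_head(conditioned)
```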
4. Evaluation Procedures, Metrics, and Submission Structure
Submissions are evaluated according to the BIAS (Biomedical Image Analysis ChallengeS) guidelines to ensure fair, transparent, and reproducible comparison (Rueckert et al., 22 Jul 2025). Standardization includes:
- Leaderboard metrics: Macro-averaged F1 for phase recognition; mean Euclidean and normalized localization error for keypoint estimation; mIoU, mAP, and Dice for segmentation.
- Dataset splits: Official train/test splits maintain surgeon, center, and patient independence between sets. Full label disclosure occurs post-challenge for general research utility.
- Submission logistics: Participants submit prediction files for each task across the official test set. Automated scripts compute and publish scores.
- Reporting: In line with BIAS, results are published with detailed task-by-task and per-case breakdowns, facilitating analysis of generalization to unseen centers and rare surgical events.
A plausible implication is that multi-center diversity and realistic annotation distribution challenge models to generalize across both technical and biological variability, setting a stringent benchmark for deployment-oriented research.
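To make the per-case reporting concrete, here is a minimal sketch of how per-video scores could be averaged into leaderboard values, assuming a simple dictionary of hypothetical case results; the actual BIAS-compliant pipeline is more elaborate:

```python
from statistics import mean
from typing import Dict


def aggregate_leaderboard(case_scores: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Average per-case metric values (e.g. macro-F1, mIoU, Dice) across test videos."""
    metric_names = next(iter(case_scores.values())).keys()
    return {m: mean(scores[m] for scores in case_scores.values()) for m in metric_names}


# Example usage with hypothetical per-video results:
scores = {
    "video_A": {"phase_macro_f1": 0.81, "mIoU": 0.72, "dice": 0.83},
    "video_B": {"phase_macro_f1": 0.77, "mIoU": 0.69, "dice": 0.80},
}
print(aggregate_leaderboard(scores))
```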
5. Relation to Existing Benchmarks and Advances over Prior Art
Previous datasets and challenges for endoscopic scene understanding have largely addressed isolated tasks—either workflow (phase) recognition, instrument segmentation, or pose estimation—often with limited temporal coverage, static frames, or mono-center data, restricting joint or context-driven modeling. The PhaKIR Challenge is, to the best of the cited authors' knowledge, the first to provide a multi-institutional dataset with frame-synchronous, interrelated annotations for all three tasks (Rueckert et al., 9 Nov 2025).
This enables:
- Joint benchmarking: Synchronized ground truth for three canonical tasks on identical video frames.
- Multi-center evaluation: Systematic assessment of generalization to diverse clinical environments and equipment.
- Realistic procedural coverage: Continuous video from skin incision to closure for each case, supporting sequence-aware learning and evaluation.
- Benchmark extension potential: The resource is available via Zenodo for further studies, promoting reproducibility and comparative research.
6. Impact and Directions for Surgical Scene Understanding
By providing a unified and realistic evaluation platform for temporally aware, context-driven perception models in RAMIS, the PhaKIR Challenge advances the state-of-the-art in surgical scene understanding. Results from the challenge form a new baseline for the development and validation of systems used in surgical training, skill assessment, workflow analysis, and autonomous assistance. The explicit joint annotation format supports research into multi-task, transfer, and continual learning strategies within surgical computer vision.
A plausible implication is that future research can leverage the temporal, multi-task, and multi-center structure of the PhaKIR dataset to address the persistent challenge of robust automation and analytics in real-world surgery settings, with direct translational implications for patient safety and operational efficiency.