
PhaKIR: Surgical Phase & Instrument Dataset

Updated 16 November 2025
  • The paper presents a unified multi-task dataset integrating phase, instrument segmentation, and keypoint annotations for comprehensive surgical scene understanding.
  • The dataset comprises high-definition, temporally coherent laparoscopic cholecystectomy videos from multiple German hospitals to benchmark vision models.
  • The resource standardizes evaluation across surgical phase recognition, segmentation, and keypoint estimation, enabling joint learning and domain shift analysis.

The Surgical Procedure Phase, Keypoint, and Instrument Recognition (PhaKIR) dataset is a comprehensive, multi-institutional video resource for benchmarking surgical workflow recognition, instrument pose estimation, and instance segmentation in laparoscopic minimally invasive surgery. Developed for the Endoscopic Vision (EndoVis) Challenge at MICCAI 2024, PhaKIR uniquely integrates temporally resolved, frame-synchronized annotations across three related tasks within each of its full-length, real-world surgical videos. By consolidating phase, instrument instance, and keypoint labels onto the same sequences, PhaKIR establishes a unified standard for evaluating context- and temporally aware vision models in robot- and computer-assisted minimally invasive surgery (RAMIS). The dataset and corresponding challenge facilitate reproducible benchmarking and exploration of joint modeling paradigms in surgical scene understanding (Rueckert et al., 22 Jul 2025, Rueckert et al., 9 Nov 2025).

1. Composition and Collection

PhaKIR comprises original high-definition video of laparoscopic cholecystectomy procedures collected at three German hospitals:

| Medical Center | Number of Videos | Video IDs |
| --- | --- | --- |
| TUM University Hospital Rechts der Isar (MRI) | 6 | 1, 2, 3, 4, 5, 11 |
| Heidelberg University Hospital (UKHD) | 1 | 7 (HeiChole2.mp4) |
| Weilheim-Schongau Hospital (KWS) | 1 | 13 |

All videos were acquired at 1920 × 1080 px and 25 frames per second using monocular endoscopes. Procedures range from 28:31 to 57:03 (min:s) in duration (mean ≈ 40 min), yielding 486,875 raw frames covering 323:55 minutes after pre-processing. Frames outside the abdominal cavity were systematically excluded, with anonymization cuts defined by accompanying CSV files. No proprietary modalities were used; all data are standard RGB video. The eight videos constitute the released training set, spanning diverse real-world surgical activity and inter-center variability.

2. Annotation Protocols

Unified three-layer annotation was executed for every video, enabling tightly coupled multi-task evaluation:

2.1 Surgical Phase Recognition

  • Taxonomy: Eight classes; seven from Cholec80 (Preparation, Calot’s Triangle Dissection, Clipping & Cutting, Gallbladder Dissection, Gallbladder Packaging, Cleaning & Coagulation, Extraction), and an eighth “Undefined” class for ambiguous intervals.
  • Procedure: Manual marking of phase start and end timestamps enabled automatic frame-wise propagation of phase labels across the 25 fps sequences, yielding 485,875 labeled frames in total (see the sketch below).
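
Expanding the interval annotations to frame-wise labels is mechanical. A minimal sketch of that expansion, assuming an illustrative (phase, start_frame, end_frame) CSV layout rather than the exact released schema:

```python
import csv

def expand_phase_intervals(csv_path, num_frames):
    """Expand (phase, start, end) interval rows into one label per frame.

    The column names below are illustrative; check the released
    Video_xx_Phases.csv for the actual layout.
    """
    labels = ["Undefined"] * num_frames  # eighth class covers unlabeled spans
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            start = int(row["start_frame"])
            end = int(row["end_frame"])
            for i in range(start, min(end + 1, num_frames)):
                labels[i] = row["phase"]
    return labels
```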

2.2 Instrument Instance Segmentation

  • Classes: Nineteen distinct surgical instruments (e.g., Grasper, Clip-Applicator, Suction-Rod, Argonbeamer).
  • Format: Annotators used CVAT polygon tools; exported PNG masks encode class (R/G channels) and instance ID (B channel), with consistent RGB mappings per instrument and unique B values per instance (a decoding sketch follows this list).
  • Density: 1 fps sampling (every 25th frame), amounting to 19,435 mask images across the dataset.
  • Validation: A three-stage review pipeline yielded a multi-instance, multi-class Dice score of 83.64% for segmentation agreement.
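
The channel encoding makes masks easy to decode without a custom loader. A minimal sketch, assuming the documented R/G-class and B-instance convention; the concrete (R, G) → class-name lookup table ships with the dataset and is not reproduced here:

```python
import numpy as np
from PIL import Image

def decode_mask(png_path):
    """Split a mask PNG into a class map and an instance map.

    Follows the documented convention: R and G channels jointly identify
    the instrument class, the B channel holds the instance ID.
    """
    rgb = np.asarray(Image.open(png_path).convert("RGB"))
    r = rgb[..., 0].astype(np.uint16)
    g = rgb[..., 1].astype(np.uint16)
    class_key = r * 256 + g       # one integer per (R, G) class colour
    instance_id = rgb[..., 2]     # unique B value per instance
    return class_key, instance_id
```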

2.3 Instrument Keypoint Estimation

  • Specification: Each instrument class carries 2–4 keypoints, defined as EndPoint (EP), ShaftPoint (SP), Tip1 (T1), and (where applicable) Tip2 (T2). Grouping by class:
    • 4 keypoints: 12 classes (e.g., Scissor, Blunt Grasper)
    • 3 keypoints: 4 classes
    • 2 keypoints: 4 classes
  • Format: CVAT point tool, stored as COCO-style JSON with visibility attributes (visible/occluded/not available); 1 fps sampling (19,435 frames). A parsing sketch follows this list.
  • Validation: Three-stage visual review, including temporal consistency, but without formal quantitative inter-annotator agreement.
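
Because the annotations follow the COCO convention, they can be parsed with standard-library code. The field names below (images/annotations, flat [x, y, visibility] triplets) are the usual COCO ones and should be verified against the released Video_xx_Keypoints.json:

```python
import json

def load_keypoints(json_path):
    """Group COCO-style keypoint annotations by frame.

    Assumes the standard COCO layout: an "annotations" list whose
    "keypoints" field is a flat [x, y, visibility, ...] array.
    """
    with open(json_path) as f:
        coco = json.load(f)
    frames = {}
    for ann in coco["annotations"]:
        kps = ann["keypoints"]
        # Regroup the flat array into (x, y, visibility) triplets.
        triplets = [tuple(kps[i:i + 3]) for i in range(0, len(kps), 3)]
        frames.setdefault(ann["image_id"], []).append(triplets)
    return frames
```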

These three annotation layers are frame-aligned and thus enable synchronized multi-task learning.

3. Data Structure, Distribution, and Access

PhaKIR data adhere to FAIR principles and are organized to facilitate scalable experimentation:

  • File Structure: Each video (Video_xx.zip) contains:
    • Video_xx.mp4 (original video)
    • Video_xx_Cuts.csv (valid intra-abdominal frame indices)
    • Video_xx_Phases.csv (frame-to-phase mappings)
    • Video_xx_Masks.zip (segmentation PNGs in subfolders of 1000)
    • Video_xx_Keypoints.json (framewise keypoints plus visibility flags)
  • Naming and Alignment: Frame indices are zero-padded and aligned across all annotation files and masks for easy correspondence (a loading sketch follows this list).
  • Splits: The eight videos comprise the training set of the PhaKIR Challenge. The corresponding test set remains undisclosed; user-defined splits (e.g., leave-one-video-out or leave-one-hospital-out) are recommended for local validation protocols.
  • Access: PhaKIR is distributed under CC-BY-NC-SA 4.0 via Zenodo (https://zenodo.org/records/15740619). Users must cite the dataset, the challenge overview [rueckert2025comparative], and the HeiChole challenge publication [wagner2023comparative].
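
Given the aligned, zero-padded frame indices, a small indexing helper suffices to join masks with their per-frame annotations. A sketch under the assumption that mask filenames end in the zero-padded frame index; adjust the parsing if the released naming scheme differs:

```python
from pathlib import Path

def index_masks(mask_dir):
    """Map frame indices to mask paths using the zero-padded filenames.

    Assumes each mask filename ends in the zero-padded frame index
    (e.g. an _000025.png suffix); this is an illustrative convention.
    """
    return {int(p.stem.split("_")[-1]): p
            for p in Path(mask_dir).rglob("*.png")}
```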

4. Benchmark Tasks and Evaluation Metrics

PhaKIR is designed for benchmarking in three core vision tasks:

4.1 Surgical Phase Recognition

  • Task: Framewise classification over C = 8 classes.
  • Metrics:

    • Accuracy:

    \mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\{y_i = \hat{y}_i\}

    • Per-class F1-score:

    \mathrm{F1}_c = \frac{2\,\mathrm{TP}_c}{2\,\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}

    for each c = 1, …, 8.
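
Both metrics reduce to a few NumPy operations; the sketch below mirrors the definitions above:

```python
import numpy as np

def phase_metrics(y_true, y_pred, num_classes=8):
    """Frame-wise accuracy and per-class F1, per the definitions above."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    acc = float(np.mean(y_true == y_pred))
    f1 = np.zeros(num_classes)
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1[c] = 2 * tp / denom if denom else 0.0
    return acc, f1
```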

4.2 Instrument Instance Segmentation

  • Task: Pixel-wise class and instance prediction.
  • Metrics:

    • Intersection over Union (IoU) per class:

    \mathrm{IoU}_c = \frac{|G_c \cap P_c|}{|G_c \cup P_c|}

    • Mean IoU (mIoU):

    \mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^{C}\mathrm{IoU}_c

    • Dice score (per instance k):

    \mathrm{Dice}_k = \frac{2\,|G_k \cap P_k|}{|G_k| + |P_k|}

    • Mean Average Precision (mAP):

    \mathrm{AP}_c = \int_{0}^{1} p_c(r)\,dr, \qquad \mathrm{mAP} = \frac{1}{C}\sum_{c=1}^{C}\mathrm{AP}_c
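
The overlap metrics are simple to compute from label maps. mAP additionally requires matching predicted to ground-truth instances across confidence thresholds, which is typically delegated to COCO-style tooling and omitted from this sketch:

```python
import numpy as np

def iou_per_class(gt, pred, num_classes):
    """Per-class IoU and mean IoU over pixel label maps."""
    ious = np.full(num_classes, np.nan)   # NaN marks classes absent from both
    for c in range(num_classes):
        g, p = (gt == c), (pred == c)
        union = np.logical_or(g, p).sum()
        if union:
            ious[c] = np.logical_and(g, p).sum() / union
    return ious, float(np.nanmean(ious))

def dice_instance(gt_mask, pred_mask):
    """Dice overlap between one ground-truth and one predicted instance."""
    inter = np.logical_and(gt_mask, pred_mask).sum()
    total = gt_mask.sum() + pred_mask.sum()
    return 2 * inter / total if total else 0.0
```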

4.3 Instrument Keypoint Estimation

  • Task: Regression of keypoint (x, y) coordinates plus visibility.
  • Metric: Percentage of Correct Keypoints (PCK) at threshold α (relative to image size):

    \mathrm{PCK}(\alpha) = \frac{1}{K}\sum_{i=1}^{K} \mathbf{1}\left(\|d_i\|_2 \leq \alpha \cdot \max(H, W)\right)

    where d_i is the Euclidean error for keypoint i and K is the number of annotated keypoints.
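
Given precomputed Euclidean errors for the annotated keypoints, PCK is essentially a one-liner:

```python
import numpy as np

def pck(errors, alpha, height, width):
    """PCK(alpha): share of keypoints with error within alpha * max(H, W).

    `errors` holds the per-keypoint Euclidean distances ||d_i||_2 for all
    annotated keypoints.
    """
    return float(np.mean(np.asarray(errors) <= alpha * max(height, width)))
```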

Temporal consistency metrics (e.g., path smoothness) were discussed but not formalized.

Quantitative baseline performance values are not included in the dataset descriptor, but are reported in the PhaKIR Challenge overview [rueckert2025comparative], which involved 66 registered teams.

5. Dataset Statistics, Variability, and Annotation Quality

5.1 Surgical Phase and Instrument Distributions

  • Phase Proportions: Gallbladder Dissection (P4) represents the longest duration (~30% of annotated frames); Clipping & Cutting (P3) and Extraction (P7) have shorter durations (~10% each).
  • Instrument Frequency: The Grasper appears in approximately 12,000 of the 19,435 segmentation/keypoint-annotated frames; the Clip-Applicator in ~3,500 frames; several instruments (e.g., Argonbeamer) appear in fewer than 500 frames.

5.2 Inter-Center Variability

  • The dataset exhibits diversity in surgical style, phase timing, and instrument utilization. This reflects genuine procedural practice differences across MRI, UKHD, and KWS. Such domain variability makes PhaKIR suitable for assessing both generalization and domain shift (e.g., via leave-one-hospital-out evaluation).
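
A leave-one-hospital-out protocol follows directly from the video-to-center mapping in Section 1; the sketch below simply hard-codes that table:

```python
# Video-to-center mapping taken from the table in Section 1.
HOSPITAL_VIDEOS = {
    "MRI": [1, 2, 3, 4, 5, 11],
    "UKHD": [7],
    "KWS": [13],
}

def leave_one_hospital_out():
    """Yield (held_out_center, train_ids, test_ids) folds."""
    for held_out, test_ids in HOSPITAL_VIDEOS.items():
        train_ids = sorted(v for h, vids in HOSPITAL_VIDEOS.items()
                           if h != held_out for v in vids)
        yield held_out, train_ids, test_ids
```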

5.3 Annotation Quality

  • Segmentation: Concordance was quantitatively assessed, yielding a multi-instance, multi-class Dice score of 83.64% after the final review.
  • Keypoints: No numeric inter-annotator agreement was published, but all annotations underwent three-stage visual and temporal consistency review.
  • Phases: Validated for plausibility but without formal kappa metrics.

6. Strengths, Limitations, and Prospective Applications

Strengths

  • Unified, Multi-Task Annotation: Simultaneous phase, segmentation, and keypoint labels on identical sequences enable benchmarking of joint learning and temporally aware models.
  • Complete Procedure Context: Full-length surgeries allow design and assessment of models that exploit long-range temporal context.
  • Multi-Center Variability: Inter-institutional data supports evaluation of cross-site and domain-adaptive methods.
  • Standards Compliance: Data structure meets FAIR principles; comprehensive annotation protocols ensure reproducibility.

Limitations

  • Procedure Type: Dataset is limited to eight cholecystectomy cases.
  • Geographic Scope: All data originate from German centers; results may not generalize to other practices.
  • Modality Restrictions: Only standard RGB video is included—no depth, fluorescence, or kinematic data.

Use Cases

  • Design and benchmarking of multi-task and temporally aware models for surgical scene understanding.
  • Transfer learning applications, especially leveraging relationships to related datasets (e.g., Cholec80, EndoVis).
  • Domain shift and generalizability studies through center- or video-based data splits.
  • Surgical phase transition analysis, instrument usage analytics, and anomaly detection pipelines.

7. Significance in Surgical Data Science

PhaKIR closes a critical gap in surgical computer vision benchmarking by providing a dense, temporally consistent, multi-institutional video dataset annotated for the three principal dimensions of surgical workflow recognition. Its structure explicitly encourages the study of temporal context, joint learning, and domain shift—topics central to advancing robust, deployable computer vision solutions in surgical settings (Rueckert et al., 22 Jul 2025, Rueckert et al., 9 Nov 2025).
