CholecT50: Surgical Action Triplet Benchmark

Updated 15 December 2025
  • The dataset is a large-scale, clinically acquired corpus of laparoscopic cholecystectomy videos annotated with instrument-action-target triplets for rigorous surgical action recognition.
  • It provides detailed, frame-level annotations, standardized data splits, and comprehensive evaluation metrics including mAP and F1-scores that support fine-grained analysis.
  • Extensions enable instance segmentation and generative modeling, addressing challenges like class imbalance and zero-shot recognition of novel triplets.

The CholecT50 dataset is a large-scale, clinically acquired corpus of laparoscopic cholecystectomy videos comprehensively annotated at the frame level with fine-grained “surgical action triplet” labels. Each annotation captures the fundamental interaction between an instrument, a surgical verb or action, and the anatomical target, formalized as $\langle \text{instrument}, \text{verb}, \text{target} \rangle$. Designed as a benchmark dataset for computer vision research in surgical workflow analysis, CholecT50 has become the reference resource for robust, fine-grained activity recognition, zero-shot reasoning, and, via extensions, instance segmentation and generative modeling in surgical environments (Nwoye et al., 2021, Nwoye et al., 2022, Nwoye et al., 2022, Alabi et al., 23 Jun 2024, Nwoye et al., 12 Jul 2024, Sharma et al., 25 Mar 2025).

1. Dataset Composition and Annotation Schema

CholecT50 consists of fifty full-procedure laparoscopic cholecystectomy videos, primarily sourced from the Cholec80 dataset (45 videos) and supplemented with five additional in-house procedures. Videos range in resolution from 480×854 to 1920×1080 pixels and are down-sampled to 1 frame per second (fps) for annotation. The total dataset comprises approximately 100,900–101,000 annotated frames, corresponding to roughly 28 hours of recorded operative time (Nwoye et al., 2021, Nwoye et al., 2022, Nwoye et al., 12 Jul 2024).
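
For orientation, the sketch below shows what 1 fps sampling looks like with OpenCV; the file name is a hypothetical placeholder, and the released dataset already ships frames at this rate, so this is illustrative rather than a required preprocessing step.

```python
# Hypothetical sketch: sample a surgical video at 1 fps, the rate used for
# CholecT50 annotation. File names and paths are illustrative.
import cv2

def sample_at_1fps(video_path: str):
    """Yield (second_index, frame) pairs, one frame per second of video."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0  # fall back if metadata is missing
    step = round(fps)                        # native frames per sampled frame
    idx, kept = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            yield kept, frame
            kept += 1
        idx += 1
    cap.release()

for sec, frame in sample_at_1fps("VID01.mp4"):  # hypothetical file name
    pass  # e.g., run a triplet-recognition model on `frame`
```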

Annotation Structure

Each frame is labeled with one or more triplets capturing fine-grained tool–anatomy interactions:

  • Instrument: e.g., grasper, hook, scissors, clipper, bipolar, irrigator
  • Verb (Action): e.g., grasp, dissect, retract, coagulate, clip, cut, aspirate, irrigate, pack, null
  • Target (Tissue/Structure): e.g., gallbladder, cystic duct, cystic artery, blood vessel, fluid, liver, omentum, peritoneum, abdominal wall/cavity, gut, specimen bag, adhesion, null

After clinical curation, 100 distinct $\langle \text{instrument}, \text{verb}, \text{target} \rangle$ triplet classes are retained, representing plausible operative actions. The annotation format encodes instruments, verbs, targets, and triplet classes as multi-hot binary vectors, as sketched below. On average, each frame contains 1.6 annotated triplets (Nwoye et al., 2021, Nwoye et al., 2022).
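
A minimal sketch of this multi-hot encoding, assuming NumPy; the class indices and the example triplet pairing are hypothetical:

```python
import numpy as np

N_TRIPLETS = 100  # retained triplet classes in CholecT50

def multi_hot(active_ids, num_classes=N_TRIPLETS):
    """Encode the triplet classes present in one frame as a binary vector."""
    y = np.zeros(num_classes, dtype=np.float32)
    y[list(active_ids)] = 1.0
    return y

# Hypothetical frame with two co-occurring triplets, e.g.
# <grasper, retract, gallbladder> and <hook, dissect, gallbladder>:
frame_label = multi_hot([0, 17])  # class indices are illustrative
assert frame_label.sum() == 2
```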

Class Balance

CholecT50 exhibits heavy class imbalance. For example, “grasper” appears in roughly 71% of frames and “specimen bag” in roughly 6%; the most frequent triplet (“grasper, retract, gallbladder”) occurs over 48,000 times, while others are rare, producing a Gini coefficient of roughly 0.76 on triplet frequencies (Nwoye et al., 2022).
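
For concreteness, the Gini coefficient over class frequencies can be computed as below; the counts are hypothetical stand-ins, not the actual CholecT50 tallies:

```python
import numpy as np

def gini(counts):
    """Gini coefficient of a frequency distribution (0 = uniform, 1 = maximal imbalance)."""
    x = np.sort(np.asarray(counts, dtype=np.float64))  # ascending order
    n = x.size
    index = np.arange(1, n + 1)
    return np.sum((2 * index - n - 1) * x) / (n * np.sum(x))

# Hypothetical long-tailed counts for a few triplet classes:
print(gini([48000, 9000, 2500, 600, 120, 40, 8]))  # high value for a long tail
```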

2. Data Splitting Protocols and Accessibility

Standardized data splits have been established to facilitate rigorous benchmarking and reproducibility. The splits include (Nwoye et al., 2022, Nwoye et al., 2022):

Split Type        Train  Val  Test  Purpose
Rendezvous (RDV)  35     5    10    Primary canonical split
Challenge Split   45     -    5     CholecTriplet2021 challenge (val drawn from train)
5-Fold CV         5×10   -    -     Cross-validation over the full 50 videos

Splits are stratified by procedure length to ensure balanced distribution of case complexity. Annotations, preprocessing scripts, and evaluation code are available through the CAMMA team and Grand Challenge web portals, subject to a research-only license (Nwoye et al., 2022).
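
A hedged sketch of the split bookkeeping; the video identifiers are placeholders, not the official per-split assignment, which should be taken from the CAMMA release:

```python
# Hypothetical video IDs standing in for the 50 CholecT50 procedures.
VIDEOS = [f"VID{i:02d}" for i in range(1, 51)]

# 5-fold cross-validation: 5 disjoint folds of 10 videos each.
FOLDS = [VIDEOS[i::5] for i in range(5)]

for k, held_out in enumerate(FOLDS):
    train = [v for v in VIDEOS if v not in held_out]
    print(f"fold {k}: {len(train)} train videos, {len(held_out)} test videos")
```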

3. Benchmarking Protocols and Evaluation Metrics

CholecT50 underpins the evaluation of fine-grained recognition systems, with all principal metrics formalized and implemented in the ivtmetrics Python library (Nwoye et al., 2022).

Recognition Metrics

  • Average Precision (AP): Calculated per class as the area under the precision-recall curve.
    • $\mathrm{AP}_I$, $\mathrm{AP}_V$, $\mathrm{AP}_T$: for instrument, verb, and target, respectively.
    • $\mathrm{AP}_{IVT}$: triplet AP, the main metric.
  • Mean Average Precision (mAP): Mean of AP across all triplet classes:

$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{AP}_i$$

  • F1-Score:

$$F_1 = \frac{2pr}{p + r}$$

where $p$ is precision and $r$ is recall.
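
A minimal sketch of these recognition metrics using scikit-learn, with random stand-in labels and scores; the official benchmark relies on the ivtmetrics package, whose component-disentangled and video-wise averaging conventions this global computation does not reproduce:

```python
import numpy as np
from sklearn.metrics import average_precision_score, f1_score

rng = np.random.default_rng(0)
N_FRAMES, N_CLASSES = 200, 100                       # illustrative sizes
y_true = rng.integers(0, 2, (N_FRAMES, N_CLASSES))   # multi-hot labels
y_score = rng.random((N_FRAMES, N_CLASSES))          # model confidences

# Per-class AP (area under the precision-recall curve), then mAP.
ap = [average_precision_score(y_true[:, c], y_score[:, c]) for c in range(N_CLASSES)]
map_ivt = float(np.mean(ap))

# F1 from thresholded predictions, macro-averaged over classes.
f1 = f1_score(y_true, (y_score > 0.5).astype(int), average="macro", zero_division=0)
print(f"mAP={map_ivt:.3f}  F1={f1:.3f}")
```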

Association and Localization Metrics

  • Triplet Association Scores (TAS): Quantify joint recognition and localization errors across categories such as LM, pLM, IDS, IDM, MIL, RFP, and RFN (Nwoye et al., 2022).
  • Detection AP and IoU: For segmentation/localization extensions, AP is computed over predicted bounding boxes or masks with standard IoU thresholds.
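
A small sketch of the IoU overlap criterion underlying detection AP, for axis-aligned boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-union of boxes given as (x1, y1, x2, y2); returns a value in [0, 1]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # intersection width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # intersection height
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# A prediction counts as a true positive at threshold t if IoU >= t (e.g., 0.5).
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143
```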

4. Specialized Zero-Shot and Base-to-Novel Benchmarks

A key challenge addressed using CholecT50 is zero-shot recognition of novel triplets. The “base-to-novel” protocol divides observed triplets into exclusive train (base) and test (novel) sets. Two established splits, used especially for vision-language modeling studies, are (Sharma et al., 25 Mar 2025):

  • Unseen-Target (UT): Novel triplets have previously unseen targets (e.g., cystic artery, peritoneum).
    • Base: 36 triplets; ~45k train, ~3.5k val, ~12k test frames.
    • Novel: 18 triplets; ~700 val, ~1.9k test frames.
  • Unseen-Instrument-Verb (UIV): Novel triplets have unobserved instrument-verb pairs.
    • Base: 28 triplets; ~18k train, ~1.7k val, ~4.3k test frames.
    • Novel: 21 triplets; ~1.0k val, ~3.7k test frames.

For both, any frame containing both base and novel triplets is excluded to prevent data leakage. Models are trained exclusively on base triplets and evaluated zero-shot on novel triplets. Performance is reported as F1@3 (top-3 predictions per frame) and mAP for base, novel, and harmonic mean (HM) scores (Sharma et al., 25 Mar 2025).
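
A short sketch of this bookkeeping: the leakage guard that drops mixed frames, and the harmonic mean used to combine base and novel scores (triplet IDs are hypothetical):

```python
def harmonic_mean(base: float, novel: float) -> float:
    """HM of base and novel scores, as reported in the benchmark tables."""
    return 2 * base * novel / (base + novel) if (base + novel) > 0 else 0.0

def keep_frame(frame_triplets, base_ids, novel_ids):
    """Exclude frames containing both base and novel triplets (leakage guard)."""
    s = set(frame_triplets)
    return not (s & set(base_ids) and s & set(novel_ids))

# Reproduces the Unseen-Target F1@3 harmonic mean reported below:
print(round(harmonic_mean(61.71, 39.78), 2))  # 48.38
```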

Example Reported Results (fine-CLIP, Unseen-Target setting):

Metric     Base    Novel   HM
F1@3 (%)   61.71   39.78   48.38
mAP (%)    31.72   32.17   31.95

5. Extensions: Segmentation and Generative Modeling

While the original CholecT50 release does not include segmentation labels, CholecInstanceSeg augments a subset of CholecT50 with pixel-precise mask annotations for tool instance segmentation (Alabi et al., 23 Jun 2024). CholecInstanceSeg provides 30,998 frames from CholecT50, annotated for up to seven tool classes (including “Snare”), and includes rigorous quality controls (manual review, human-in-the-loop correction, inter-annotator panoptic quality above 90).

Instance segmentation methods such as Mask R-CNN and Mask2Former establish baseline reference points on this extension (e.g., Mask2Former achieves 0.682 mAP). This enables research into simultaneous triplet recognition, spatial localization, and object segmentation within surgical video scenes.

The dataset is also used for text-to-image modeling in the surgical domain. Using CholecT50’s action triplet annotations, studies have formulated conditional image synthesis setups with T5-based triplet text embeddings and developed instrument-centric class balancing methods to address heavy instrument imbalance (Nwoye et al., 12 Jul 2024). For instance, “Surgical Imagen” achieves an FID of 3.7 and a CLIP score of 26.8% for photorealistic, action-aligned sample generation conditioned on triplet prompts.
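
As a rough illustration of triplet-conditioned text embedding, the sketch below encodes a triplet prompt with a T5 encoder from Hugging Face Transformers; the checkpoint (t5-small) and prompt wording are assumptions, not the Surgical Imagen configuration:

```python
# Hedged sketch: embed a triplet prompt with a T5 text encoder. The
# checkpoint and prompt format are illustrative only.
import torch
from transformers import AutoTokenizer, T5EncoderModel

tok = AutoTokenizer.from_pretrained("t5-small")   # small checkpoint for the sketch
enc = T5EncoderModel.from_pretrained("t5-small")

prompt = "grasper retract gallbladder"            # <instrument, verb, target>
batch = tok(prompt, return_tensors="pt")
with torch.no_grad():
    emb = enc(**batch).last_hidden_state          # shape: (1, seq_len, d_model)

# `emb` would then condition the generator, e.g., via cross-attention layers.
print(emb.shape)
```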

6. Dataset Challenges and Limitations

CholecT50 presents multiple challenges:

  • Long-Tailed Distribution: A handful of common triplets dominate, while many are rare or absent. Instrument frequency varies drastically (e.g., “grasper” roughly 42%, “specimen bag” roughly 0.2%) (Nwoye et al., 12 Jul 2024). Class balancing via log-inverse weighting addresses some model training biases (see the sketch after this list):

$$w_i = \frac{1}{\log(1 + N_i)}$$

  • Ethics and Scope: Patient privacy and the necessity for expert annotation restrict expansion and broader distribution (Nwoye et al., 12 Jul 2024). Annotations cover only cholecystectomy, with rare or complex events underrepresented.
  • Modeling Complexity: Fine-grained triplet recognition is challenging for vision-language and end-to-end models due to contextual ambiguity, unbalanced classes, and strong combinatorial novelty in real-world settings (Sharma et al., 25 Mar 2025).
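
A minimal sketch of the log-inverse weighting from the list above, with hypothetical class counts $N_i$:

```python
import numpy as np

# Hypothetical per-class frequencies N_i; rare classes get larger weights.
counts = np.array([48000, 9000, 2500, 600, 120, 40])
w = 1.0 / np.log1p(counts)   # w_i = 1 / log(1 + N_i)
w /= w.sum()                 # optional: normalize into a sampling distribution
print(np.round(w, 3))
```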

Methods that exploit the two-level annotation hierarchy ($\langle \text{instrument}, \text{target} \rangle$ as root; $\langle \text{instrument}, \text{verb}, \text{target} \rangle$ as leaf) and leverage semantic clustering, specialized adaptation, and hierarchical loss functions demonstrate improved generalization and discrimination (Sharma et al., 25 Mar 2025), as the toy grouping below illustrates.
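
A toy illustration of that two-level grouping, with a few example triplets:

```python
from collections import defaultdict

# A handful of example triplets (strings are illustrative).
triplets = [
    ("grasper", "retract", "gallbladder"),
    ("grasper", "grasp", "gallbladder"),
    ("hook", "dissect", "cystic duct"),
]

# Root <instrument, target> -> leaf <instrument, verb, target> triplets.
hierarchy = defaultdict(list)
for ins, verb, tgt in triplets:
    hierarchy[(ins, tgt)].append((ins, verb, tgt))

for root, leaves in hierarchy.items():
    print(root, "->", [v for (_, v, _) in leaves])
```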

7. Impact and Community Adoption

CholecT50 constitutes the foundation for major surgical action recognition challenges, including CholecTriplet2021 and the benchmarking of models such as Tripnet, Attention Tripnet, and Rendezvous architectures. These have enabled fair comparison and global progress tracking in surgical workflow analysis (Nwoye et al., 2022, Nwoye et al., 2021). The dataset’s structure and benchmarking tools (ivtmetrics) have set a standard for rigorous, component-disentangled, and association-aware evaluation in fine-grained visual scene understanding for the operating room environment.

The open-access availability of extension datasets (e.g., CholecInstanceSeg), comprehensive metrics, and wide adoption across multiple vision-language and generative modeling paradigms affirm CholecT50’s status as an indispensable resource in surgical data science (Alabi et al., 23 Jun 2024, Sharma et al., 25 Mar 2025, Nwoye et al., 12 Jul 2024).
