
SurgToolLoc Dataset for Robotic Surgery

Updated 2 December 2025
  • SurgToolLoc is a large-scale annotated video dataset offering clip-level weak labels and frame-level strong annotations for surgical tool detection in robotic-assisted surgery.
  • It comprises 24,695 training clips and 93 test clips featuring diverse surgical tasks, providing a robust benchmark with detailed spatial localization for 14 tool types.
  • The dataset supports evaluation through weighted F1-score and COCO-style mAP metrics, enabling research in tool tracking, skill assessment, and camera control automation.

The SurgToolLoc dataset is a large-scale, publicly released corpus of annotated videos designed to benchmark weakly supervised learning algorithms for surgical tool presence detection and spatial localization in robotic-assisted surgery, with an emphasis on data captured from the da Vinci surgical system. The dataset supports research in surgical data science by providing both weakly labeled (clip-level) and strongly annotated (frame-level, spatial) ground-truth labels, facilitating challenges hosted at MICCAI and downstream tasks such as tool tracking, skill assessment, and camera control automation (Zia et al., 2023, Jenke et al., 25 Nov 2025).

1. Dataset Scope and Composition

SurgToolLoc comprises recordings of standardized robotic-assisted surgical training exercises using the da Vinci system. The training set contains 24,695 video clips—each 30 seconds in duration, at 60 frames per second (FPS), and 1280×720 resolution—resulting in approximately 44.5 million frames. These exercises span 11 standardized tasks (e.g., suturing, dissecting, cauterizing) and two animal modalities (anesthetized live pigs and ex vivo porcine organs). Videos capture the endoscopic viewpoint from the surgeon's console; anatomical variation was deemed non-central to the challenge objective.

The test set includes 93 video clips of similar nature, with variable durations (mean ± SD: 747.3 ± 579.9 s), also at 1280×720 resolution. For evaluation, test clips are subsampled to 1 FPS, yielding roughly 7,000–8,000 annotated frames.
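
For illustration, the 1 FPS subsampling can be performed with standard video tooling. The minimal sketch below uses OpenCV and assumes a nominally 60 FPS MP4 clip; the file paths and output layout are illustrative, not part of the dataset specification.

```python
import cv2  # OpenCV, used here only for video decoding and frame export

def extract_frames_at_1fps(video_path: str, out_prefix: str) -> int:
    """Decode a SurgToolLoc clip and keep roughly one frame per second."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 60.0  # fall back to the nominal 60 FPS
    step = max(int(round(fps)), 1)           # keep every `step`-th frame
    kept, idx = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(f"{out_prefix}_{kept:06d}.png", frame)
            kept += 1
        idx += 1
    cap.release()
    return kept

# Hypothetical usage:
# extract_frames_at_1fps("clip_000001.mp4", "frames/clip_000001")
```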

The instrument catalog encompasses four robotic arms (USM1–USM4, with ‘nan’ denoting the uninstrumented camera arm) supporting at most three installable tools per frame. Fourteen tool types are represented: needle_driver, cadiere_forceps, prograsp_forceps, monopolar_curved_scissors, bipolar_forceps, stapler, force_bipolar, vessel_sealer, permanent_cautery_hook_spatula, clip_applier, tip_up_fenestrated_grasper, suction_irrigator, bipolar_dissector, and grasping_retractor (Zia et al., 2023).

2. Annotation Schema

Training labels consist of weak, often noisy, clip-level presence labels supplied as CSV files mapping tool types to arms per clip. A single label is recorded per arm per clip, reflecting continuous tool-installation events. Given the potential for tool occlusion, out-of-view tools, or manual mislabeling, noise is inherent in these annotations.
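
For illustration, these weak clip-level labels can be converted into per-clip multi-hot presence vectors over the 14 tool classes. The sketch below assumes a CSV layout with `clip_name` and `tools_present` columns (one entry per arm, with `nan` for the camera arm); the column names and list format are assumptions, not guaranteed by the release.

```python
import csv
import numpy as np

# The 14 tool classes listed in the dataset description, in a fixed order.
TOOL_CLASSES = [
    "needle_driver", "cadiere_forceps", "prograsp_forceps",
    "monopolar_curved_scissors", "bipolar_forceps", "stapler",
    "force_bipolar", "vessel_sealer", "permanent_cautery_hook_spatula",
    "clip_applier", "tip_up_fenestrated_grasper", "suction_irrigator",
    "bipolar_dissector", "grasping_retractor",
]
CLASS_TO_IDX = {name: i for i, name in enumerate(TOOL_CLASSES)}

def load_weak_labels(csv_path: str) -> dict[str, np.ndarray]:
    """Map each clip name to a 14-dim multi-hot vector of (noisy) tool presence."""
    labels = {}
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            vec = np.zeros(len(TOOL_CLASSES), dtype=np.float32)
            # Assumed format: "[needle_driver, nan, cadiere_forceps, nan]"
            for entry in row["tools_present"].strip("[]").split(","):
                name = entry.strip().strip("'\"")
                if name in CLASS_TO_IDX:          # skip 'nan' / unknown entries
                    vec[CLASS_TO_IDX[name]] = 1.0
            labels[row["clip_name"]] = vec
    return labels
```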

In contrast, the test set receives strong labels. For each frame (sampled at 1 FPS), presence labels are provided together with axis-aligned bounding boxes that tightly encircle the visible instrument clevis (or tip and shaft when the clevis is ambiguous). Annotations were curated by a cohort of 30 expert annotators, proceeding through two rounds of review to maximize internal consistency and reduce inter-rater divergence. The instrument names displayed on the video user interface are blurred to prevent annotation leakage. The output annotation for spatial localization consists of per-frame dictionaries mapping detected tool classes to their bounding-box coordinates $(x, y, \text{width}, \text{height})$.
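
As a concrete but hypothetical illustration of this structure, a single annotated frame might be represented as follows; the field names and coordinate values are invented for clarity and do not reflect the literal on-disk schema.

```python
# One hypothetical annotated frame: tool class -> list of (x, y, width, height)
# boxes, with coordinates in pixels of the 1280x720 frame.
frame_annotation = {
    "frame_id": "test_clip_042_000137",
    "boxes": {
        "needle_driver": [(412.0, 265.0, 118.0, 74.0)],
        "cadiere_forceps": [(803.5, 391.0, 95.0, 66.0)],
    },
}
```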

Quality-control measures include automated extraction of presence labels from recorded robot installation/uninstallation events and standardized bounding-box conventions, with annotation exceptions (e.g., an ill-defined clevis) explicitly documented (Zia et al., 2023).

3. Data Splits and Challenge Protocols

The official split provides 24,695 training clips with only noisy presence labels and 93 test clips with hidden, exhaustively annotated ground truth. No official held-out validation set exists; participants implement their own data partitioning.

Challenge protocols are divided into two categories:

  • Category 1: Weakly supervised classification—predict per-frame, multi-label tool presence at 1 FPS. Performance is measured using weighted F₁-score across all 14 classes.
  • Category 2: Weakly supervised detection—predict per-frame presence and bounding boxes. Performance is measured using mean Average Precision (mAP) over multiple Intersection-over-Union (IoU) thresholds (0.5 to 0.95 in increments of 0.05).

Participants submit inference-ready Docker containers for evaluation on the private test set server; rankings are determined by test set performance (Zia et al., 2023).

4. Preprocessing and Data Augmentation

Given the noisiness and granularity of clip-level labels, high-performing solutions universally rely on extensive preprocessing and augmentation:

  • Preprocessing: Removal of black borders; cropping or blurring UI regions; frame down-sampling from 60 FPS to 1 FPS; resizing to model-dependent spatial dimensions (e.g., 640×512, 375×300, 224×224); pixel normalization using ImageNet statistics.
  • Augmentation techniques:
    • Geometric: Horizontal flips, random crops, rotations, resizing with or without aspect ratio preservation.
    • Photometric: Color jitter (HSV), brightness/contrast adjustments, Gaussian blur, noise, fog simulation.
    • Mix-based: Mixup, mosaic.
    • Advanced: Cutout, RandAugment, channel shuffle, random perspective warp.

This suggests that the dataset’s inherent label noise and lack of dense supervision drive the proliferation of regularization strategies in model training (Zia et al., 2023).
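
A minimal per-frame pipeline in this spirit, sketched with torchvision transforms, is shown below; participating teams used a variety of libraries, and the input size and parameter values here are illustrative rather than any team's exact configuration.

```python
from torchvision import transforms

# Illustrative preprocessing + augmentation for 1 FPS frames that have already
# had black borders removed and UI regions cropped or blurred.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]

train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),        # geometric: random crop + resize
    transforms.RandomHorizontalFlip(p=0.5),                     # geometric: horizontal flip
    transforms.RandomRotation(degrees=10),                      # geometric: small rotation
    transforms.ColorJitter(0.2, 0.2, 0.2, 0.05),                # photometric: brightness/contrast/saturation/hue
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),   # photometric: blur
    transforms.RandAugment(),                                   # advanced: RandAugment policy
    transforms.ToTensor(),
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),          # ImageNet statistics
    transforms.RandomErasing(p=0.25),                           # advanced: cutout-style erasing
])
# Mixup and mosaic operate on batches of images and labels, so they are usually
# applied in the data loader or training loop rather than in this per-image pipeline.
```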

5. Evaluation Metrics

For classification (Category 1), the principal metric is the weighted F₁-score, with per-class weights proportional to each class's prevalence in the test set. For class $c$:

$$\text{Precision}_c = \frac{TP_c}{TP_c + FP_c}, \qquad \text{Recall}_c = \frac{TP_c}{TP_c + FN_c}$$

$$F1_c = \frac{2 \cdot \text{Precision}_c \cdot \text{Recall}_c}{\text{Precision}_c + \text{Recall}_c}$$

Overall:

$$\text{weighted } F_1 = \frac{\sum_c w_c \, F1_c}{\sum_c w_c}$$

where $w_c$ is the number of positives for class $c$.
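
This per-class, support-weighted average coincides with scikit-learn's `average='weighted'` option, so the Category 1 metric can be sketched as follows, assuming binary indicator matrices of shape `[num_frames, 14]` (one column per tool class):

```python
import numpy as np
from sklearn.metrics import f1_score

def category1_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Weighted multi-label F1 over the 14 tool classes.

    y_true, y_pred: binary arrays of shape [num_frames, 14].
    Each class is weighted by its number of positive frames in y_true (its support).
    """
    return f1_score(y_true, y_pred, average="weighted", zero_division=0)

# Example with random predictions on 100 frames:
# rng = np.random.default_rng(0)
# print(category1_score(rng.integers(0, 2, (100, 14)), rng.integers(0, 2, (100, 14))))
```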

For detection/localization (Category 2), performance is measured by COCO-style mean Average Precision (mAP) across multiple IoU thresholds, with IoU computed as:

$$\text{IoU}(\hat{B}, B) = \frac{|\hat{B} \cap B|}{|\hat{B} \cup B|}$$

Average Precision at threshold $\tau$:

$$AP(\tau) = \sum_k (\text{Recall}_k - \text{Recall}_{k-1}) \cdot \text{Precision}_k$$

Final mAP:

$$\text{mAP} = \frac{1}{|T|} \sum_{\tau \in T} AP(\tau), \qquad T = \{0.50, 0.55, \dots, 0.95\}$$
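
A simplified, single-class sketch of these quantities is given below; it uses greedy, confidence-ordered matching of predictions to ground truth and the step-wise AP sum above, whereas the official COCO-style evaluation adds further details (e.g., 101-point interpolation and per-class averaging).

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two (x, y, w, h) boxes."""
    ax2, ay2 = box_a[0] + box_a[2], box_a[1] + box_a[3]
    bx2, by2 = box_b[0] + box_b[2], box_b[1] + box_b[3]
    iw = max(0.0, min(ax2, bx2) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(ay2, by2) - max(box_a[1], box_b[1]))
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def average_precision(preds, gts, tau):
    """preds: list of (confidence, box); gts: list of boxes; tau: IoU threshold."""
    preds = sorted(preds, key=lambda p: -p[0])   # highest confidence first
    matched = [False] * len(gts)
    tps, fps = [], []
    for _, box in preds:
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if matched[j]:
                continue
            score = iou(box, gt)
            if score > best_iou:
                best_iou, best_j = score, j
        if best_j >= 0 and best_iou >= tau:
            matched[best_j] = True
            tps.append(1)
            fps.append(0)
        else:
            tps.append(0)
            fps.append(1)
    tp_cum, fp_cum = np.cumsum(tps), np.cumsum(fps)
    recall = tp_cum / max(len(gts), 1)
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1e-9)
    # Step-wise sum over recall increments, mirroring the AP(tau) formula above.
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

def mean_ap(preds, gts):
    thresholds = np.linspace(0.50, 0.95, 10)     # 0.50, 0.55, ..., 0.95
    return float(np.mean([average_precision(preds, gts, t) for t in thresholds]))
```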

These choices benchmark both pose-agnostic tool identification and precise spatial localization under weak supervision (Zia et al., 2023).

6. Downstream Use Cases and Extensions

SurgToolLoc’s design supports a range of downstream medical data science tasks. For example, XiCAD, a camera activation detection pipeline, adapts SurgToolLoc for UI-level semantic annotation: 2,265 SurgToolLoc videos furnished 7,699 manually labeled UI tile crops for model training (labeled “no camera,” “inactive camera,” or “active camera”), and 24,983 SurgToolLoc frames were included in frame-level meta-label testing. XiCAD fine-tunes a ResNet18 (pretrained on ImageNet) with a weighted binary cross-entropy loss and achieves macro-averaged F₁-scores between 0.993 and 1.000 for camera activation detection. No data augmentation was employed for these UI crops; only per-tile normalization was performed (Jenke et al., 25 Nov 2025).
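
A sketch of such a fine-tuning setup is shown below. Treating “active camera” versus the other tile states as a single binary target, and the particular `pos_weight`, learning rate, and optimizer are assumptions for illustration rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

# Illustrative setup: an ImageNet-pretrained ResNet18 whose final layer is replaced
# for the UI-tile task and trained with a weighted binary cross-entropy loss.
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 1)   # single logit: active camera or not

# pos_weight up-weights the rarer positive ("active camera") tiles; 4.0 is a placeholder.
criterion = nn.BCEWithLogitsLoss(pos_weight=torch.tensor([4.0]))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(images: torch.Tensor, targets: torch.Tensor) -> float:
    """One optimization step on a batch of normalized UI tile crops."""
    optimizer.zero_grad()
    logits = model(images).squeeze(1)            # shape [batch]
    loss = criterion(logits, targets.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```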

This suggests that SurgToolLoc’s scale, well-documented diversity, and annotation fidelity directly enable auxiliary annotation pipelines for analysis of intraoperative video metadata.

7. Availability, Licensing, and Future Directions

All SurgToolLoc data, including training/test splits, annotations, and associated challenge materials, are distributed under open licenses as per publication-specific terms. Follow-on datasets and benchmarks (e.g., XiCAD UI labels) are publicly available, as are preprocessing and model inference scripts (Jenke et al., 25 Nov 2025).

Limitations identified in related use cases include a lack of “active camera” samples for rare UI configurations, and single-annotator protocols that preclude inter-rater quantification. Future recommendations include expanding the protocol to multi-annotator labeling, increasing procedural and UI coverage, and extending annotation to additional robotic system variants.

By combining scale (tens of millions of frames), multi-modal labeling, and robust challenge protocols, SurgToolLoc establishes a reference point for weakly supervised computer vision in robot-assisted surgery and supports broad reproducibility and extensibility in surgical data science research (Zia et al., 2023, Jenke et al., 25 Nov 2025).
