PitVis-2023 Challenge: Surgical Workflow Recognition

Updated 18 March 2026
  • PitVis-2023 Challenge is a benchmark competition for surgical workflow recognition, focusing on step and instrument identification in endoscopic pituitary surgery videos.
  • The challenge provides a publicly released, rigorously annotated dataset on which spatio-temporal and multi-task deep learning models are evaluated against domain-specific complexities.
  • State-of-the-art approaches demonstrate significant improvements over spatial baselines, setting new standards in Macro-F1 and Edit-score.

The PitVis-2023 Challenge is a benchmark competition in the field of computer vision for surgical workflow recognition, specifically targeting step and instrument identification in videos of endoscopic pituitary surgery. The challenge distinguishes itself from prior minimally invasive surgical video challenges through its domain-specific complexities: a restricted surgical workspace, frequent switching among instruments and procedural phases, and the resulting demand for precise, temporally aware model predictions. It was presented as part of the Endoscopic Vision 2023 Challenge at the MICCAI-2023 conference in Vancouver. The publicly released dataset and competition outcomes set new standards for workflow recognition, demonstrating that state-of-the-art spatio-temporal and multi-task deep learning models, when appropriately adapted, outperform spatial-only baselines for this demanding task (Das et al., 2024).

1. Dataset Construction and Annotation

The PitVis-2023 dataset consists of 25 training and 8 testing videos of endoscopic transsphenoidal approach (eTSA) surgeries, sourced from University College London Hospitals (NHNN, UCL) and recorded between 2018 and 2022. All videos were standardized to 1280×720 resolution at 24 FPS (downsampled if necessary), with frames sampled at 1 FPS as PNGs for model input. Each video was de-identified by blurring non-patient regions. The training set has a mean duration of 72.8 minutes per video (excluding “out-of-patient” frames), capturing significant procedural heterogeneity due to differing surgeon practices and instrumentation over several years.
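
The frame preparation described above is mechanical and easy to reproduce. A minimal OpenCV sketch, assuming placeholder paths (the released dataset already ships extracted frames, so this is illustrative only):

```python
import os

import cv2  # OpenCV


def extract_frames(video_path: str, out_dir: str, target_fps: float = 1.0) -> None:
    # Sample a native ~24 FPS video down to 1 FPS PNG frames at 1280x720,
    # mirroring the preprocessing described above. Paths are placeholders.
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 24.0
    stride = max(1, round(native_fps / target_fps))
    index = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % stride == 0:
            frame = cv2.resize(frame, (1280, 720))  # standardised resolution
            cv2.imwrite(os.path.join(out_dir, f"{saved:06d}.png"), frame)
            saved += 1
        index += 1
    cap.release()
```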

Annotations involved rigorous cross-validation and consensus: surgical steps were defined by a Delphi process (14 steps, 12 of which had sufficient training samples; steps 1–10, 12, 14 evaluated), while instrument use covered 19 classes (including “no instrument”) with multi-label annotations (up to two instruments per frame). Step labels were independently provided by two neurosurgical trainees, with discrepancies adjudicated by a consultant, and instrument annotations were produced by a third-party service then verified by a neurosurgical trainee/researcher and consultant. Annotation granularity was per second, with structured CSV files containing (video_id, time_s, step_label, inst1_label, inst2_label).
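
Given that layout, loading the annotations and building multi-label instrument targets is straightforward. A minimal pandas/NumPy sketch; the column names follow the description above and may not match the released files exactly:

```python
import numpy as np
import pandas as pd

NUM_INSTRUMENTS = 19  # 19 classes, including "no instrument"


def load_annotations(csv_path: str) -> pd.DataFrame:
    # One row per second: video_id, time_s, step_label, inst1_label, inst2_label.
    df = pd.read_csv(csv_path)
    return df.sort_values(["video_id", "time_s"]).reset_index(drop=True)


def instrument_multihot(df: pd.DataFrame) -> np.ndarray:
    # Up to two instruments per frame -> 19-dim multi-hot target per row.
    y = np.zeros((len(df), NUM_INSTRUMENTS), dtype=np.float32)
    for col in ("inst1_label", "inst2_label"):
        labels = df[col].to_numpy()
        rows = np.flatnonzero(~pd.isna(labels))  # rows where this slot is filled
        y[rows, labels[rows].astype(int)] = 1.0
    return y
```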

2. Evaluation Metrics, Baselines, and Task Definitions

Three primary tasks were posed:

  • Task 1 (Step Recognition): Recognize surgical steps, evaluated as the mean of Macro-F1 over 12 steps and Levenshtein-based Edit-score,

$$M_1 = \frac{\mathrm{Macro\text{-}F1}_{12\text{-}steps} + \mathrm{Edit\text{-}score}_{12\text{-}steps}}{2}$$

  • Task 2 (Instrument Recognition): Recognize instruments (multi-label, 19 classes), evaluated by Macro-F1,

$$M_2 = \mathrm{Macro\text{-}F1}_{19\text{-}instruments}$$

  • Task 3 (Multi-task): Joint step and instrument recognition,

$$M_3 = \frac{M_1 + M_2}{2}$$

Macro-F1 was computed per standard definitions:

  • Precision: $P_i = \frac{TP_i}{TP_i + FP_i}$
  • Recall: $R_i = \frac{TP_i}{TP_i + FN_i}$
  • F1: $F_i = \frac{2 P_i R_i}{P_i + R_i}$
  • Macro-F1: $\mathrm{Macro\text{-}F1} = \frac{1}{N} \sum_{i=1}^{N} F_i$

Edit-score is based on the Levenshtein distance between run-length-compressed label sequences, normalized by the longer sequence length and inverted so that higher is better [Lea et al. 2016].
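
Both metrics follow directly from these definitions. A minimal reference implementation returning scores in [0, 100]; the organisers' exact handling of classes absent from a video may differ:

```python
import numpy as np


def macro_f1(y_true, y_pred, num_classes):
    # Mean of per-class F1 scores (as defined above), returned in [0, 100].
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1s = []
    for c in range(num_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom > 0 else 0.0)
    return 100.0 * float(np.mean(f1s))


def compress(seq):
    # Run-length compression: [1, 1, 2, 2, 3] -> [1, 2, 3].
    return [s for i, s in enumerate(seq) if i == 0 or s != seq[i - 1]]


def levenshtein(a, b):
    # Standard dynamic-programming edit distance with a rolling row.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]


def edit_score(y_true, y_pred):
    # Normalized, inverted Levenshtein distance on compressed sequences.
    a, b = compress(list(y_true)), compress(list(y_pred))
    return 100.0 * (1.0 - levenshtein(a, b) / max(len(a), len(b)))


# Per-video Task 1 score: the mean of the two components, both in [0, 100].
# m1 = 0.5 * (macro_f1(y_true, y_pred, num_classes=12) + edit_score(y_true, y_pred))
```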

Two purely spatial baselines illustrated the challenge’s domain difficulty:

  • GMAI: TinyViT+EVA-02 ensemble (spatial transformers only); Macro-F1 ≈ 3.7% for steps (no Edit-score consistency), ≈ 27.8% for instruments.
  • DOLPHINS: XCiT + DenseNet201 CNNs; Macro-F1 ≈ 15.2% for steps.

3. Top-Performing Architectures and Methodologies

High-performing submissions consistently employed spatio-temporal feature modeling and multi-task learning strategies, each addressing specific workflow recognition challenges posed by endoscopic pituitary surgery.

CITI (1st, Task 1 & Task 3):

  • Stage 1: Spatio-temporal encoder (ST-E) processes 20-frame windows (0.8 s at 24 FPS) with a Swin Transformer backbone and two-layer Multi-Head Self-Attention (MHSA), outputting frame-wise features and preliminary logits.
  • Stage 2: Autoregressive surgical transformer (ARST) for step decoding, processing 80-frame ST-E features alongside shifted step token embeddings. Architectural innovations include initial MHSA over concatenated queries, mutual MHSA with key/value separation, and frame-wise positional encoding. No post-processing; temporal consistency is learned within ARST. Cross-Entropy loss is equally weighted for steps and instruments, with end-to-end spatio-temporal training.
  • Domain adaptations include short-term motion capture to resolve visually similar classes and an autoregressive sequence model to propagate surgical context over longer horizons, yielding an Edit-score gain of ≈20 points over CNN+LSTM approaches (a minimal sketch of the first stage follows below).
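
A minimal PyTorch sketch of the first-stage idea: per-frame backbone features refined by two self-attention layers feeding step and instrument heads. Dimensions are illustrative and the ARST decoder is omitted:

```python
import torch
import torch.nn as nn


class TemporalEncoder(nn.Module):
    # Sketch of the ST-E idea: spatial features from a backbone (Swin in the
    # real CITI model) refined by two MHSA layers over a 20-frame window.
    # feature_dim and head counts here are illustrative assumptions.

    def __init__(self, feature_dim=768, num_heads=8, num_steps=12, num_insts=19):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feature_dim, nhead=num_heads, batch_first=True
        )
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)  # two MHSA layers
        self.step_head = nn.Linear(feature_dim, num_steps)
        self.inst_head = nn.Linear(feature_dim, num_insts)

    def forward(self, frame_features):
        # frame_features: (batch, window=20, feature_dim) from the spatial backbone.
        h = self.temporal(frame_features)
        return self.step_head(h), self.inst_head(h)  # per-frame logits


x = torch.randn(2, 20, 768)  # 20-frame window of backbone features
step_logits, inst_logits = TemporalEncoder()(x)
print(step_logits.shape, inst_logits.shape)  # (2, 20, 12), (2, 20, 19)
```

The key point is that the temporal layers see a 20-frame context, letting short-term motion cues disambiguate frames that are spatially near-identical.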

TSO-NCT (2nd, Task 1):

  • Spatial encoding via ConvNeXt-Tiny, followed by an LSTM applied to a 512-frame window for temporal propagation, passing softmax logits onward to preserve history. Step smoothing requires ≥7 consecutive identical predictions before a transition is accepted, enforcing robustness against noise (a minimal version of this rule is sketched below). Techniques from the Sufficient Statistics Model [Ban et al. 2021] are leveraged for temporal consistency.
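
The ≥7-frame rule is easy to state precisely. A minimal online variant; the team's exact bookkeeping may differ:

```python
def smooth_steps(preds, min_run=7):
    # Only commit to a new step once it has been predicted for min_run
    # consecutive frames; shorter bursts are treated as noise.
    smoothed = []
    current, candidate, run = None, None, 0
    for p in preds:
        if current is None:
            current = p  # first frame initialises the state
        if p == current:
            candidate, run = None, 0
        elif p == candidate:
            run += 1
            if run >= min_run:  # sustained disagreement: accept the transition
                current, candidate, run = p, None, 0
        else:
            candidate, run = p, 1
        smoothed.append(current)
    return smoothed


# A 3-frame blip of step 5 inside step 4 is suppressed; the final sustained
# run of step 5 is accepted at its 7th frame:
print(smooth_steps([4] * 10 + [5] * 3 + [4] * 5 + [5] * 9))
```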

SDS-HD (1st, Task 2):

  • Ensembles three spatial backbones (ResNet152, EfficientNet-B7, Swin-Large), each paired with temporal LSTM windows. Predictions are balanced-ensembled per backbone. Multi-label binary cross-entropy loss guides training, with explicit data balancing: minority instrument classes are upsampled, dominant ones downsampled (one plausible realisation is sketched below). Augmentations (including CLAHE, colour jitter) and mAP as a subsidiary selection metric support learning for rare classes.
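
One plausible realisation of this balancing uses PyTorch's WeightedRandomSampler, weighting each frame by the inverse frequency of its rarest active instrument; the team's exact scheme is not specified above:

```python
import torch
from torch.utils.data import WeightedRandomSampler


def make_balanced_sampler(multihot: torch.Tensor) -> WeightedRandomSampler:
    # multihot: (num_frames, 19) instrument targets. Each frame is weighted
    # by the inverse frequency of its rarest active class, so minority
    # instruments are upsampled and dominant ones downsampled in expectation.
    inv_freq = 1.0 / multihot.sum(dim=0).clamp(min=1.0)
    weights = (multihot * inv_freq).max(dim=1).values
    weights[weights == 0] = weights[weights > 0].median()  # frames without labels
    return WeightedRandomSampler(weights.double(), num_samples=len(weights))


# Multi-label binary cross-entropy, as described for SDS-HD:
criterion = torch.nn.BCEWithLogitsLoss()
```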

UNI-ANDES-23 (2nd, Task 3):

  • A dual transformer-based spatio-temporal encoder architecture combines MViT-24 (24-frame videos, global pooling), DINO-24, and Swin-Large (final-frame local/global features). StepFormer and InsFormer modules, each an 8-frame, 4-layer, 8-head attention transformer, decode steps and instruments respectively, with a FusionFormer fusing their features. A joint CE/BCE loss trains the step and instrument heads, with data balancing and multi-task weighting. Extensive harmonic smoothing (750×) is applied to step probabilities in post-processing (an illustrative stand-in is sketched below).
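
The exact form of the harmonic smoothing is not detailed above. As an illustrative stand-in for heavy iterated post-hoc smoothing of step probabilities, a repeated [1, 2, 1]/4 kernel:

```python
import numpy as np


def smooth_probs(probs: np.ndarray, iterations: int = 750) -> np.ndarray:
    # Repeatedly average each frame's class probabilities with its neighbours.
    # Illustrative only: UNI-ANDES-23's actual smoothing may differ in form.
    p = probs.copy()  # (num_frames, num_classes)
    for _ in range(iterations):
        padded = np.pad(p, ((1, 1), (0, 0)), mode="edge")
        p = 0.25 * padded[:-2] + 0.5 * padded[1:-1] + 0.25 * padded[2:]
    return p


# Smoothed per-frame step predictions from hypothetical probabilities:
steps = smooth_probs(np.random.rand(4000, 12)).argmax(axis=1)
```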

4. Quantitative Performance and Error Analysis

Task-wise performance, expressed as mean ± standard deviation across 8 private test videos, illustrated large improvements over spatial baselines:

Task   Metric   Rank   Team           Score (mean ± SD)
1      M1       1st    CITI           62.9 ± 9.7
1      M1       2nd    TSO-NCT        53.7 ± 11.2
1      M1       3rd    UNI-ANDES-23   48.3 ± 7.3
2      M2       1st    SDS-HD         41.7 ± 15.4
2      M2       2nd    SANO           41.6 ± 6.3
3      M3       1st    CITI           49.0 ± 9.4
3      M3       2nd    UNI-ANDES-23   40.5 ± 7.7

Relative to best spatial-only models, step Macro-F1 improved from ≈3.7% to ≈61.1% (>50 pt gain), and instrument Macro-F1 from ≈27.8% to 41.7% (>10 pt gain).

Failure mode analysis showed that most step misclassifications occurred between consecutive phases; step 8 (“haemostasis”) was over-predicted, and steps 3, 6, and 9 were hardest (Macro-F1 < 40%). For instruments, errors predominantly involved rare tools misclassified as “no instrument” or “suction,” while dominant classes (0, 16) were reliably distinguished. No formal hypothesis testing for statistical significance was reported; per-video standard deviations exceeding 10 points argue for Wilcoxon or bootstrap evaluation in future comparisons.
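
Such testing is cheap to run on the eight per-video scores: scipy.stats.wilcoxon provides the signed-rank test, and a paired bootstrap can be sketched directly. The scores below are hypothetical placeholders:

```python
import numpy as np


def paired_bootstrap_pvalue(scores_a, scores_b, n_resamples=10_000, seed=0):
    # Paired bootstrap on per-video score differences: estimate how often
    # model A fails to beat model B under resampling. A sketch of the kind
    # of test the text calls for, not any organiser-specified protocol.
    rng = np.random.default_rng(seed)
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    idx = rng.integers(0, len(diffs), size=(n_resamples, len(diffs)))
    resampled_means = diffs[idx].mean(axis=1)
    return float(np.mean(resampled_means <= 0))  # one-sided p-value estimate


# Hypothetical per-video M1 scores for two teams (8 test videos each):
p = paired_bootstrap_pvalue([64, 58, 71, 55, 66, 60, 69, 60],
                            [55, 51, 62, 49, 58, 54, 61, 40])
print(p)
```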

5. Comparative Analysis of Spatio-Temporal and Multi-Task Approaches

Spatio-temporal models exhibit clear superiority in surgical workflow recognition for this domain. Temporal modeling permits disambiguation of visually similar but contextually distinct frames, as sequential action and instrument cues resolve short-term ambiguities. Multi-task learning further boosts performance via shared feature learning: instrument cues inform step prediction and vice versa. Online temporal smoothing methods, such as threshold smoothing function (TSF) and harmonic smoothing, provide additional improvements in Edit-score without compromising frame-level metrics.

Compared to spatial-only single-task models, which are limited by static information and suffer from low Edit-score consistency, the combination of spatio-temporal encoding and multi-task prediction delivers state-of-the-art results for both procedural phase and instrument identification.

6. Recommendations and Benchmarking Guidelines

For future workflow recognition challenges, the following practices are recommended:

  • Distribute standardized, de-identified 1 FPS videos at 720p resolution.
  • Encourage spatio-temporal transformer or CNN+RNN models with end-to-end temporal feature propagation.
  • Employ multi-task prediction heads with balanced loss weighting, $L_{\mathrm{total}} = \alpha L_{\mathrm{step}} + \beta L_{\mathrm{instr}}$, with $\alpha = \beta = 1$ as default (see the sketch after this list).
  • Integrate online temporal smoothing (TSF, harmonic) for sequence-level consistency.
  • Implement data-balancing strategies (upsample rare classes, use weighted sampling) and domain-appropriate augmentations (CLAHE, blur, colour jitter).
  • Release per-class and minor-class statistics to monitor recognition of frequent and infrequent labels.
  • Require statistical hypothesis testing (e.g., Wilcoxon, bootstrap) on per-video results to ensure rigor in comparative evaluation.
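
The recommended loss weighting has a direct PyTorch formulation, pairing cross-entropy for single-label steps with binary cross-entropy for multi-label instruments; a minimal sketch with the default α = β = 1:

```python
import torch
import torch.nn as nn


class MultiTaskLoss(nn.Module):
    # L_total = alpha * L_step + beta * L_instr, per the recommendation above.

    def __init__(self, alpha: float = 1.0, beta: float = 1.0):
        super().__init__()
        self.alpha, self.beta = alpha, beta
        self.step_loss = nn.CrossEntropyLoss()      # single-label steps
        self.inst_loss = nn.BCEWithLogitsLoss()     # multi-label instruments

    def forward(self, step_logits, step_targets, inst_logits, inst_targets):
        return (self.alpha * self.step_loss(step_logits, step_targets)
                + self.beta * self.inst_loss(inst_logits, inst_targets))


# Shapes for a batch of 8 frames: 12 step classes, 19 instrument classes.
loss_fn = MultiTaskLoss()
loss = loss_fn(torch.randn(8, 12), torch.randint(0, 12, (8,)),
               torch.randn(8, 19), torch.randint(0, 2, (8, 19)).float())
print(float(loss))
```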

7. Impact and Outlook

The PitVis-2023 Challenge demonstrates that, when spatio-temporal architectures and multi-task objectives are adapted for domain-specific constraints (e.g., frequent step/instrument switching, occlusion, high class imbalance), substantial improvements accrue for surgical workflow modeling: gains of more than 50 points in procedural step Macro-F1 and more than 10 points in instrument Macro-F1 over spatial baselines. The challenge defines a new state-of-the-art for workflow recognition in endoscopic pituitary surgery, establishes robust benchmarking practices, and provides a publicly available dataset to catalyze further methodological advances (Das et al., 2024).

References

Das, A., et al. (2024). PitVis-2023 Challenge: Workflow Recognition in Videos of Endoscopic Pituitary Surgery.