
Surgical Workflow Recognition

Updated 18 March 2026
  • Surgical workflow recognition is the automated segmentation of intraoperative data into phases, steps, and actions to support clinical decision-making.
  • It employs advanced spatial and temporal models, including CNNs, LSTMs, and transformers, to address challenges like temporal dependencies and inter-class ambiguities.
  • The approach integrates multimodal inputs—video, kinematics, and audio—to enhance accuracy, interpretability, and robustness in dynamic surgical environments.

Surgical workflow recognition is the computational task of automatically parsing surgical procedures—typically from intraoperative video or multimodal data—into a sequence of semantically meaningful steps (phases, actions, instrument usages). This capability underpins a wide spectrum of computer-assisted intervention (CAI) technologies, including intraoperative decision support, cognitive surgical assistance systems, workflow optimization, skill assessment, and automated documentation. Contemporary research leverages supervised deep learning, temporal modeling, and multimodal representation learning to address challenges such as temporal dependencies, inter-class ambiguities, and robustness in unconstrained surgical environments.

1. Problem Formulation and Datasets

Surgical workflow recognition most commonly targets three hierarchical granularity levels: phases (coarse divisions), steps/tasks (procedure-specific subunits), and actions (instrument–verb–target triplets). Formally, given an input data stream $X = \{x_t\}_{t=1}^T$ (where $x_t$ may be an RGB frame, sensor sample, kinematic vector, etc.), the goal is to assign a label $y_t$ to each time step $t$ from a fixed vocabulary (phases, steps, triplets).
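
The formulation above can be sketched as a toy per-frame labeler, where a hypothetical scoring function stands in for a trained spatial-temporal model and the phase vocabulary is illustrative:

```python
# Toy sketch of the per-frame labeling task: given a stream of feature
# vectors x_t, assign each time step a label y_t from a fixed phase
# vocabulary. The scoring function is a hypothetical stand-in for a
# trained model; phase names are illustrative only.
PHASES = ["preparation", "dissection", "clipping", "extraction"]

def score(x):
    """Hypothetical per-frame class scores (a trained model in practice)."""
    return [sum(x) * (i + 1) % 7 for i in range(len(PHASES))]

def recognize(stream):
    """Map X = {x_t} to labels {y_t} by per-frame argmax over scores."""
    labels = []
    for x in stream:
        s = score(x)
        labels.append(PHASES[s.index(max(s))])
    return labels

stream = [[0.2, 0.1], [0.9, 0.4], [0.5, 0.5]]
y = recognize(stream)
```

In practice the argmax would be replaced by a temporally smoothed decision, as discussed in the architectural sections below.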

Benchmark datasets are central to reproducible evaluation:

| Dataset | Surgery Type | N Videos | Granularity | Modalities | Notable Features |
|---|---|---|---|---|---|
| Cholec80 | Laparoscopic cholecystectomy | 80 | Phases | Video | 7 phases, widely adopted |
| M2CAI16 | Laparoscopic cholecystectomy | 41 | Phases | Video | 8 phases |
| CAT-SG | Cataract surgery | 50 | Steps | Scene graphs | 19 steps, DSG annotation |
| AutoLaparo | Hysterectomy | 21 | Phases | Video | 7 phases, challenging sequence |
| LLS48 | Left lateral sectionectomy | 48 | Phase/Step/Action | Video | Action-triplet (IVT) |
| MISAW | Microsurgical anastomosis | 27 | Phase/Step/Activity | Video/Kinematics | Stereoscopic, hierarchical |
| PmLR50 | Liver resection (Pringle) | 50 | Phases | Video | Real-time/ischemia detection |

Dataset annotations may include tool presence, instrument kinematics, voice commands, or bounding boxes of anatomy. Standard splits for supervised training/testing are applied, and metrics include per-frame accuracy, macro F1, Jaccard index per phase, and task-specific temporal continuity measures.
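
As a minimal sketch of the evaluation metrics named above (per-frame accuracy, macro F1, and per-phase Jaccard), assuming integer phase labels per frame:

```python
# Minimal implementations of common workflow-recognition metrics:
# per-frame accuracy, macro F1, and per-phase Jaccard (intersection
# over union of predicted vs. ground-truth frame index sets).

def frame_accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def per_phase_jaccard(y_true, y_pred, phase):
    true_idx = {i for i, y in enumerate(y_true) if y == phase}
    pred_idx = {i for i, y in enumerate(y_pred) if y == phase}
    union = true_idx | pred_idx
    return len(true_idx & pred_idx) / len(union) if union else 1.0

def macro_f1(y_true, y_pred):
    phases = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in phases:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
```

Production evaluation typically uses library implementations (e.g., scikit-learn) and adds relaxed-boundary variants; the versions here only illustrate the definitions.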

2. Architectural Paradigms

State-of-the-art methods are generally modular, comprising a spatial feature extractor and a temporal modeling component:

  • Spatial Backbone: Modern pipelines use 2D/3D CNNs (e.g., ResNet-50, EfficientNet, Swin Transformer, Vision Transformer), pretrained on large-scale datasets (ImageNet, Kinetics) and fine-tuned on surgical data. Recent approaches use pretrained vision–language models such as CLIP, leveraging prompt learning for domain-specific representation (Kondo, 19 May 2025).
  • Temporal Modeling: Recurrent models (LSTM, GRU), temporal convolutional networks (MS-TCN, TCN), and transformer-based architectures (MRTT, self-attention) aggregate feature sequences for temporal smoothing and phase prediction. Causal and acausal variants exist, with dilated convolutions enabling large temporal receptive fields (Demir et al., 2024, Jeon et al., 18 Jan 2026).
  • Joint and Hierarchical Models: Multi-stage frameworks (e.g., CurConMix+) apply curriculum-guided contrastive learning at increasing granularity, then multi-resolution temporal transformers for phase, step, or action recognition (Jeon et al., 18 Jan 2026). Graph-based models represent surgical scenes as dynamic graphs and use graph neural networks for explicit modeling of object–object and object–action relationships (Holm et al., 16 Dec 2025).
  • Multimodal Fusion: Integration of video, kinematics, speech, and scene graphs is achieved via gated fusion modules, adversarial feature alignment, and joint graph neural architectures (Bai et al., 3 May 2025, Demir et al., 2024). Multimodal systems demonstrate superior robustness to occlusion, sensor noise, and adverse visual conditions.
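
The modular spatial-plus-temporal paradigm can be sketched with toy stand-ins for both stages; the feature extractor and the causal dilated filter below are illustrative simplifications of the CNN and TCN components described above:

```python
# Sketch of the two-stage paradigm: a spatial backbone produces per-frame
# feature values, and a causal dilated temporal layer aggregates past
# context into smoothed per-frame representations. Both stages are toy
# stand-ins for the CNN/transformer components used in practice.

def spatial_backbone(frame):
    """Hypothetical stand-in for a pretrained CNN: frame -> feature."""
    return sum(frame) / len(frame)

def causal_dilated_smooth(feats, dilation=2, kernel=3):
    """Causal dilated 1-D filter: each output sees only past frames,
    sampled at spacing `dilation`, widening the receptive field."""
    out = []
    for t in range(len(feats)):
        taps = [feats[t - k * dilation] for k in range(kernel)
                if t - k * dilation >= 0]
        out.append(sum(taps) / len(taps))
    return out

frames = [[0.0, 0.2], [0.4, 0.6], [1.0, 1.0], [0.8, 1.0]]
feats = [spatial_backbone(f) for f in frames]
smoothed = causal_dilated_smooth(feats)
```

Stacking such layers with increasing dilation is what gives MS-TCN-style models their large temporal receptive fields; acausal variants would also look at future frames.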

3. Temporal Context and Ambiguity Resolution

Robust recognition of phase transitions and ambiguous states is achieved through explicit long-range temporal context and mechanisms for uncertainty modeling:

  • Memory-Augmented Networks: Networks such as TMRNet and SSM-LSTM use external memory banks or sufficient-statistics modules to store supportive features from the long-range past, which are then selectively attended to via self-attention operators (Jin et al., 2021, Ban et al., 2020).
  • Dual-Pathway and Prototype-Based Models: DSTED decouples temporal stabilization (via Reliable Memory Propagation) and discriminative enhancement (via Uncertainty-Aware Prototype Retrieval), adaptively fusing their outputs using a confidence-driven gate. The UPR pathway maintains learnable class-specific prototype banks, and ambiguous frames are refined by matching against these prototypes (Chen et al., 22 Dec 2025).
  • Contrastive Instance Separation: Contrastive Prototype Separation and label-regularized losses (as in PmNet or CurConMix+) increase the inter-class margin and reduce confusion between visually similar intraoperative actions (Guo et al., 2024, Jeon et al., 18 Jan 2026).
  • Uncertainty and Stochastic Modeling: CoStoDet-DDPM introduces a stochastic diffusion branch during training, injecting uncertainty into otherwise deterministic phase recognition; this reduces collapse to dominant patterns and improves generalization to anatomical and procedural variations (Yang et al., 13 Mar 2025).
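
The memory-bank mechanism above can be sketched as scaled dot-product attention of the current frame's feature (query) over stored past features, in the spirit of TMRNet-style long-range memories; the feature values below are illustrative:

```python
import math

# Sketch of memory-augmented temporal attention: the current frame's
# feature (query) attends over a bank of stored past features via
# softmax-weighted dot products, yielding a long-range context vector.
# Feature values are illustrative, not from any real model.

def attend(query, memory):
    """Scaled dot-product attention of one query over a memory bank."""
    d = len(query)
    scores = [sum(q * m for q, m in zip(query, mem)) / math.sqrt(d)
              for mem in memory]
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of memory entries = long-range context vector.
    return [sum(w * mem[i] for w, mem in zip(weights, memory))
            for i in range(d)]

memory = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # stored past features
query = [1.0, 0.0]                               # current frame feature
context = attend(query, memory)
```

Because the query aligns most with the first memory entry, that entry dominates the context vector; real systems learn projections for queries, keys, and values rather than attending on raw features.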

4. Interpretability, Robustness, and Data Efficiency

Novel directions address model transparency, practical robustness under real-world conditions, and annotation bottlenecks:

  • Explainable Recognition: Concept-based explanation frameworks such as SurgX systematically associate model neurons with interpretable surgical concepts derived from curated vocabularies, and trace prediction logic through concept–neuron relationships. This supports “right to explanation” and failure analysis (Kim et al., 21 Jul 2025).
  • Graph-Based and Scene-Centric Models: Interpretable scene-graph-based models (ProtoFlow) cluster dynamic scene graphs into prototypes, enabling few-shot adaptation and node-level diagnosis of workflow deviations and rare complications (Holm et al., 16 Dec 2025).
  • Multimodal Robustness: Adversarial feature disentanglement and graph attention mechanisms are applied to handle missing, corrupted, or noisy sensor inputs (e.g., video occluded by bleeding or device noise), maintaining high performance and graceful degradation (Bai et al., 3 May 2025).
  • Active Data Selection: Long-Range Temporal Dependency (LRTD) based active learning identifies “hard” clips with weak intra-clip dependencies for annotation, enabling >85% performance with only 50% labeled data (Shi et al., 2020).
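
A toy selection rule inspired by the LRTD idea above can be sketched as follows; the dependency measure here (mean adjacent-frame similarity) is a deliberate simplification of the paper's non-local dependency, and the clip features are illustrative:

```python
# Toy active-selection sketch: clips whose frames are least mutually
# predictable (weak intra-clip dependency) are prioritized for
# annotation. Dependency is approximated by mean adjacent-frame
# feature similarity, a simplification for illustration.

def intra_clip_dependency(clip):
    """Mean similarity between adjacent scalar frame features."""
    sims = [1.0 - abs(a - b) for a, b in zip(clip, clip[1:])]
    return sum(sims) / len(sims)

def select_for_annotation(clips, budget):
    """Pick `budget` clip indices with the weakest dependency."""
    ranked = sorted(range(len(clips)),
                    key=lambda i: intra_clip_dependency(clips[i]))
    return ranked[:budget]

clips = [
    [0.1, 0.1, 0.1],   # static scene: strong dependency, "easy"
    [0.0, 0.9, 0.1],   # rapid changes: weak dependency, "hard"
    [0.2, 0.3, 0.2],
]
hard = select_for_annotation(clips, budget=1)
```

Under this rule the rapidly changing second clip is selected first, mirroring the intuition that annotation effort is best spent on clips the temporal model finds least predictable.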

5. Cross-Procedure Generalization and Hierarchical Understanding

Transferring workflow recognition across different procedures, and scaling to fine-grained hierarchical analysis, are addressed as follows:

  • Cross-Surgery Transfer: The Time-Series Adaptation Network (TSAN), combined with self-supervised sequence sorting, enables transfer from cholecystectomy to hemicolectomy, sleeve gastrectomy, or appendectomy, surpassing per-procedure baselines with only 100–200 labeled target videos (Neimark et al., 2021).
  • Hierarchical and Triplet-Level Recognition: Unified frameworks such as CurConMix+ jointly address step, task, and action-level recognition via curriculum contrastive learning and multi-level annotation (LLS48). Features learned for action-triplet recognition are shown to transfer effectively for higher-level step and phase tasks (Jeon et al., 18 Jan 2026).
  • Granularity Limits: Empirical studies (MISAW) demonstrate robust phase recognition (>95% balanced accuracy), adequate step recognition (>80%), but low (<60%) activity-level accuracy, motivating research on improved representations and hierarchy modeling for clinical use (Huaulmé et al., 2021).

6. Quantitative Performance Landscape

Empirical benchmarks distinguish the current performance envelope across datasets and models:

  • For Cholec80, TCN/GRU/Transformer models achieve 82–93% frame-wise accuracy and Jaccard indices above 68% for phase recognition, with best results from multi-stage, memory-augmented, or pretrained vision–language models (Kondo, 19 May 2025, Jeon et al., 18 Jan 2026, Jin et al., 2021).
  • PmNet achieves 95.89% accuracy and phase-averaged F1 ≈ 82.6% for workflow detection in Pringle-maneuver liver resections, surpassing prior TMRNet (+3% accuracy, +15% F1), and integrating real-time ischemia detection (Guo et al., 2024).
  • DSTED increases AutoLaparo frame-level accuracy by +3.51% (to 84.36%) and macro-F1 by +4.88% (to 65.51%) over prior models, with substantial jitter reduction (Chen et al., 22 Dec 2025).
  • Cross-procedure TSAN with sequence sorting exceeds 90% accuracy on target procedures with minimal annotated samples, demonstrating scalable adaptation (Neimark et al., 2021).
  • ProtoFlow yields robust few-shot performance (e.g., 39.5% accuracy on CAT-SG with 1-from-50 videos, vs. GATv2 at 15.4%), supporting interpretable reasoning and rare event detection (Holm et al., 16 Dec 2025).

7. Open Challenges and Future Directions

Key bottlenecks include annotation cost at fine granularities, limited cross-procedure generalization, sub-60% accuracy at the activity level, and robustness to missing or corrupted modalities in deployment. The convergence of memory-augmented temporal models, multimodal representation, conceptual interpretability, and data-efficient learning delineates the current frontier and research trajectory in surgical workflow recognition.
