
Video-to-IAT Model for Surgical Workflow

Updated 26 November 2025
  • Video-to-IAT is a surgical AI system that maps video segments to structured Instrument–Action–Target triplets, facilitating process analysis and automated feedback.
  • It employs a multi-stage methodology including LLM-based clustering and dictionary mapping to normalize instrument, action, and tissue labels.
  • Empirical evaluations demonstrate improved AUC metrics and enhanced language model feedback quality in automated intraoperative guidance.

A video-to-IAT model is an artificial intelligence system that maps raw surgical video sequences to structured Instrument–Action–Target (IAT) representations, providing a formalized bridge between perceptual video input and clinically meaningful procedural semantics. These models are foundational to workflow understanding, automated intraoperative feedback, and semantic search in surgical data science, and have been recently formalized to enable alignment of surgical process modeling with modern LLMs and knowledge-driven evaluation (Nasriddinov et al., 19 Nov 2025).

1. Formal Definition and Scope of the Video-to-IAT Model

A video-to-IAT model is a sequence model that, given a surgical video segment, predicts a set of IAT triplets per relevant time interval: $\mathrm{IAT} = \langle I,\,A,\,T\rangle,\quad I\in\mathcal{I},\ A\in\mathcal{A},\ T\in\mathcal{T}$, where $\mathcal{I}$ is a finite set of canonical instrument labels, $\mathcal{A}$ is a finite set of normalized action (verb) categories, and $\mathcal{T}$ is a restricted set of anatomic or procedural tissue target classes. Any axis component may be "NONE" if unmentioned in the ground truth. The mapping is typically performed on temporal windows (e.g., 10 s) to accommodate action transitions and alignment with manual or automatic feedback.
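The triplet schema above can be sketched as a minimal data type. The label names below are illustrative placeholders; the actual 7/21/11 class vocabularies are defined by the mined ontology described in Section 2.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative subsets of the canonical label sets; the real ontology
# defines 7 instrument, 21 action, and 11 tissue classes.
INSTRUMENTS = {"energy_device", "left_hand", "needle_driver"}
ACTIONS = {"coagulate", "apply_traction", "dissect"}
TISSUES = {"major_veins", "bladder", "peritoneum"}

@dataclass(frozen=True)
class IAT:
    """One Instrument-Action-Target triplet; None encodes the 'NONE' axis."""
    instrument: Optional[str]
    action: Optional[str]
    target: Optional[str]

    def is_valid(self) -> bool:
        def ok(value, vocab):
            # An axis is valid if unmentioned (None) or in its vocabulary.
            return value is None or value in vocab
        return (ok(self.instrument, INSTRUMENTS)
                and ok(self.action, ACTIONS)
                and ok(self.target, TISSUES))

triplet = IAT("energy_device", "coagulate", "major_veins")
print(triplet.is_valid())  # → True
```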

The IAT ontology developed for these models results from mining and normalization of real trainer-to-trainee speech in robot-assisted surgeries. After iterative clustering, the reported label sets are: $|\mathcal{I}| = 7$ instrument clusters, $|\mathcal{A}| = 21$ action clusters, and $|\mathcal{T}| = 11$ tissue clusters (Nasriddinov et al., 19 Nov 2025).

The model's output is used for both direct process analysis (structure-aware reasoning, content retrieval, skill assessment) and for downstream conditioning of LLMs to generate procedural or pedagogical feedback (Nasriddinov et al., 19 Nov 2025).

2. Ontology Construction and IAT Normalization

The IAT triplet schema is derived via a multi-stage procedure:

  1. Raw Mention Extraction: Trainer feedback lines are parsed by LLMs (e.g., GPT-4o) to extract candidate instrument, action, and tissue noun/verb phrases; components may be null if omitted in speech (Nasriddinov et al., 19 Nov 2025).
  2. LLM-based Clustering: Extracted mentions are grouped into semantically coherent clusters (fine-grained), followed by higher-level meta-clusters suited for model training. This results in canonical axes (e.g., "buzz," "burn" → coagulate) (Nasriddinov et al., 19 Nov 2025).
  3. Pruning: Low-support categories are removed to ensure robust classification; a cluster is retained only if its support exceeds a per-axis threshold ($N_{\text{instr}} > 29$, $N_{\text{action}} > 9$, $N_{\text{tissue}} > 24$) (Nasriddinov et al., 19 Nov 2025).
  4. Mapping: A deterministic dictionary-based mapping normalizes all incoming labels to the canonical set (Nasriddinov et al., 19 Nov 2025).

The resulting ontology is thus a structured subset $M \subseteq \mathcal{I} \times \mathcal{A} \times \mathcal{T}$, representing the observed and clinically salient triplet space.
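The deterministic dictionary mapping of step 4 can be sketched as follows. The dictionary entries are illustrative, mirroring the "buzz"/"burn" → coagulate clustering example above; the real mapping covers all three axes.

```python
# Hypothetical normalization dictionary for the action axis; entries
# mirror the "buzz"/"burn" -> coagulate example from the clustering step.
ACTION_MAP = {"buzz": "coagulate", "burn": "coagulate", "pull up": "apply_traction"}

def normalize(raw_mention, mapping, default="NONE"):
    """Deterministically map a raw mention to its canonical class;
    unseen mentions fall back to the 'NONE' label."""
    return mapping.get(raw_mention.strip().lower(), default)

print(normalize("Buzz", ACTION_MAP))  # → coagulate
```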

3. Model Architecture and Training Methodology

Video-to-IAT models are typically implemented as three multi-class classification heads (one per axis: instrument, action, tissue) trained on synchronized video and IAT annotation pairs.
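A minimal sketch of this three-head structure, assuming a 128-dimensional video feature vector; the random weights are placeholders standing in for a trained backbone and linear heads, and only the head widths (7 / 21 / 11) follow the reported label counts.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class ThreeHeadIAT:
    """Shared video embedding feeding three independent classification heads.

    Head widths follow the reported label counts (7 / 21 / 11); the random
    weights are placeholders for a trained video backbone and linear heads."""
    def __init__(self, feat_dim=128):
        self.heads = {axis: rng.standard_normal((feat_dim, n))
                      for axis, n in [("instrument", 7), ("action", 21), ("tissue", 11)]}

    def predict(self, features):
        # One softmax distribution per axis, predicted independently.
        return {axis: softmax(features @ W) for axis, W in self.heads.items()}

probs = ThreeHeadIAT().predict(np.zeros(128))
```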

Key methodologies include:

  • Context Injection: Incorporating procedure and local task information as side-inputs provides measurable AUC improvements (e.g., Instrument: 0.67 → 0.74, Tissue: 0.74 → 0.79) (Nasriddinov et al., 19 Nov 2025).
  • Temporal Tracking: Exploiting instrument motion (e.g., via temporally aware CNNs or transformers over video cliplets at 5 fps) further increases recognition quality (Nasriddinov et al., 19 Nov 2025).
  • Relation to Prior Schemas: Compared to prior action triplet works (e.g., Rendezvous, which uses ⟨instrument, verb, tissue⟩ on a fixed CholecT50 ontology) (Nwoye et al., 2021), the IAT schema is directly grounded in real-world trainer–trainee interactions and is clustered for feedback realism.

Supervision is performed only for observed IATs in the training corpus; the axes are trained independently through their respective heads, but the downstream triplet space is pruned according to $M$ (Nasriddinov et al., 19 Nov 2025).
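This pruning step can be sketched as follows: the heads score each axis independently, and the joint prediction is then restricted to a hypothetical observed set $M$ (the two triplets below are illustrative).

```python
# Hypothetical observed triplet space M: only instrument-action-tissue
# combinations seen in the training corpus are admissible at inference.
M = {("energy_device", "coagulate", "major_veins"),
     ("left_hand", "apply_traction", "bladder")}

def best_triplet(p_instr, p_action, p_tissue):
    """Score triplets by the product of independent per-axis probabilities,
    restricted to the observed set M."""
    return max(M, key=lambda t: p_instr[t[0]] * p_action[t[1]] * p_tissue[t[2]])

p_i = {"energy_device": 0.9, "left_hand": 0.1}
p_a = {"coagulate": 0.7, "apply_traction": 0.3}
p_t = {"major_veins": 0.6, "bladder": 0.4}
print(best_triplet(p_i, p_a, p_t))  # → ('energy_device', 'coagulate', 'major_veins')
```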

4. Evaluation Metrics and Empirical Performance

Performance is primarily measured per-axis using area under the ROC curve (AUC): $\mathrm{AUC} = \int_0^1 \mathrm{TPR}(u)\, d\mathrm{FPR}(u)$, where $\mathrm{TPR}(u)$ and $\mathrm{FPR}(u)$ are the true- and false-positive rates at threshold $u$.
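As a sketch, the per-axis AUC can be computed directly from labels and scores via the equivalent rank statistic (the probability that a randomly chosen positive outscores a randomly chosen negative, with ties counted half):

```python
import numpy as np

def auc(labels, scores):
    """AUC computed as the rank statistic, which equals the TPR/FPR
    integral: P(score of a random positive > score of a random negative),
    counting ties as half a win."""
    labels, scores = np.asarray(labels), np.asarray(scores)
    pos, neg = scores[labels == 1], scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

print(auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # perfectly separated → 1.0
```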

With context and temporal features, observed gains are:

  • Instrument AUC: 0.67 → 0.74
  • Action AUC: 0.60 → 0.63
  • Tissue AUC: 0.74 → 0.79

For feedback generation (IAT → text), LLM-judged fidelity improves with IAT conditioning: the mean rubric score rises from 2.17 (video-only) to 2.44 (+12.4%), and the proportion of admissible (score ≥ 3) generations increases from 21% to 42%. Word error rate (WER) drops by 15–31%, and ROUGE recall increases by 9–64% (Nasriddinov et al., 19 Nov 2025).

5. Illustrative Pipeline: From Video to Feedback via IAT

A comprehensive pipeline is realized as follows:

  1. Extract Video Segment: A temporal window of surgical video is sampled, co-registered to trainer feedback.
  2. Video-to-IAT Prediction: The model predicts $(I, A, T)$ triplets for the window.
  3. Normalization: Model outputs are mapped to canonical classes using the learned ontology dictionaries.
  4. Feedback Generation: Structured IAT triplets condition an LLM (e.g., GPT-4o) to yield clinically relevant, context-grounded textual feedback.
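The four steps above can be sketched end-to-end. Here `predict_iat` is a stub standing in for the trained video model, the mapping dictionaries are illustrative (mirroring Example 1 below), and a template replaces the conditioned LLM call.

```python
def predict_iat(video_window):
    # Step 2 placeholder: a trained video-to-IAT model would run here.
    return {"instrument": "buzz", "action": "buzz", "target": "bleeder"}

def normalize_iat(raw):
    # Step 3: deterministic dictionary mapping to canonical classes
    # (entries are illustrative, mirroring Example 1).
    maps = {"instrument": {"buzz": "energy_device"},
            "action": {"buzz": "coagulate"},
            "target": {"bleeder": "major_veins"}}
    return {axis: maps[axis].get(value, "NONE") for axis, value in raw.items()}

def generate_feedback(iat):
    # Step 4 placeholder: the real system conditions an LLM on the triplet;
    # a fixed template stands in here.
    return f"Use the {iat['instrument']} to {iat['action']} the {iat['target']}."

iat = normalize_iat(predict_iat(video_window=None))
print(generate_feedback(iat))
```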

Example 1:

  • Video region: instrument applies energy to a vein.
  • Extracted (raw): I = "buzz", A = "buzz", T = "bleeder"
  • Normalized: I = energy_device, A = coagulate, T = major_veins
  • Feedback: "Apply controlled energy to the highlighted vein to stop bleeding, ensuring you move slowly to avoid collateral damage." (Nasriddinov et al., 19 Nov 2025)

Example 2:

  • Video region: left hand retracts peritoneum.
  • Extracted: I = "left hand", A = "pull up", T = "peritoneum"
  • Normalized: I = left_hand, A = apply_traction, T = bladder (procedure-specific clustering)
  • Feedback: "Use your left hand to provide firm, steady retraction on the peritoneum, improving exposure of the bladder neck." (Nasriddinov et al., 19 Nov 2025)

6. Ontological Context and Position within Surgical Data Science

The video-to-IAT paradigm is part of a broader shift toward ontologically structured, machine-actionable representations in surgical AI. Ontologies such as the mid-level Surgical Data & Algorithm Ontology (Katić et al., 2017), motion-primitive-based taxonomies (e.g., the COMPASS finite-state machine of contexts and motion primitives) (Hutchinson et al., 2022), and action triplet schemas (CholecT50) (Nwoye et al., 2021) provide mathematical and relational underpinnings for action detection, process modeling, and autonomous feedback systems.

Notably, IAT is directly derived from the language of expert feedback, enabling alignment with natural-language processing and transparent, auditable justifications for automated instruction (Nasriddinov et al., 19 Nov 2025). This supports the rigorous evaluation of automated feedback's clinical validity, going beyond surface-level text metrics to domain-aware critique.

A plausible implication is that as these models and their ontologies become more widely adopted, interoperable, and standardized (potentially using formal OWL/RDF encodings), they will facilitate not only robust intraoperative guidance and assessment but also semantic search, comparison, and reasoning across surgical datasets and algorithmic workflows.
