Surgical Action Triplet
- Surgical action triplet is a structured representation defined as (tool, action, target) that captures detailed surgical interactions.
- The model leverages graph-based representations and expert annotation to link tools, actions, and anatomical targets, enhancing context-aware decision support.
- Empirical results on Endoscapes-SG201 show that incorporating explicit action edges and hand identity improves recognition performance significantly.
A surgical action triplet is a structured representation used to model the fine-grained interactions taking place within a surgical scene. Formally, a surgical action triplet is defined as an ordered tuple (tool, action, target) that specifies which instrument (tool) is performing what kind of manipulation (action) on which anatomical structure (target). This schema provides a minimal yet semantically dense account of tool–tissue interaction at the frame or instance level, addressing the need for context-aware understanding and downstream decision support in computer-assisted interventions and surgical AI (Shin et al., 21 Jul 2025).
1. Formal Definition and Ontological Structure
A surgical action triplet is expressed as (tool, action, target), with domains:
- tool ∈ {Hook, Grasper, Clipper, Bipolar, Irrigator, Scissors}
- action ∈ {Dissect, Retract, Grasp, Clip, Coagulate, Null_verb}
- target ∈ {Anatomy₁, …, Anatomy₅}
Given sets of detected tools $\mathcal{T}$, actions $\mathcal{V}$, and anatomical targets $\mathcal{A}$, the space of possible triplets is $\mathcal{T} \times \mathcal{V} \times \mathcal{A}$.
Each object node in scene graph representations is described as $o_i = (b_i, c_i, f_i)$, where $b_i$ indicates the bounding box, $c_i$ the class probability, and $f_i$ the feature vector derived via RoIAlign.
Hand identity ∈ {Rt, Lt, Assi} is often included to further specify which agent (right/left hand of the surgeon, or the assistant) manipulates each tool, which is crucial for disambiguating tool usage contexts (Shin et al., 21 Jul 2025).
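The triplet space above can be made concrete with a short sketch. The class lists mirror the taxonomy in this section; the plain-ASCII target names (`Anatomy1`, …) are a notational convenience, not the dataset's identifiers.

```python
from itertools import product

# Class vocabularies from the ontological definition above.
TOOLS = ["Hook", "Grasper", "Clipper", "Bipolar", "Irrigator", "Scissors"]
ACTIONS = ["Dissect", "Retract", "Grasp", "Clip", "Coagulate", "Null_verb"]
TARGETS = [f"Anatomy{i}" for i in range(1, 6)]  # Anatomy1 .. Anatomy5
HANDS = ["Rt", "Lt", "Assi"]                    # optional agent attribute

# The full triplet space T x V x A: every (tool, action, target) combination.
triplet_space = list(product(TOOLS, ACTIONS, TARGETS))
```

Enumerating the product makes the combinatorics explicit: 6 tools × 6 actions × 5 targets yields 180 candidate triplets per frame, of which only the annotated ones are valid.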
2. Dataset Annotation Protocols
The Endoscapes-SG201 dataset exemplifies high-fidelity annotation for action triplets:
- Refinement of bounding boxes from broader datasets (e.g., Endoscapes-BBox201) to accurately localize tools and anatomical structures.
- Subdivision of generic tool classes into the canonical fine-grained set of six instruments.
- For each tool–anatomy pair within a frame, assignment of one action label from the set of six or "Null_verb" to encode explicit absence of interaction.
- Explicit annotation of hand identity: Rt (surgeon’s right), Lt (surgeon’s left), or Assi (assistant).
- Iterative expert review to ensure label consistency across frames (Shin et al., 21 Jul 2025).
Table: Distributional summary from Endoscapes-SG201
| Statistic | Value |
|---|---|
| Annotated frames | 1,933 (train 1,212/val 409/test 312) |
| Tool instances (total) | Hook 1,060, Grasper 1,598, Clipper 219, Bipolar 142, Irrigator 69, Scissors 4 |
| Action triplets | Train 916, Val 349, Test 208 |
| Action labels (total) | Dissect 916, Retract 1,429, Grasp 95, Clip 207, Coagulate 60, Null_verb 383 |
| Hand identity (total) | Rt 1,543, Lt 1,328, Assi 221 |
The annotation protocol ensures not just the detection of entities but the precise modeling of their role, context, and agent.
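A single annotated interaction under this protocol can be sketched as a record like the following; the field names are illustrative assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical record for one tool-anatomy interaction, mirroring the
# Endoscapes-SG201 protocol above: refined boxes, a fine-grained tool class,
# an action label (or "Null_verb"), and the operating hand.
@dataclass
class TripletAnnotation:
    frame_id: int
    tool: str         # one of the six instrument classes
    tool_box: tuple   # (x1, y1, x2, y2) refined bounding box
    target: str       # anatomical structure class
    target_box: tuple
    action: str       # one of six actions, or "Null_verb" for no interaction
    hand: str         # "Rt", "Lt", or "Assi"

ann = TripletAnnotation(
    frame_id=0,
    tool="Grasper", tool_box=(10, 20, 110, 220),
    target="Anatomy1", target_box=(90, 40, 300, 260),
    action="Retract", hand="Lt",
)
```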
3. Graph-Based Representation of Action Triplets
Surgical action triplets can be modeled as explicit structures within scene graphs. Within SSG-Com (Shin et al., 21 Jul 2025):
- Node types: detected tools and anatomical structures
- Edge types:
  1. Spatial edges (relative positions: left/right, above/below, inside/outside)
  2. Surgical action edges (SAE): an edge connects a tool node and an anatomy node iff an annotated action links that tool to that anatomy
- Adjacency matrix: $A_{ij} = 1$ if a spatial relation or surgical action edge holds between nodes $i$ and $j$; $0$ otherwise
- Features: node features from RoIAlign; edge features from RoIAlign on the union of the two bounding boxes
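The spatial-relation and adjacency construction above can be sketched as follows; this is a minimal interpretation assuming axis-aligned boxes `(x1, y1, x2, y2)` and relations derived from box centers (the paper's exact relation tests may differ).

```python
import numpy as np

def spatial_relation(box_a, box_b):
    """Coarse left/right and above/below relation between two box centers."""
    cxa, cya = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cxb, cyb = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    horiz = "left" if cxa < cxb else "right"
    vert = "above" if cya < cyb else "below"   # image coords: smaller y is higher
    return horiz, vert

def build_adjacency(n_nodes, spatial_pairs, action_pairs):
    """A_ij = 1 if a spatial or surgical action edge links nodes i and j."""
    A = np.zeros((n_nodes, n_nodes))
    for i, j in list(spatial_pairs) + list(action_pairs):
        A[i, j] = A[j, i] = 1.0
    return A

# Three nodes: one spatial edge (0,1) and one action edge (0,2).
A = build_adjacency(3, spatial_pairs=[(0, 1)], action_pairs=[(0, 2)])
```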
Message passing is performed with graph convolutional operations:

$$H^{(l+1)} = \sigma\!\left(D^{-1/2}\, A\, D^{-1/2}\, H^{(l)}\, W^{(l)}\right)$$

where $H^{(l)}$ contains node features, $D$ is the degree matrix, $A$ the adjacency matrix encoding spatial and action relations, $W^{(l)}$ the learnable weights, and $\sigma$ the ReLU activation.
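One symmetrically normalized graph-convolution step can be sketched as below; the added self-loops and the random initialization of $W$ are assumptions for illustration, not details given in the text.

```python
import numpy as np

def gcn_layer(H, A, W):
    """One GCN update: ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])                      # add self-loops (assumption)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    H_next = D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W    # normalized message passing
    return np.maximum(H_next, 0.0)                      # ReLU activation

rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))    # 3 nodes with 4-dim (stand-in RoIAlign) features
A = np.array([[0, 1, 0],       # path graph: 0-1-2
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
W = rng.normal(size=(4, 4))    # learnable weights (randomly initialized here)
H1 = gcn_layer(H, A, W)
```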
Classification heads enable:
- Action edge prediction via softmax over 6 action classes
- Hand identity prediction via softmax over {Rt, Lt, Assi}
- Triplet recognition by exhaustively enumerating valid triplets, classifying combined tool–anatomy–action edge features
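The exhaustive triplet enumeration can be sketched as follows. The concatenation of tool, anatomy, and union-box edge features, and the untrained linear head scoring six action classes, are illustrative assumptions about how the combined features are classified.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def enumerate_triplets(tool_feats, anat_feats, edge_feat_fn, W_head):
    """Score every tool-anatomy pair; return (tool_idx, anat_idx, action, score)."""
    triplets = []
    for ti, tf in enumerate(tool_feats):
        for ai, af in enumerate(anat_feats):
            x = np.concatenate([tf, af, edge_feat_fn(ti, ai)])  # combined features
            probs = softmax(W_head @ x)                         # 6 action classes
            triplets.append((ti, ai, int(probs.argmax()), float(probs.max())))
    return triplets

rng = np.random.default_rng(1)
tools = [rng.normal(size=4) for _ in range(2)]   # 2 detected tools
anats = [rng.normal(size=4) for _ in range(2)]   # 2 detected anatomies
edge = lambda i, j: np.zeros(4)                  # placeholder union-box feature
W = rng.normal(size=(6, 12))                     # untrained action head
preds = enumerate_triplets(tools, anats, edge, W)
```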
4. Model Training and Loss Functions
The complete model is trained via a composite loss:
- Graph-structural losses (Murali et al.):
- Detection: standard Faster R-CNN loss (classification + box regression)
- Edge existence: BCE for meaningful edge prediction
- Spatial: CE for spatial relation classification
- Task losses: CE for surgical action edge classification and for hand-identity classification
The total loss is

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{det}} + \mathcal{L}_{\text{edge}} + \mathcal{L}_{\text{spatial}} + \lambda_{\text{act}}\,\mathcal{L}_{\text{act}} + \lambda_{\text{hand}}\,\mathcal{L}_{\text{hand}}$$

with weights $\lambda_{\text{act}}$ and $\lambda_{\text{hand}}$ balancing the action-edge and hand-identity terms.
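The composite loss can be sketched as below, assuming per-edge binary cross-entropy for edge existence and cross-entropy for the spatial, action, and hand-identity heads; the lambda weights default to 1.0 here as placeholders, since the paper's values are not reproduced in this section.

```python
import numpy as np

def bce(p, y):
    """Binary cross-entropy for edge-existence probabilities."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    return float(-(y * np.log(p) + (1 - y) * np.log(1 - p)).mean())

def ce(logits, labels):
    """Cross-entropy over class logits, shape (n_samples, n_classes)."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-logp[np.arange(len(labels)), labels].mean())

def total_loss(l_det, edge_p, edge_y, spa_logits, spa_y,
               act_logits, act_y, hand_logits, hand_y,
               lam_act=1.0, lam_hand=1.0):
    """L_total = L_det + L_edge + L_spatial + lam_act*L_act + lam_hand*L_hand."""
    return (l_det + bce(edge_p, edge_y) + ce(spa_logits, spa_y)
            + lam_act * ce(act_logits, act_y)
            + lam_hand * ce(hand_logits, hand_y))

loss = total_loss(
    l_det=0.5,
    edge_p=np.array([0.9, 0.2]), edge_y=np.array([1.0, 0.0]),
    spa_logits=np.zeros((2, 6)), spa_y=np.array([0, 1]),
    act_logits=np.zeros((2, 6)), act_y=np.array([2, 3]),
    hand_logits=np.zeros((2, 3)), hand_y=np.array([0, 1]),
)
```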
Training proceeds for 50 epochs per stage, using default Faster R-CNN optimizer settings (weight decay, learning rate, and per-GPU batch size at their framework defaults), on NVIDIA RTX 3090 hardware.
5. Quantitative Performance and Empirical Findings
On the Endoscapes-SG201 dataset (Shin et al., 21 Jul 2025):
| Model | Graph Components | mAP |
|---|---|---|
| ResNet50-DetInit | None | 9.7 |
| LG-CVS (spatial only) | Yes (spatial) | 18.0 |
| SSG-Com (spa + act) | Spatial + action | 23.5 |
| SSG-Com (spa + act + hand) | + hand role | 24.2 |
Incorporating explicit action edges yields a +5.5 mAP increase over spatial-only graphs, and annotating hand identity adds an additional +0.7 mAP. The factorized graph structure (tools as nodes, tool–action–target as typed edges, hand as node attributes) systematically outperforms both naïve image-based recognition and spatial-only graph models.
The explicit modeling of which tool performs which action on which anatomy, and which agent operates the tool, is shown to be a decisive factor in achieving high-precision recognition, especially in the context of downstream tasks (e.g., critical view of safety assessment).
6. Modeling Considerations and Broader Implications
The SSG-Com framework demonstrates that surgical activity understanding benefits from a holistic graph embedding that makes triplet associations and operator roles first-class components. Key observations:
- Data representation: Segregation of component entities (tools/anatomy) as nodes, and contextual relations (action, spatial, hand) as edges, supports compositional reasoning and modular graph updates.
- Association mechanism: Triplet recognition accuracy is tightly coupled to the explicit encoding of action edges—models lacking this capacity exhibit severe confusion and poor recall for complex scenes.
- Human operator context: Hand identity annotation differentiates otherwise ambiguous tool usage, critical for workflow audit and skill assessment.
- Dataset dependency: Performance is sensitive to the granularity and consistency of bounding-box and interaction annotations; systematic, expert-led refinement is indispensable.
7. Future Directions and Open Issues
Despite substantial improvements, SSG-Com’s top mAP of 24.2 indicates persistent challenges:
- Interaction complexity: Surgical fields with overlapping instruments, uncommon tool–action–target combinations, or rapid exchanges between agents remain underexplored.
- Spatial grounding: The method leverages bounding-box annotations for tools and anatomy but does not explicitly handle pixel-level segmentation; finer spatial grounding would benefit from integration with state-of-the-art segmentation architectures.
- Temporal consistency: SSG-Com operates at the frame level; coherent action-recognition across sequences, or temporal scene graphs, may further advance the field.
- Annotation burden: High-quality action triplet and hand identity annotation is resource-intensive; automated or semi-automated labeling tools are required for scalability.
The graph-based triplet paradigm is central to modern surgical scene understanding, providing a robust substrate for downstream clinical analytics, skill evaluation, and intelligent robotic assistance (Shin et al., 21 Jul 2025).