Instrument Segmentation in Surgical Imaging
- Instrument segmentation is the pixel-level delineation of surgical tools in medical images, critically supporting robotic procedures and augmented reality overlays.
- Techniques leverage encoder-decoder networks, transformers, and promptable vision-language models to enhance accuracy and real-time performance.
- Progress is bolstered by annotated benchmarks like EndoVis and metrics such as IoU and Dice, which validate robustness amid domain shifts and occlusions.
Instrument segmentation is the computational task of delineating and labeling surgical instruments at the pixel (or voxel) level within medical images or video sequences acquired during minimally invasive or robot-assisted procedures. Accurate instrument segmentation is fundamental for enabling context-aware robotic surgery, real-time augmented reality overlays, surgical workflow analysis, pose estimation, and intraoperative guidance systems. The field encompasses semantic, instance, and part-level segmentation—each with distinct labeling and application requirements—and has accelerated rapidly with the advent of annotated benchmarks, deep learning, and transformer architectures.
1. Problem Definition and Task Taxonomy
Instrument segmentation encompasses several established sub-tasks, each defined by the target semantic granularity:
- Binary segmentation: Each pixel is classified as either instrument or background (anatomy, surgical debris, etc.) (Allan et al., 2019).
- Parts segmentation: Within instrument-occupied pixels, further assign component labels (e.g., shaft, wrist, jaws/clasper) (Allan et al., 2019).
- Type/class segmentation: Instrument pixels are labeled according to specific device classes (such as monopolar scissors, bipolar forceps, needle driver, etc.) (Allan et al., 2019).
- Instance segmentation: Each physically distinct instrument, even among the same class, is assigned a unique mask (Ross et al., 2020, GonzĂ¡lez et al., 2020).
- Referring segmentation and prompt-based segmentation: Pixels are masked based on natural language or class prompts, enabling interactive or open-vocabulary tool identification (Zhou et al., 2023, Yue et al., 2023, Yue et al., 2023, Wang et al., 2023).
The primary input modalities are monocular or stereo RGB endoscopic frames; 3D modalities (ultrasound, CT) are active research areas (Yang et al., 2021). Common output representations are binary or multi-class raster masks, often at full video resolution.
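To make these output conventions concrete, here is a minimal sketch of the array shapes and label maps involved; the frame size, class indices, and instance count are illustrative assumptions, not any benchmark's actual encoding.

```python
import numpy as np

H, W = 512, 640  # illustrative endoscopic frame resolution

# Binary segmentation: 0 = background, 1 = instrument.
binary_mask = np.zeros((H, W), dtype=np.uint8)

# Parts segmentation: finer labels inside instrument pixels,
# e.g. 0 = background, 1 = shaft, 2 = wrist, 3 = jaws/clasper.
parts_mask = np.zeros((H, W), dtype=np.uint8)

# Type segmentation: one index per device class,
# e.g. 0 = background, 1 = bipolar forceps, 2 = needle driver, ...
type_mask = np.zeros((H, W), dtype=np.uint8)

# Instance segmentation: one binary mask per physical instrument, so two
# tools of the same class still receive distinct masks.
num_instances = 2
instance_masks = np.zeros((num_instances, H, W), dtype=bool)

# A binary mask is always recoverable from a finer-grained labeling:
assert (binary_mask == (parts_mask > 0).astype(np.uint8)).all()
```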
2. Datasets, Annotation Protocols, and Benchmarks
Progress in instrument segmentation has been propelled by several annotated, publicly released datasets and competitive challenges:
- EndoVis 2015/2017/2018/2019 (MICCAI): The Endoscopic Vision (EndoVis) Robotic Instrument Segmentation Challenges have defined reference problems and splits:
- 2017: 10 da Vinci Xi stereo videos (porcine), 1,800 annotated training frames, 1,200 test frames; pixel-wise masks for binary, part, and type segmentation with hand-annotated ground truth (Allan et al., 2019).
- 2018: Extended to full scene segmentation—tissue classes, clips, and more anatomical diversity (Rueckert et al., 2023).
- 2019 (ROBUST-MIS): 10,040 annotated frames from 30 human laparoscopic procedures, with explicit focus on robustness and cross-domain generalization; multi-instance and multi-task benchmarks (Ross et al., 2020).
- Synthetic datasets: Rendered and style-transferred laparoscopic scenes enable sim-to-real transfer and annotation-efficient training (Sahu et al., 2020).
- Other modalities: Datasets in 3D ultrasound (US) and ex-vivo robot kinematic environments have expanded the landscape (Yang et al., 2021).
Annotation protocols in top-tier benchmarks combine structured labeling workflows (polygon tool annotation, per-instance masks, consensus curation) and periodic quality control. Challenges for annotation include motion blur, occlusions, small or transparent tool parts, and the lack of polygon "hole" encoding for elements such as clasper apertures (Allan et al., 2019, Ross et al., 2020).
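One common step in these workflows is rasterizing polygon annotations into per-instance masks and then reconciling multiple annotators. The sketch below illustrates this with a naive intersection consensus; the coordinates, consensus rule, and helper polygon_to_mask are hypothetical stand-ins for real curation tooling.

```python
import numpy as np
from PIL import Image, ImageDraw

def polygon_to_mask(polygon, height, width):
    """Rasterize one annotated polygon (list of (x, y) points) into a bool mask."""
    canvas = Image.new("1", (width, height), 0)
    ImageDraw.Draw(canvas).polygon(polygon, outline=1, fill=1)
    return np.array(canvas, dtype=bool)

# Two annotators outline the same instrument; a simple consensus keeps only
# the pixels both agree on (real curation pipelines are more involved).
ann_a = polygon_to_mask([(10, 10), (200, 12), (190, 80), (15, 75)], 512, 640)
ann_b = polygon_to_mask([(12, 11), (198, 14), (188, 82), (14, 77)], 512, 640)
consensus = ann_a & ann_b

# Caveat noted above: plain polygon filling cannot encode holes (e.g. clasper
# apertures); a ring-shaped region needs explicit subtraction of an inner polygon.
```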
3. Methodological Evolution and Key Architectures
Encoder–decoder convolutional networks with skip connections (U-Net, TernausNet, ResNet variants) formed the early foundation for high-quality instrument segmentation (Shvets et al., 2018, Allan et al., 2019, Xia et al., 2023); a minimal sketch of this design appears after the list below. Subsequent innovations include:
- Pretrained encoders: Adopting VGG or ResNet backbones trained on ImageNet boosts low-data performance (Allan et al., 2019, Pakhomov et al., 2020, Shvets et al., 2018).
- Lightweight, efficient architectures: Pruned and dilated ResNets achieve real-time performance (up to 125 FPS at megapixel scale) with minimal accuracy trade-off (Pakhomov et al., 2020).
- Temporal and motion-based priors: Explicit exploitation of motion flow (optical flow warping, temporal priors) enhances segmentation stability under occlusion, tool motion, and ambiguous video frames (Jin et al., 2019, Zhao et al., 2021).
- Instance-aware architectures: Mask R-CNN and related region-based methods ensure physically consistent masks per instrument and enable robust instance segmentation (GonzĂ¡lez et al., 2020, Ross et al., 2020).
- Transformers and query-based decoders: Modern masked-attention transformers (e.g., Mask2Former, Swin, and video transformers such as TAPIS or MViT) improve spatial and spatio-temporal context modeling, especially for instance and multi-task segmentation (Ayobi et al., 2023, Wang et al., 2024).
- Promptable and vision-language approaches: Foundation models like SAM and CLIP have been adapted to surgical domains via prompt encoders, contrastive prototype learning, and hybrid vision-language pipelines, enabling class-promptable and text-driven segmentation (Yue et al., 2023, Zhou et al., 2023, Yue et al., 2023). Part-level collaborative prompting sets new benchmarks for structure-aware segmentation (Yue et al., 2023).
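As a concrete reference for the encoder-decoder design that opened this section, here is a minimal U-Net-style model in PyTorch. Channel widths, depth, and the class name TinyUNet are assumptions for illustration; the point is the skip-connection pattern, not any specific published architecture.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Two 3x3 convolutions with batch norm and ReLU, as in U-Net-style encoders.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

class TinyUNet(nn.Module):
    def __init__(self, n_classes=1):
        super().__init__()
        self.enc1, self.enc2 = conv_block(3, 32), conv_block(32, 64)
        self.bottleneck = conv_block(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.dec2 = conv_block(128, 64)   # 64 skip channels + 64 upsampled
        self.up1 = nn.ConvTranspose2d(64, 32, 2, stride=2)
        self.dec1 = conv_block(64, 32)    # 32 skip channels + 32 upsampled
        self.head = nn.Conv2d(32, n_classes, 1)

    def forward(self, x):
        s1 = self.enc1(x)                   # full resolution
        s2 = self.enc2(self.pool(s1))       # 1/2 resolution
        b = self.bottleneck(self.pool(s2))  # 1/4 resolution
        d2 = self.dec2(torch.cat([self.up2(b), s2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), s1], dim=1))  # skip connection
        return self.head(d1)                # per-pixel logits

logits = TinyUNet()(torch.randn(1, 3, 256, 320))  # -> (1, 1, 256, 320)
```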
4. Evaluation Protocols and Empirical Benchmarks
Standardized metrics allow robust cross-method comparison:
- Intersection-over-Union (IoU) / Jaccard index: for a predicted mask $P$ and ground-truth mask $G$, $\mathrm{IoU}(P, G) = \frac{|P \cap G|}{|P \cup G|}$.
- Dice Similarity Coefficient: $\mathrm{Dice}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}$.
- Mean/macro-IoU, mean class IoU, Normalized Surface Dice, and mAP: for multi-instance or detection tasks, predictions are first matched to ground-truth instances via the Hungarian algorithm (Ross et al., 2020, GonzĂ¡lez et al., 2020); the per-pixel metrics and this matching step are sketched after this list.
- Performance ranges: State-of-the-art approaches achieve mean IoU ≈ 0.88–0.89 for binary/part segmentation on EndoVis 2017, with a notable drop for instrument-type segmentation (best ≈ 0.54 mean IoU) (Allan et al., 2019, Xia et al., 2023). Instance-level mask mAP reaches 0.60–0.75 for multi-instance tasks (Ross et al., 2020).
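A minimal implementation of these metrics, plus Hungarian matching of predicted to ground-truth instances, is sketched below assuming boolean mask arrays. Benchmarks differ in matching costs and empty-mask conventions, so treat the details as illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(pred, gt):
    # Jaccard index on boolean masks.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0  # both empty: count as perfect

def dice(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    total = pred.sum() + gt.sum()
    return 2 * inter / total if total else 1.0

def match_instances(pred_masks, gt_masks):
    """Match predicted to ground-truth instances by maximizing total IoU."""
    cost = np.array([[-iou(p, g) for g in gt_masks] for p in pred_masks])
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return [(r, c, -cost[r, c]) for r, c in zip(rows, cols)]
```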
Temporal fusion (transformers, motion-flow priors) consistently improves video sequence performance, yielding both higher accuracy and more robust predictions under dynamic surgical scenes (Ayobi et al., 2023, Jin et al., 2019).
5. Domain Adaptation, Unsupervised, and Semi-supervised Methods
Given high annotation costs and domain variability (hospital, instrument, lighting, anatomy), research has increasingly targeted annotation-efficient and domain-general instrument segmentation:
- Meta-learning and online adaptation: MDAL adapts to new domains using only the first annotated frame of a video together with gradient-gated online pseudo-labeling, achieving high IoU and outperforming non-adapted and flow-based adaptation baselines (Zhao et al., 2021).
- Consistency learning: Endo-Sim2Real bridges simulated-to-real transfer via consistency constraints (cross-entropy + Jaccard) between predictions under different augmentations, eliminating the need for real annotations and matching Cycle-GAN approaches in performance (Sahu et al., 2020); the core loss is sketched below.
- Unsupervised segmentation: Anchor generation (color/objectness/location cues) and semantic diffusion constraints enable competitive binary segmentation in the total absence of labels, achieving ≈0.71 IoU / 0.81 Dice on EndoVis 2017 (Liu et al., 2020).
- Semi-supervised learning in 3D US: Dual-UNet with hybrid uncertainty and contextual constraints enables high performance (Dice ≈69%) using only a fraction of the fully labeled volumes (Yang et al., 2021).
These methods close much of the gap to fully supervised performance, especially under domain and data-limited scenarios.
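The consistency idea behind Endo-Sim2Real can be sketched as follows: two augmented views of the same unlabeled frame are pushed toward agreeing predictions with a cross-entropy plus soft-Jaccard objective. The specific augmentations, the teacher/student split, and the equal weighting here are simplifying assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def soft_jaccard_loss(probs, target, eps=1e-6):
    # 1 - soft IoU between predicted probabilities and a (soft) target mask.
    inter = (probs * target).sum()
    union = probs.sum() + target.sum() - inter
    return 1.0 - (inter + eps) / (union + eps)

def consistency_loss(model, frame):
    # Two stochastic views of the same unlabeled frame (illustrative augmentations).
    view_a = frame + 0.05 * torch.randn_like(frame)   # additive noise
    view_b = torch.flip(frame, dims=[-1])             # horizontal flip

    probs_a = torch.sigmoid(model(view_a))
    with torch.no_grad():                             # one branch acts as teacher
        probs_b = torch.flip(torch.sigmoid(model(view_b)), dims=[-1])  # undo flip

    # Cross-entropy + Jaccard consistency between the two predictions.
    return F.binary_cross_entropy(probs_a, probs_b) + soft_jaccard_loss(probs_a, probs_b)
```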
6. Robustness, Generalization, and Outstanding Challenges
Robust deployment of instrument segmentation faces several documented obstacles:
- Domain generalization: Performance declines under unseen procedures (domain shift), but leading architectures (OR-UNet, Mask R-CNN, DeepLabV3+) degrade modestly (e.g., DSC drop from 0.88 to 0.85) in ROBUST-MIS stage 3 (Ross et al., 2020, Wilms et al., 2022).
- Small and overlapping instances: Multi-instance segmentation accuracy drops steeply when more than two instruments are present; techniques like dense pyramid attention, multi-scale attention, or structure-aware prompting provide only a partial remedy (Ross et al., 2020, Yue et al., 2023).
- Specular highlights, smoke, and occlusion: All remain persistent sources of segmentation error, especially on fine tool tips and small parts. Approaches such as synthetic augmentation of specular highlights or multi-angle feature aggregation mitigate but do not eliminate these artifacts (Qin et al., 2020).
- Annotation limitations: Single-annotator noise, inability to encode holes or ambiguous boundaries, and sparse sampling of rare classes motivate the adoption of more robust annotation and label-propagation pipelines (Allan et al., 2019).
- Inference speed and real-time constraints: High-throughput, low-latency architectures (lightweight ResNets, Mask2Former, efficient promptable methods) are increasingly crucial for intraoperative deployment (Pakhomov et al., 2020, Yue et al., 2023); a simple latency check is sketched below.
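A simple way to check such latency budgets is a throughput harness like the sketch below; the frame size, warm-up count, and iteration count are illustrative choices (pass device="cpu" when no GPU is available).

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, size=(1, 3, 512, 640), n_warmup=10, n_iters=100, device="cuda"):
    model = model.to(device).eval()
    frame = torch.randn(size, device=device)
    for _ in range(n_warmup):           # warm up kernels and caches
        model(frame)
    if device == "cuda":
        torch.cuda.synchronize()        # flush queued GPU work before timing
    start = time.perf_counter()
    for _ in range(n_iters):
        model(frame)
    if device == "cuda":
        torch.cuda.synchronize()
    return n_iters / (time.perf_counter() - start)  # frames per second
```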
Active directions include promptable open-vocabulary methods, part-to-whole segmentation, video-level temporal consistency, and multimodal data fusion.
7. Trends, State-of-the-Art, and Future Directions
Instrument segmentation has transitioned from static, per-frame U-Net models through deep residual and instance-aware architectures to prompt-driven, transformer-based, and multimodal vision-language foundation models. Notable advances include:
- Transformers and promptable models are now at the empirical frontier, with Mask2Former variants, MATIS, and LACOSTE achieving state-of-the-art results in both semantic and instance-aware tracks, and hybrid prompt encoders (e.g., Collaborative Prompts in SP-SAM) nearly matching fully supervised baselines at a fraction of the tuning cost (Ayobi et al., 2023, Yue et al., 2023, Wang et al., 2024).
- Domain adaptation and unsupervised learning: Consistency-based, shape-focused, and semantic-diffusion methods have demonstrated competitive performance without labeled data, suggesting a viable path to annotation-light, domain-robust instrument perception (Sahu et al., 2020, Liu et al., 2020).
- Referring and open-vocabulary segmentation: Vision-language approaches support interactive and language-driven workflows, promising greater flexibility and integration in surgical settings (Zhou et al., 2023, Yue et al., 2023, Yue et al., 2023, Wang et al., 2023); a toy scoring sketch follows this list.
- Video and stereo context: Leveraging temporal and stereo cues, e.g. with disparity-guided feature fusion and set-classification transformers, has improved segmentation stability under challenging intraoperative conditions (Wang et al., 2024).
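As a toy illustration of the referring-segmentation idea, the sketch below scores candidate instrument masks against a text-prompt embedding in a shared feature space. The random tensors stand in for real vision and language encoders (e.g., CLIP-style models), so only the mask-pooling and scoring logic is meaningful.

```python
import torch
import torch.nn.functional as F

dim = 256
image_feats = torch.randn(1, dim, 32, 40)   # backbone feature map (stand-in)
masks = torch.rand(3, 32, 40) > 0.5         # three candidate instrument masks
text_embed = torch.randn(dim)               # embedding of e.g. "needle driver"

def mask_pooled_embedding(feats, mask):
    # Average the image features inside one candidate mask.
    m = mask.float().unsqueeze(0).unsqueeze(0)  # (1, 1, H, W)
    return (feats * m).sum(dim=(2, 3)).squeeze(0) / m.sum().clamp(min=1.0)

scores = torch.stack([
    F.cosine_similarity(mask_pooled_embedding(image_feats, m), text_embed, dim=0)
    for m in masks
])
best = scores.argmax().item()  # index of the mask best matching the prompt
```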
Persistent research challenges include robust small-part and multi-instance separation, label-efficient adaptation to new domains and tools, and real-time deployment in complex, dynamic clinical environments. Ongoing developments in foundation model fine-tuning, multimodal fusion, and part-aware architectures are driving the field toward higher utility, generalizability, and clinical readiness.