
DepSeg: Depth-Guided Surgical Segmentation

Updated 12 December 2025
  • The paper presents a modular depth-guided segmentation approach using geometric operators and template-based classification to reduce annotation effort.
  • It details how monocular depth priors and RGB-D fusion architectures are used to disambiguate challenging surgical scene features.
  • Quantitative analyses contrast DepSeg's annotation efficiency with the higher mIoU achieved by fully supervised RGB-D methods such as SurgDepth.

Depth-guided surgical scene segmentation (DepSeg) encompasses computational frameworks for partitioning images of surgical scenes into semantically distinct regions using depth information as a principal cue, notably improving segmentation quality and annotation efficiency in minimally annotated or high-complexity scenarios. DepSeg approaches leverage monocular depth priors or explicit RGB-D fusion to guide object proposals and disambiguate visually challenging structures, capitalizing on both geometric and appearance-based features (Yang et al., 5 Dec 2025, Jamal et al., 29 Jul 2024).

1. Monocular Depth-Guided Segmentation Frameworks

Training-free depth-guided segmentation, as exemplified by the DepSeg pipeline (Yang et al., 5 Dec 2025), operates in a strictly modular, non-end-to-end manner. The input is a single RGB frame $x \in \mathbb{R}^{H \times W \times 3}$; a monocular depth estimator (e.g., DepthAnythingV2) infers a dense relative depth map $d = f_\mathrm{depth}(x) \in \mathbb{R}^{H \times W}$. This map is median-centered on a per-image basis:

$$d_\mathrm{rel}(p) = d(p) - \mathrm{median}_{q \in \Omega}\{d(q)\}, \quad p \in \Omega.$$
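
A minimal NumPy sketch of the median-centering step is given below; the `depth` array stands in for the output of a monocular estimator such as DepthAnythingV2, whose own API is not shown, and the helper name `median_center` is illustrative.

```python
import numpy as np

def median_center(depth: np.ndarray) -> np.ndarray:
    """Center a dense relative depth map around its per-image median,
    i.e. d_rel(p) = d(p) - median_{q in Omega} d(q)."""
    return depth - np.median(depth)

# Placeholder depth map; in DepSeg this would come from a frozen
# monocular estimator (e.g., DepthAnythingV2), not random values.
depth = np.random.rand(480, 640).astype(np.float32)
d_rel = median_center(depth)
assert abs(np.median(d_rel)) < 1e-5  # median is ~0 after centering
```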

Spatial segmentation seeds are extracted by a deterministic geometric operator $G$ acting on $d_\mathrm{rel}$:

  • K-means clustering in depth space to group pixels by contiguous depth,
  • Canny edge detection and Otsu thresholding to delineate boundaries,
  • Watershed segmentation to split overlarge regions.

Morphological closing/opening produces cleaned candidate regions $\tilde R_m$. For each, the Euclidean distance transform $D_m(p)$ yields local maxima that serve as point prompts $S_m$, spaced at least $d_\mathrm{min}$ apart and away from the image border.
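
The following sketch illustrates the spirit of this seed-extraction stage under simplifying assumptions: it substitutes a single k-means clustering of depth values for the full combination of k-means, Canny/Otsu and watershed operators, then applies morphological cleaning and distance-transform peak picking to obtain point prompts. Function and parameter names are illustrative, not taken from the DepSeg code.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from sklearn.cluster import KMeans

def depth_point_prompts(d_rel: np.ndarray, n_clusters: int = 4,
                        d_min: int = 20, border: int = 10):
    """Derive point prompts from a median-centered relative depth map.

    Simplified stand-in for the geometric operator G: depth values are
    grouped by k-means, each candidate region is cleaned morphologically,
    and local maxima of the Euclidean distance transform become prompts,
    spaced at least d_min pixels apart and away from the image border.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        d_rel.reshape(-1, 1)).reshape(d_rel.shape)

    prompts = []
    for c in range(n_clusters):
        region = ndi.binary_closing(
            ndi.binary_opening(labels == c, iterations=2), iterations=2)
        if not region.any():
            continue
        dist = ndi.distance_transform_edt(region)
        for row, col in peak_local_max(dist, min_distance=d_min,
                                       exclude_border=border):
            prompts.append((int(row), int(col)))
    return prompts  # (row, col) seeds for a promptable segmenter such as SAM2
```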

Point prompts $S = \bigcup_m S_m$ are consumed by a pretrained, frozen SAM2 image predictor, yielding soft score maps per prompt. Binary masks $M^{(0)}_i(p) = \mathbf{1}\{s_i(p) > \tau_s\}$ are extracted, morphologically filtered, and merged via a small-area-first priority scheme to yield a non-overlapping set $\widehat{\mathcal{M}} = \{M_j\}_{j=1}^J$.
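
A small sketch of one way to realize the small-area-first merging, assuming the SAM2 score maps have already been thresholded at $\tau_s$ into binary candidate masks (the SAM2 prompting call itself is omitted, and the function name is illustrative):

```python
import numpy as np

def merge_masks_small_first(masks: list) -> list:
    """Merge candidate binary masks into a non-overlapping set by giving
    smaller masks priority: pixels already claimed by a smaller mask are
    removed from any larger mask processed later.

    `masks` is a list of HxW boolean arrays, e.g. thresholded and
    morphologically filtered SAM2 score maps.
    """
    order = np.argsort([m.sum() for m in masks])   # smallest areas first
    occupied = np.zeros_like(masks[0], dtype=bool)
    merged = []
    for idx in order:
        m = masks[idx] & ~occupied                 # drop already-claimed pixels
        if m.any():
            merged.append(m)
            occupied |= m
    return merged
```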

2. Template-Based Mask Classification

Region descriptors are computed for each $M_j$ by masked average pooling of DINOv3 vision foundation model tokens. A template bank $\mathcal{T}_c$ is assembled offline by pooling DINOv3 features over class-masked regions in $N$ annotated frames:

$$h_{n,c} = \frac{1}{Z_{n,c}} \sum_{u,v} M'_{n,c}(u,v)\, T^{(n)}(u,v,:), \qquad Z_{n,c} = \sum_{u,v} M'_{n,c}(u,v),$$

with L2 normalization. Per-class semantic assignment for predicted masks proceeds by cosine-similarity matching to templates, summing the top-$k$ template scores $S_{j,c}$ and assigning class $\hat c_j = \arg\max_c S_{j,c}$. The final pixelwise map $y = F(x)$ results from the merged labeled masks.
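
The sketch below mirrors the pooling and top-$k$ cosine-matching equations above, assuming DINOv3 token maps and template features have been precomputed and L2-normalized; names such as `masked_average_pool` and `classify_masks` are illustrative.

```python
import numpy as np

def masked_average_pool(tokens: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Pool an (H, W, D) token map over a binary mask and L2-normalize,
    mirroring h_{n,c} = (1 / Z_{n,c}) * sum_p M'(p) T(p)."""
    z = max(mask.sum(), 1)
    h = (tokens * mask[..., None]).sum(axis=(0, 1)) / z
    return h / (np.linalg.norm(h) + 1e-8)

def classify_masks(mask_feats: np.ndarray, template_bank: dict,
                   top_k: int = 3) -> np.ndarray:
    """Assign each pooled mask descriptor (rows of a (J, D) array) the class
    whose top-k template cosine similarities sum highest (S_{j,c})."""
    labels = np.empty(len(mask_feats), dtype=int)
    for j, f in enumerate(mask_feats):
        scores = {}
        for c, templates in template_bank.items():   # templates: (N_c, D), normalized
            sims = templates @ f                      # cosine similarities
            k = min(top_k, len(sims))
            scores[c] = np.sort(sims)[-k:].sum()      # S_{j,c}
        labels[j] = max(scores, key=scores.get)       # hat{c}_j = argmax_c S_{j,c}
    return labels
```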

3. RGB-D Fusion Architectures for Surgical Segmentation

Alternatives to prompt-driven segmentation include end-to-end trainable RGB-D fusion networks. SurgDepth (Jamal et al., 29 Jul 2024) exemplifies this paradigm, integrating RGB and predicted depth maps in a joint ViT encoder–decoder scheme. The architecture comprises:

  • Parallel projection heads for RGB and depth,
  • A lightweight “3D awareness” fusion block performing attention between concatenated RGB/depth features and RGB-alone keys/values, with outputs upsampled and linearly separated back into modality-specific streams,
  • A ViT-B encoder (pretrained via image MAE objectives),
  • A shallow ConvNeXt decoder.

The fusion block pools $[X_i^{\mathrm{rgb}}; X_i^{\mathrm{depth}}]$ to $k \times k$ grids, computes queries, keys, and values via $1 \times 1$ projections, applies attention, and forms modality-updated representations:

$$\begin{aligned}
Q &= W_q\!\left(\mathrm{Pool}_{k \times k}\!\left([X_i^{\mathrm{rgb}}, X_i^{\mathrm{depth}}]\right)\right), \\
K &= W_k(X_i^{\mathrm{rgb}}), \quad V = W_v(X_i^{\mathrm{rgb}}), \\
A &= \mathrm{Softmax}\!\left(\frac{Q^\top K}{\sqrt{C_d}}\right), \\
X_\mathrm{fusion} &= \mathrm{Upsample}_{h,w}(V A), \\
(\widehat{X}_i^{\mathrm{rgb}}, \widehat{X}_i^{\mathrm{depth}}) &= (\mathrm{FC}_{\mathrm{rgb}},\, \mathrm{FC}_{\mathrm{depth}})(X_\mathrm{fusion}).
\end{aligned}$$
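
A minimal PyTorch sketch of one plausible reading of this block appears below; the pooling size $k$, the tensor layout, and the orientation of the attention product are assumptions, not the reference SurgDepth implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthFusionBlock(nn.Module):
    """Sketch of a '3D awareness' fusion block: queries come from a k x k
    pooled concatenation of RGB and depth features, keys/values from the
    RGB stream alone, and the attended output is upsampled and split back
    into modality-specific streams via two 1x1 projections."""

    def __init__(self, channels: int, k: int = 7):
        super().__init__()
        self.k = k
        self.q = nn.Conv2d(2 * channels, channels, kernel_size=1)
        self.kv = nn.Conv2d(channels, 2 * channels, kernel_size=1)
        self.fc_rgb = nn.Conv2d(channels, channels, kernel_size=1)
        self.fc_depth = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x_rgb: torch.Tensor, x_depth: torch.Tensor):
        b, c, h, w = x_rgb.shape
        # Q from pooled [X_rgb ; X_depth]; K, V from X_rgb only.
        pooled = F.adaptive_avg_pool2d(torch.cat([x_rgb, x_depth], dim=1), self.k)
        q = self.q(pooled).flatten(2).transpose(1, 2)                        # (B, k*k, C)
        k_, v = self.kv(x_rgb).flatten(2).transpose(1, 2).chunk(2, dim=-1)   # (B, h*w, C) each
        attn = torch.softmax(q @ k_.transpose(1, 2) / c ** 0.5, dim=-1)      # scaled by sqrt(C_d)
        fused = (attn @ v).transpose(1, 2).reshape(b, c, self.k, self.k)     # (B, C, k, k)
        fused = F.interpolate(fused, size=(h, w), mode="bilinear", align_corners=False)
        return self.fc_rgb(fused), self.fc_depth(fused)  # updated RGB / depth streams
```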

The final pixelwise semantic logits are trained with a cross-entropy loss:

$$\mathcal{L}_{\mathrm{seg}} = -\frac{1}{HW} \sum_{u=1}^H \sum_{v=1}^W \sum_{c=1}^C y_{u,v,c} \log p_{u,v,c}.$$

4. Quantitative Performance and Annotation Efficiency

Performance metrics are consistently reported via mean intersection-over-union (mIoU):

$$\mathrm{IoU}_c = \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c}, \qquad \mathrm{mIoU} = \frac{1}{C}\sum_{c=1}^C \mathrm{IoU}_c.$$
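
For reference, a small NumPy routine computing mIoU from integer label maps; unlike the strict $1/C$ average above, it skips classes absent from both prediction and ground truth to avoid division by zero.

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """mIoU over integer label maps: IoU_c = TP_c / (TP_c + FP_c + FN_c),
    averaged over classes that appear in prediction or ground truth."""
    ious = []
    for c in range(num_classes):
        tp = np.logical_and(pred == c, gt == c).sum()
        fp = np.logical_and(pred == c, gt != c).sum()
        fn = np.logical_and(pred != c, gt == c).sum()
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return float(np.mean(ious)) if ious else 0.0
```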

On the CholecSeg8k dataset (Yang et al., 5 Dec 2025):

| Method | mIoU (%) |
| --- | --- |
| Mask2Former (supervised) | 78.4 |
| SAM2 + 8D cosine-matching | 14.7 |
| SAM2 + DINOv3 template-match | 31.6 |
| DepSeg (depth prompts, full) | 35.9 |

Annotation efficiency is achieved by template selection: using only 10-20% of the templates in the bank yields 31.2-33.6% mIoU (vs. 35.9% with all templates), reducing annotation effort by roughly 80-90% with minimal quality loss. No further model finetuning is necessary; only template features are precomputed and stored.

In RGB-D fusion paradigms, SurgDepth achieves the following SOTA mIoU values (Jamal et al., 29 Jul 2024):

| Dataset | SurgDepth mIoU |
| --- | --- |
| SAR-RARP50 | 0.862 |
| CholecSeg8k | 0.556 |
| LapI2I | 0.981 |
| EndoVis2017 | 0.613 (challenge IoU) |

Early fusion outperforms late fusion, and increasing ConvNeXt decoder depth past four blocks does not boost accuracy.

5. Analysis of Methodological Differences

DepSeg and SurgDepth represent the two principal strategies in depth-guided segmentation: training-free geometric prompting with template-based labeling versus end-to-end deep RGB-D joint encoding. DepSeg is annotation-efficient, requiring only a template bank and frozen pretrained networks, with no additional model training or finetuning. The procedure is stable, deterministic, and modular, but yields a significantly lower upper-bound mIoU compared to fully-supervised or joint RGB-D training. SurgDepth, by contrast, achieves higher absolute accuracy (near-supervised SOTA) but at the cost of model retraining, complete dataset-specific supervision, and heavier computational resources.

A plausible implication is that DepSeg methods are best suited for rapid, low-resource bootstrapping scenarios or where annotation cost is prohibitive, while RGB-D fusion methods dominate where dense ground truth is available and maximizing segmentation accuracy is critical.

6. Limitations and Future Directions

Current depth-guided segmentation faces several critical challenges:

  • Rare class performance: Both DepSeg and RGB-D models achieve near-zero IoU on rare structures, e.g., hepatic veins (Yang et al., 5 Dec 2025).
  • Depth estimation domain gaps: Monocular depth networks, often pretrained on natural images, degrade under strong specularities, blood, and tissue deformation prevalent in surgical scenes (Jamal et al., 29 Jul 2024).
  • Temporal consistency: Existing DepSeg pipelines operate frame-wise; no temporal smoothing or track-based mask refinement is applied.
  • Template selection and coverage: Uniform template sampling fails for fine-grained or underrepresented anatomical categories.

Potential improvements include smarter active sampling for templates, integration of temporal cues (e.g., recurrent or video-based modules), leveraging metric depth/3D priors if available, and introducing lightweight finetuning to adapt pretrained segmenters to surgical domains. The plug-and-play structure of RGB-D fusion blocks further facilitates extension to multi-modal and multi-view surgical scene parsing.

7. Context Within the Broader Field

The introduction of depth-guided prompting and template-based classification (Yang et al., 5 Dec 2025) aligns with ongoing efforts to reduce annotation burden in medical imaging and to exploit geometric cues unavailable to RGB-only segmenters. SurgDepth (Jamal et al., 29 Jul 2024) demonstrates that ViT-based encoder–decoder models with explicit depth fusion can achieve or exceed the state of the art on multiple public benchmarks with modest increases in computational cost. Both methods underscore the importance of robust depth priors for improving mask delineation in challenging scenarios with poor color contrast, specular artifacts, and occlusions.

Depth-guided segmentation continues to evolve, with open challenges including real-time deployment, robustness to domain shift, and construction of comprehensive, calibrated RGB-D surgical datasets for further benchmarking and development.
