DepSeg: Depth-Guided Surgical Segmentation
- The paper presents a modular depth-guided segmentation approach using geometric operators and template-based classification to reduce annotation effort.
- It details how monocular depth priors and RGB-D fusion architectures are used to disambiguate challenging surgical scene features.
- Quantitative analyses contrast DepSeg’s annotation efficiency with the higher mIoU achieved by fully supervised RGB-D methods such as SurgDepth.
Depth-guided surgical scene segmentation (DepSeg) encompasses computational frameworks for partitioning images of surgical scenes into semantically distinct regions using depth information as a principal cue, notably improving segmentation quality and annotation efficiency in minimally annotated or high-complexity scenarios. DepSeg approaches leverage monocular depth priors or explicit RGB-D fusion to guide object proposals and disambiguate visually challenging structures, capitalizing on both geometric and appearance-based features (Yang et al., 5 Dec 2025, Jamal et al., 29 Jul 2024).
1. Monocular Depth-Guided Segmentation Frameworks
Training-free depth-guided segmentation, as exemplified by the DepSeg pipeline (Yang et al., 5 Dec 2025), operates in a strictly modular, non-end-to-end manner. The input is a single RGB frame $I \in \mathbb{R}^{H \times W \times 3}$; a monocular depth estimator (e.g., DepthAnythingV2) infers a dense relative depth map $D \in \mathbb{R}^{H \times W}$. This map is median-centered on a per-image basis:

$$\hat{D}(x, y) = D(x, y) - \operatorname{median}(D).$$
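A minimal sketch of this depth-prior stage, assuming a generic `depth_model` callable that wraps the monocular estimator (the actual DepthAnythingV2 interface may differ):

```python
import numpy as np

def median_centered_depth(rgb, depth_model):
    """Infer a relative depth map and center it at its per-image median.

    `depth_model` is assumed to map an HxWx3 RGB array to an HxW relative
    depth array (e.g., a thin wrapper around DepthAnythingV2).
    """
    depth = np.asarray(depth_model(rgb), dtype=np.float32)  # relative depth, arbitrary scale
    return depth - np.median(depth)                         # per-image median centering
```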
Spatial segmentation seeds are extracted by a deterministic geometric operator acting on $\hat{D}$:
- K-means clustering in depth space to group pixels by contiguous depth,
- Canny edge detection and Otsu thresholding to delineate boundaries,
- Watershed segmentation to split overlarge regions.
Morphological closing/opening produces cleaned candidate regions $\{R_j\}$. For each region, the Euclidean distance transform yields local maxima that serve as point prompts, spaced at least a minimum distance apart and kept away from the image border (see the sketch below).
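A sketch of the geometric prompting stage under assumed parameter values (cluster count, minimum area, prompt spacing, and border margin are illustrative, not the published settings); Otsu thresholding and watershed splitting of overlarge regions are omitted for brevity:

```python
import cv2
import numpy as np
from scipy import ndimage
from skimage.feature import peak_local_max

def depth_seed_prompts(depth_centered, n_clusters=4, min_area=500,
                       min_peak_dist=40, border=10):
    """Derive point prompts from a median-centered depth map."""
    d_norm = cv2.normalize(depth_centered, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

    # 1. K-means in depth space groups pixels into contiguous depth bands.
    samples = d_norm.reshape(-1, 1).astype(np.float32)
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 20, 1.0)
    _, labels, _ = cv2.kmeans(samples, n_clusters, None, criteria, 3, cv2.KMEANS_PP_CENTERS)
    labels = labels.reshape(d_norm.shape)

    # 2. Canny edges delineate region boundaries in the depth map.
    edges = cv2.Canny(d_norm, 50, 150)

    prompts = []
    kernel = np.ones((5, 5), np.uint8)
    for c in range(n_clusters):
        region = ((labels == c) & (edges == 0)).astype(np.uint8)

        # 3. Morphological closing/opening cleans the candidate region.
        region = cv2.morphologyEx(region, cv2.MORPH_CLOSE, kernel)
        region = cv2.morphologyEx(region, cv2.MORPH_OPEN, kernel)
        if region.sum() < min_area:
            continue

        # 4. Distance-transform maxima become point prompts, spaced apart
        #    and kept away from the image border.
        dist = ndimage.distance_transform_edt(region)
        peaks = peak_local_max(dist, min_distance=min_peak_dist, exclude_border=border)
        prompts.extend([(int(x), int(y)) for y, x in peaks])  # (x, y) convention
    return prompts
```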
Point prompts are consumed by a pretrained, frozen SAM2 image predictor, yielding soft score maps per prompt. Binary masks are extracted, morphologically filtered, and merged via a small-area-first priority scheme to yield a non-overlapping mask set $\{M_k\}$ (sketched below).
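A sketch of prompt-driven mask generation and small-area-first merging, assuming a SAM2-style image predictor interface (`set_image` / `predict`); the score threshold and morphological filtering details are omitted assumptions:

```python
import numpy as np

def prompts_to_masks(predictor, image, prompts, score_thresh=0.5):
    """Run each point prompt through a frozen SAM2-style predictor and
    merge the resulting binary masks with small-area-first priority."""
    predictor.set_image(image)
    masks = []
    for (x, y) in prompts:
        out_masks, scores, _ = predictor.predict(
            point_coords=np.array([[x, y]]),
            point_labels=np.array([1]),
            multimask_output=True,
        )
        best = int(np.argmax(scores))
        if scores[best] >= score_thresh:
            masks.append(out_masks[best].astype(bool))

    # Small-area-first priority: smaller masks claim their pixels before
    # larger ones, producing a non-overlapping set.
    masks.sort(key=lambda m: m.sum())
    occupied = np.zeros(image.shape[:2], dtype=bool)
    merged = []
    for m in masks:
        m = m & ~occupied
        if m.any():
            merged.append(m)
            occupied |= m
    return merged
```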
2. Template-Based Mask Classification
Region descriptors $f_k$ are computed for each predicted mask $M_k$ by masked average pooling of DINOv3 vision foundation model tokens. A template bank is assembled offline by pooling DINOv3 features over class-masked regions in annotated frames,

$$t_{c,j} = \frac{1}{|M_{c,j}|} \sum_{p \in M_{c,j}} F(p), \qquad t_{c,j} \leftarrow \frac{t_{c,j}}{\lVert t_{c,j} \rVert_2},$$

where $F(p)$ is the DINOv3 token at position $p$ and $M_{c,j}$ is the $j$-th annotated region of class $c$. Semantic assignment for a predicted mask proceeds by cosine-similarity matching to the templates, summing the top-$k$ similarities per class and assigning $\hat{c} = \arg\max_c \sum_{j \in \text{top-}k} \cos(f_k, t_{c,j})$. The final pixelwise map results from the merged labeled masks (a sketch follows).
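A minimal sketch of the template-matching step, assuming DINOv3 patch tokens have already been extracted and the mask downsampled to the patch grid; `top_k=5` and the bank layout are illustrative assumptions:

```python
import numpy as np

def masked_descriptor(tokens, mask):
    """Masked average pooling of patch tokens, L2-normalized.

    `tokens`: (H_p, W_p, D) patch features; `mask`: boolean (H_p, W_p).
    """
    f = tokens[mask].mean(axis=0)
    return f / (np.linalg.norm(f) + 1e-8)

def classify_mask(descriptor, template_bank, top_k=5):
    """Assign a class by summing the top-k cosine similarities per class.

    `template_bank` maps class name -> (N_c, D) array of L2-normalized
    template descriptors.
    """
    best_class, best_score = None, -np.inf
    for cls, templates in template_bank.items():
        sims = templates @ descriptor             # cosine similarity (unit vectors)
        score = np.sort(sims)[-top_k:].sum()      # top-k sum per class
        if score > best_score:
            best_class, best_score = cls, score
    return best_class, best_score
```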
3. RGB-D Fusion Architectures for Surgical Segmentation
Alternatives to prompt-driven segmentation include end-to-end trainable RGB-D fusion networks. SurgDepth (Jamal et al., 29 Jul 2024) exemplifies this paradigm, integrating RGB and predicted depth maps in a joint ViT encoder–decoder scheme. The architecture comprises:
- Parallel projection heads for RGB and depth,
- A lightweight “3D awareness” fusion block performing attention between concatenated RGB/depth features and RGB-alone keys/values, with outputs upsampled and linearly separated back into modality-specific streams,
- A ViT-B encoder (pretrained via image MAE objectives),
- A shallow ConvNeXt decoder.
The fusion block pools both streams to a coarse grid, computes queries $Q$ from the concatenated RGB/depth features and keys/values $K, V$ from the RGB stream via linear projections, applies scaled dot-product attention,

$$Z = \operatorname{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V,$$

and forms modality-updated representations that are upsampled and projected back into the RGB and depth streams.
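A minimal PyTorch sketch of such a fusion block, with the pooled grid size, feature dimension, and residual update scheme chosen for illustration rather than taken from the published SurgDepth implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthAwareFusion(nn.Module):
    """Sketch of a '3D awareness' fusion block: queries from concatenated
    RGB+depth tokens, keys/values from RGB-only tokens (assumed layout)."""

    def __init__(self, dim=768, pool_size=8, heads=8):
        super().__init__()
        self.pool_size = pool_size
        self.q = nn.Linear(2 * dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.split = nn.Linear(dim, 2 * dim)  # back to per-modality streams

    def forward(self, rgb, depth):
        # rgb, depth: (B, C, H, W) feature maps from the parallel projection heads.
        B, C, H, W = rgb.shape
        g = self.pool_size
        rgb_p = F.adaptive_avg_pool2d(rgb, g).flatten(2).transpose(1, 2)    # (B, g*g, C)
        dep_p = F.adaptive_avg_pool2d(depth, g).flatten(2).transpose(1, 2)  # (B, g*g, C)

        q = self.q(torch.cat([rgb_p, dep_p], dim=-1))   # queries: RGB + depth
        k, v = self.kv(rgb_p).chunk(2, dim=-1)          # keys/values: RGB only
        fused, _ = self.attn(q, k, v)                   # (B, g*g, C)

        # Upsample back to (H, W) and separate into modality-specific updates.
        fused = fused.transpose(1, 2).reshape(B, C, g, g)
        fused = F.interpolate(fused, size=(H, W), mode="bilinear", align_corners=False)
        upd = self.split(fused.flatten(2).transpose(1, 2))          # (B, H*W, 2C)
        upd = upd.transpose(1, 2).reshape(B, 2 * C, H, W)
        rgb_upd, dep_upd = upd.chunk(2, dim=1)
        return rgb + rgb_upd, depth + dep_upd
```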
The final pixelwise semantic logits are trained with a standard cross-entropy loss:

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N} \sum_{i=1}^{N} \log p_{i, y_i},$$

where $p_{i, y_i}$ is the softmax probability assigned to the ground-truth class $y_i$ at pixel $i$.
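A minimal supervised training step under this loss; the `(rgb, depth, target)` batch layout and model signature are assumptions for illustration:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

def training_step(model, batch, optimizer):
    """One supervised step: per-pixel class logits scored with cross-entropy."""
    rgb, depth, target = batch              # target: (B, H, W) integer class map
    logits = model(rgb, depth)              # (B, num_classes, H, W)
    loss = criterion(logits, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```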
4. Quantitative Performance and Annotation Efficiency
Performance is consistently reported via mean intersection-over-union (mIoU):

$$\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{\mathrm{TP}_c}{\mathrm{TP}_c + \mathrm{FP}_c + \mathrm{FN}_c},$$

averaged over the $C$ semantic classes.
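An illustrative computation of this metric over integer label maps (not the evaluation code of either paper):

```python
import numpy as np

def mean_iou(pred, target, num_classes, ignore_index=None):
    """Per-class IoU averaged over classes; `pred` and `target` are
    integer label maps of identical shape."""
    ious = []
    for c in range(num_classes):
        if ignore_index is not None and c == ignore_index:
            continue
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue                      # class absent from both maps
        ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious)) if ious else 0.0
```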
On the CholecSeg8k dataset (Yang et al., 5 Dec 2025), reported values are:
| Method | mIoU (%) |
|---|---|
| Mask2Former (supervised) | 78.4 |
| SAM2 + 8D cosine-matching | 14.7 |
| SAM2 + DINOv3 template-match | 31.6 |
| DepSeg (depth prompts, full) | 35.9 |
Annotation efficiency is achieved through template selection: using only a fraction of the templates in the bank yields an mIoU close to that obtained with the full bank, substantially reducing annotation effort with minimal quality loss. No further model finetuning is necessary; only template features are precomputed and stored.
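A sketch of emulating a reduced annotation budget by subsampling the template bank; uniform random selection and the `fraction` parameter are assumptions, not the selection scheme reported in the paper:

```python
import numpy as np

def subsample_bank(template_bank, fraction, seed=0):
    """Uniformly subsample each class's templates to emulate a smaller
    annotation budget; `template_bank` maps class -> (N_c, D) array."""
    rng = np.random.default_rng(seed)
    reduced = {}
    for cls, templates in template_bank.items():
        n = max(1, int(round(fraction * len(templates))))
        idx = rng.choice(len(templates), size=n, replace=False)
        reduced[cls] = templates[idx]
    return reduced
```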
In RGB-D fusion paradigms, SurgDepth achieves the following SOTA mIoU values (Jamal et al., 29 Jul 2024):
| Dataset | SurgDepth mIoU |
|---|---|
| SAR-RARP50 | 0.862 |
| CholecSeg8k | 0.556 |
| LapI2I | 0.981 |
| EndoVis2017 | 0.613 (challenge IoU) |
Early fusion outperforms late fusion, and increasing ConvNeXt decoder depth past four blocks does not boost accuracy.
5. Analysis of Methodological Differences
DepSeg and SurgDepth represent the two principal strategies in depth-guided segmentation: training-free geometric prompting with template-based labeling versus end-to-end deep RGB-D joint encoding. DepSeg is annotation-efficient, requiring only a template bank and frozen pretrained networks, with no additional model training or finetuning. The procedure is stable, deterministic, and modular, but yields a significantly lower upper-bound mIoU compared to fully-supervised or joint RGB-D training. SurgDepth, by contrast, achieves higher absolute accuracy (near-supervised SOTA) but at the cost of model retraining, complete dataset-specific supervision, and heavier computational resources.
A plausible implication is that DepSeg methods are best suited for rapid, low-resource bootstrapping scenarios or where annotation cost is prohibitive, while RGB-D fusion methods dominate where dense ground truth is available and maximizing segmentation accuracy is critical.
6. Limitations and Future Directions
Current depth-guided segmentation faces several critical challenges:
- Rare class performance: Both DepSeg and RGB-D models assign near-zero IoU to rare structures such as hepatic veins (Yang et al., 5 Dec 2025).
- Depth estimation domain gaps: Monocular depth networks, often pretrained on natural images, degrade under strong specularities, blood, and tissue deformation prevalent in surgical scenes (Jamal et al., 29 Jul 2024).
- Temporal consistency: Existing DepSeg pipelines operate frame-wise; no temporal smoothing or track-based mask refinement is applied.
- Template selection and coverage: Uniform template sampling fails for fine-grained or underrepresented anatomical categories.
Potential improvements include smarter active sampling for templates, integration of temporal cues (e.g., recurrent or video-based modules), leveraging metric depth/3D priors if available, and introducing lightweight finetuning to adapt pretrained segmenters to surgical domains. The plug-and-play structure of RGB-D fusion blocks further facilitates extension to multi-modal and multi-view surgical scene parsing.
7. Context Within the Broader Field
The introduction of depth-guided prompting and template-based classification (Yang et al., 5 Dec 2025) aligns with ongoing efforts to reduce annotation burden in medical imaging and to exploit geometric cues unavailable to RGB-only segmenters. SurgDepth (Jamal et al., 29 Jul 2024) demonstrates that ViT-based encoder–decoder models with explicit depth fusion can achieve or exceed the state of the art on multiple public benchmarks with modest increases in computational cost. Both methods underscore the importance of robust depth priors for improving mask delineation in challenging scenarios with poor color contrast, specular artifacts, and occlusions.
Depth-guided segmentation continues to evolve, with open challenges including real-time deployment, robustness to domain shift, and construction of comprehensive, calibrated RGB-D surgical datasets for further benchmarking and development.