Depth-Guided Point Prompts in Vision
- Depth-guided point prompts leverage monocular or LiDAR depth cues through clustering and morphological processing, with the training-free DepSeg pipeline reaching 35.9% mIoU in surgical scene segmentation.
- The paradigm extends to 3D vision via learned prompt points and confidence-weighted shifts in frameworks such as GAPrompt, enabling parameter-efficient fine-tuning with roughly 2% of trainable parameters, while LiDAR-prompted detectors such as PromptDet report large mAP gains.
- Beyond segmentation and detection, these methods enable scalable multi-modal fusion for applications such as 3D reconstruction and robotic manipulation, with promising future extensions including temporal prompt adaptation.
Depth-guided point prompts are an architectural and algorithmic paradigm where spatial depth cues, either estimated from images or directly sensed, inform the generation, selection, or weighting of point-based prompts to guide perception models, especially for segmentation, detection, or geometric understanding. These techniques operationalize geometric priors—such as monocular or LiDAR-derived depth—by forming explicit point sets or embedding vectors that steer the downstream model toward semantically meaningful, geometrically coherent outputs with minimal additional supervision or parameter overhead. The following sections provide a technical synthesis spanning 2D surgical scene segmentation, 3D point cloud models, and multi-modal fusion for object detection.
1. Motivation and Theoretical Underpinnings
The integration of geometric priors—particularly relative or metric depth—into vision model prompting addresses foundational challenges in segmentation and detection: (1) resolving visual ambiguities between spatially contiguous but semantically distinct regions, (2) overcoming limitations of color/texture-based grouping, and (3) achieving annotation efficiency and computational scalability. These priors are essential where dense manual labeling is infeasible, such as large-scale surgical datasets, urban 3D scenes, or robotics scenarios. Depth-guided point prompts abstract away from reliance on network-internal representations and instead externalize explicit, geometry-aware point sets or features, in turn facilitating training-free or parameter-efficient strategies that benefit from pretrained “foundation” models (Yang et al., 5 Dec 2025, Guo et al., 17 Dec 2024, Ai et al., 7 May 2025).
2. Depth-Guided Point Prompts for Surgical Scene Segmentation
In “See in Depth: Training-Free Surgical Scene Segmentation with Monocular Depth Priors” (Yang et al., 5 Dec 2025), the DepSeg framework demonstrates a canonical depth-guided point prompt pipeline for per-pixel segmentation:
- Monocular Depth Estimation: A frozen DepthAnythingV2 network predicts a relative depth map d_rel for every pixel of the input RGB frame x. Contrast is enhanced by subtracting the median depth: d_rel ← d_rel − median(d_rel).
- Region Proposal via Depth Cues: The relative depth map is clustered (K-means), enhanced with edge (Canny) and threshold (Otsu) masks, then partitioned by watershed segmentation and morphological cleaning, yielding regions that are spatially and geometrically distinct.
- Point Prompt Selection: Within each cleaned region, a distance transform D from the region boundary is computed. Interior seed points are identified as local maxima of D, representing archetypal points deep within regions that minimize boundary ambiguity.
- Prompting the Mask Generator: Each point set S_m is provided as a prompt to the SAM2 segmentation model, yielding a soft mask that is thresholded, morphologically cleaned, and size-filtered.
- Class Assignment via Template Matching: Masks are embedded via pooled DINOv3 features and matched against a classwise template bank built offline from annotated frames. Assignment aggregates the top-k cosine similarities between the mask embedding and each class's templates (a sketch of this step follows the quantitative results below).
Algorithmic steps for the depth-guided prompt proposal are summarized in the following pseudocode (adapted from Yang et al., 5 Dec 2025):
```
Input:  RGB x, relative depth d_rel, #clusters K, min_region_area A_min
Output: {S_m} point sets

1. DepthClusters ← KMeans(d_rel, K)
2. Edges ← Canny(x); Binaries ← OtsuThreshold(d_rel)
3. Regions ← Watershed(DepthClusters ∪ Edges ∪ Binaries)
4. Cleaned ← MorphologicalClean(Regions)
5. for each region R in Cleaned do
       if area(R) < A_min then continue
       D ← DistanceTransform(R)
       S ← LocalMaxima(D, min_dist, border_margin)
       save S
   end for
   return {S_m}
```
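The same proposal logic can be sketched with off-the-shelf Python tooling. The snippet below is a minimal, assumed reimplementation (not the authors' released code) using OpenCV, scikit-learn, and scikit-image; the function names, thresholds, and the exact way the depth, edge, and Otsu cues are combined into watershed markers are illustrative choices.

```python
"""Depth-guided point-prompt proposal: an illustrative sketch, not DepSeg's code."""
import cv2
import numpy as np
from sklearn.cluster import KMeans
from skimage.feature import peak_local_max


def propose_point_prompts(x, d_rel, K=5, min_area=500, min_dist=20, border=10):
    """x: HxWx3 uint8 RGB frame; d_rel: HxW float32 relative depth (arbitrary scale)."""
    h, w = d_rel.shape
    # 1. Median-centred depth, clustered into K coarse depth layers.
    d = d_rel - np.median(d_rel)
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(d.reshape(-1, 1)).reshape(h, w)

    # 2. Edge and Otsu cues refine the cluster boundaries.
    d_u8 = cv2.normalize(d, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    edges = cv2.Canny(cv2.cvtColor(x, cv2.COLOR_RGB2GRAY), 50, 150)
    _, otsu = cv2.threshold(d_u8, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

    # 3. Watershed over markers built from depth clusters split by the Otsu mask;
    #    edge pixels are marked unknown (0) so the watershed resolves them.
    markers = (labels * 2 + (otsu > 0) + 1).astype(np.int32)
    markers[edges > 0] = 0
    markers = cv2.watershed(cv2.cvtColor(x, cv2.COLOR_RGB2BGR), markers)

    # 4. Per region: morphological cleaning, area filter, distance-transform maxima.
    point_sets, kernel = [], np.ones((5, 5), np.uint8)
    for r in np.unique(markers):
        if r <= 0:                                        # skip watershed boundaries / unknown
            continue
        region = cv2.morphologyEx((markers == r).astype(np.uint8), cv2.MORPH_OPEN, kernel)
        if region.sum() < min_area:
            continue
        dist = cv2.distanceTransform(region, cv2.DIST_L2, 5)
        peaks = peak_local_max(dist, min_distance=min_dist,
                               labels=region, exclude_border=border)
        if len(peaks):
            point_sets.append(peaks[:, ::-1])             # (row, col) -> (x, y) point prompts
    return point_sets
```

Each returned point set would then be passed as positive point prompts to SAM2's predictor, one region at a time.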
On the CholecSeg8k dataset, this depth-guided point prompt approach achieves 35.9% mIoU using full template banks (Frac=1.0, k=7), compared to only 14.7% for auto-masked SAM2 with simple RGB/area/aspect descriptors (Yang et al., 5 Dec 2025).
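The class-assignment step described above reduces to a top-k cosine-similarity vote against a per-class template bank. A minimal sketch, assuming the bank is a dictionary mapping class IDs to arrays of pooled DINOv3 embeddings (names and shapes are illustrative):

```python
"""Top-k cosine-similarity template matching: an illustrative sketch."""
import numpy as np


def assign_class(mask_emb, template_bank, k=7):
    """mask_emb: (D,) pooled mask feature; template_bank: dict class_id -> (N_c, D) array."""
    v = mask_emb / (np.linalg.norm(mask_emb) + 1e-8)
    best_cls, best_score = None, -np.inf
    for cls, templates in template_bank.items():
        t = templates / (np.linalg.norm(templates, axis=1, keepdims=True) + 1e-8)
        sims = t @ v                                        # cosine similarity to every template
        score = np.sort(sims)[-min(k, len(sims)):].mean()   # aggregate the top-k matches
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls, best_score
```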
3. Geometry-Aware and Depth-Guided Point Prompts in 3D Vision
In the 3D domain, depth-guided point prompts find rigorous realization in point cloud transformers and parameter-efficient fine-tuning methods. The “Geometry-Aware Point Cloud Prompt” (GAPrompt) framework (Ai et al., 7 May 2025) generalizes the concept:
- Point Prompt: Augment the input point cloud with a learned point set (by concatenation), providing explicit geometric guidance; depth-guided extensions append per-point confidence or depth channels.
- Point Shift Prompter: Computes a global shape code and shifts each input point by a predicted offset via an MLP, optionally conditioned on depth or confidence, thus modulating prompt influence by geometric certainty (a minimal sketch of both components follows this list).
- Prompt Propagation: At each transformer block, prompts are injected into local neighborhoods (via FPS/KNN), propagated by confidence-weighted interpolation, and included in token mixing/attention.
- Losses: Additional losses enforce shift consistency, depth alignment, and prompt-geometry range constraints.
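A compact PyTorch sketch of the first two components (learnable point prompt plus confidence-weighted point shift) is given below; it is an assumed reimplementation whose module names, sizes, and weighting scheme are illustrative rather than GAPrompt's exact definitions.

```python
"""Learnable point prompt + confidence-weighted point shift: an illustrative sketch."""
import torch
import torch.nn as nn


class GeometryPromptedInput(nn.Module):
    def __init__(self, n_prompt_pts=32, hidden=64):
        super().__init__()
        # Learnable point prompt appended to every input cloud.
        self.prompt_pts = nn.Parameter(torch.randn(n_prompt_pts, 3) * 0.02)
        # Tiny PointNet-style encoder producing a global shape code.
        self.shape_enc = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                       nn.Linear(hidden, hidden))
        # MLP mapping [point, shape code] -> per-point 3D shift.
        self.shift_mlp = nn.Sequential(nn.Linear(3 + hidden, hidden), nn.ReLU(),
                                       nn.Linear(hidden, 3))

    def forward(self, pts, conf=None):
        """pts: (B, N, 3) point cloud; conf: optional (B, N, 1) per-point confidence."""
        B, N, _ = pts.shape
        code = self.shape_enc(pts).max(dim=1, keepdim=True).values   # (B, 1, hidden)
        shift = self.shift_mlp(torch.cat([pts, code.expand(-1, N, -1)], dim=-1))
        if conf is not None:
            shift = shift * conf          # modulate shift magnitude by geometric certainty
        prompt = self.prompt_pts.unsqueeze(0).expand(B, -1, -1)      # (B, P, 3)
        return torch.cat([pts + shift, prompt], dim=1)               # prompted cloud for the backbone
```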
Composite architectures like GAPrompt train highly competitive 3D models while updating only 2.19% of the parameters required by full fine-tuning, demonstrating that depth- or geometry-informed prompts are computationally and statistically efficient for large 3D vision tasks (Ai et al., 7 May 2025).
4. Multi-Modal Fusion and Prompted Depth Models
Depth-guided prompts also operationalize multi-modal fusion, especially for metric depth estimation and 3D object detection. In “Prompt Depth Anything” (Lin et al., 18 Dec 2024), the model accepts sparse LiDAR as point-wise or grid-wise prompts, fusing them with multiscale image features:
- Prompt Fusion in Decoding: At each scale of the depth decoder, the sparse LiDAR depth map is bilinearly resized and transformed by 3×3 convolutions whose final layer is zero-initialized, so the prompted model initially behaves exactly like the depth-only model. The processed LiDAR features are then added to the image features before further decoding (a minimal sketch follows this list).
- Training: Supervision integrates synthetic (L1 and gradient losses), real, and pseudo-GT depth (ZipNeRF). Notably, the architecture allows up to 4K-resolution outputs and supports end-to-end depth-prompted learning on diverse domains.
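A minimal sketch of the per-scale fusion step is shown below, assuming a decoder feature map and a sparse metric-depth prompt; the two-convolution branch and its sizes are illustrative, not the released architecture.

```python
"""Zero-initialized additive LiDAR prompt fusion at one decoder scale: a sketch."""
import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptFusionBlock(nn.Module):
    def __init__(self, feat_channels):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(1, feat_channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_channels, feat_channels, kernel_size=3, padding=1))
        # Zero-init the last conv so the prompted model starts out identical to the depth-only one.
        nn.init.zeros_(self.branch[-1].weight)
        nn.init.zeros_(self.branch[-1].bias)

    def forward(self, img_feat, lidar_depth):
        """img_feat: (B, C, H, W) decoder feature; lidar_depth: (B, 1, h, w) sparse metric depth."""
        lidar = F.interpolate(lidar_depth, size=img_feat.shape[-2:],
                              mode="bilinear", align_corners=False)
        return img_feat + self.branch(lidar)   # additive prompt injection at this scale
```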
In 3D object detection, PromptDet (Guo et al., 17 Dec 2024) introduces the LiDAR point cloud as a “prompt” to a camera-based BEV detector. Adaptive Hierarchical Aggregation (AHA) and Cross-Modal Knowledge Injection (CMKI) modules enable efficient fusion and transfer of LiDAR-informed depth cues:
- AHA: Multi-scale LiDAR and camera voxel features are projected, concatenated, softmax-weighted, and summed to produce a BEV prompt embedding (see the sketch after this list).
- CMKI: The camera-only branch is trained to imitate the LiDAR-prompt-aware features and responses via feature, relational, and response distillation losses.
- Results: PromptDet with LiDAR improves mAP and NDS by as much as 22.8% and 21.1% respectively over camera-only baselines, with only ∼2% parameter increase (Guo et al., 17 Dec 2024).
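The adaptive aggregation idea can be sketched as a softmax-weighted sum over per-scale BEV features; the module below is an illustrative approximation of AHA under assumed shapes, not the released PromptDet code.

```python
"""Softmax-weighted multi-scale BEV prompt aggregation: an illustrative sketch of AHA."""
import torch
import torch.nn as nn


class AdaptiveBEVAggregation(nn.Module):
    def __init__(self, in_channels, out_channels, n_scales):
        super().__init__()
        self.proj = nn.ModuleList(
            [nn.Conv2d(in_channels, out_channels, 1) for _ in range(n_scales)])
        self.score = nn.Conv2d(out_channels, 1, 1)     # per-scale, per-location weight logits

    def forward(self, feats):
        """feats: list of n_scales (B, C_in, H, W) fused LiDAR+camera BEV maps (shared H, W)."""
        projected = [p(f) for p, f in zip(self.proj, feats)]               # each (B, C_out, H, W)
        logits = torch.stack([self.score(f) for f in projected], dim=0)    # (S, B, 1, H, W)
        weights = torch.softmax(logits, dim=0)                             # normalize across scales
        return (weights * torch.stack(projected, dim=0)).sum(dim=0)        # (B, C_out, H, W) BEV prompt
```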
5. Hyperparameters, Implementation Considerations, and Limitations
The efficacy of depth-guided prompting methods depends acutely on several architectural and algorithmic details:
- Number of clusters (K) in region proposal—too small a value risks undersegmentation, too large a value causes over-fragmentation (Yang et al., 5 Dec 2025).
- Seed selection strategy for interior point prompts (distance transform maxima)—sparse seeds limit coverage, dense seeds may introduce redundancy.
- Morphological and area thresholds to suppress noisy or trivial regions.
- Template bank fraction (Frac) and top-k selection (k) in template matching—performance saturates at low template coverage (10–20%) and moderate values of k.
- Prompt parameterization in 3D—confidence channels should be properly normalized, and prompt points initialized in alignment with input cloud statistics (Ai et al., 7 May 2025).
Documented limitations include the relative, non-metric nature of monocular depth, breakdown in scenes with specularities or rapid motion, low performance on rare or ambiguous regions, potential for over-fragmentation, and lack of temporal consistency (frame-level only) (Yang et al., 5 Dec 2025).
6. Impact, Extensions, and Future Directions
Depth-guided point prompts have driven significant advances toward annotation-efficient, modular, and scalable vision systems, exemplified by:
- Training-free segmentation (DepSeg), parameter-efficient 3D vision models (GAPrompt), and multi-modal, depth-aided detection (PromptDet).
- Quantitative gains in annotation-sparse regimes and multi-modal generalization.
- Downstream benefits in tasks such as 3D reconstruction (TSDF integration) and generalized robotic manipulation, where metrics such as grasp success on specular/transparent objects are improved by complete, denoised metric depth estimation (Lin et al., 18 Dec 2024).
Envisioned extensions include temporal prompt adaptation, probabilistic or confidence-based prompt selection, multi-modal prompt integration (SLAM, IMU), and adaptive template bank construction for per-class balancing or user-in-the-loop guidance (Yang et al., 5 Dec 2025, Ai et al., 7 May 2025, Lin et al., 18 Dec 2024, Guo et al., 17 Dec 2024). This suggests an increasing tendency to externalize geometric and other sensory priors as explicit prompts, both at inference and for efficient parameter or data utilization.