Re-Depth Anything: Robust Depth Estimation

Updated 26 December 2025
  • Re-Depth Anything is a paradigm that revisits monocular depth estimation by leveraging foundation models, prompt-based fusion, and self-supervised test-time refinement.
  • The approach integrates coarse-to-fine sensor prior fusion with uncertainty-aware techniques, achieving improved metric accuracy on diverse benchmarks.
  • Innovative methods such as diffusion-guided score distillation and multi-modal compositional pipelines extend its utility to panoramic, multi-view, and medical imaging applications.

Re-Depth Anything refers to a comprehensive paradigm for revisiting monocular depth estimation by leveraging foundation models, multi-modal priors, and self-supervised or guided refinement techniques to address domain shift, sparse/uncertain depth priors, and specialized downstream applications. This approach centers on augmenting or adapting high-capacity foundation models—such as Depth Anything and its derivatives—to new domains, input modalities, or improved accuracy regimes, particularly when ground-truth depth is scarce or incomplete. Innovations in this area include prompt-based fusion with LiDAR, test-time diffusion-guided refinement, coarse-to-fine integration of sparse and dense priors, panoramic and multi-view generalization, and robust compositional pipelines for downstream vision and language tasks.

1. Foundation Model Adaptation and Prompting

Depth foundation models such as Depth Anything and its successors achieve robust zero-shot monocular depth estimation via large-scale training on synthetic and pseudo-labeled real images, but exhibit limitations in metric scaling and adaptation to novel domains. The introduction of prompt-based methods, exemplified by Prompt Depth Anything, addresses these gaps by integrating sparse, metrically accurate cues (e.g., low-cost LiDAR) as multi-scale prompts within the decoder of a depth foundation model (Lin et al., 18 Dec 2024).

Key elements include ViT–DPT architectures for initial dense geometric inference, injection of sparse LiDAR features via per-scale prompt fusion blocks, and multi-scale loss functions. The fusion block employs upsampling, shallow CNN feature extraction, and zero-initialized 1×1 convolutions for stable fine-tuning. This paradigm achieves accurate metric depth at up to 4K resolution and demonstrates significant improvements on benchmarks such as ARKitScenes and ScanNet++.
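
The fusion block below is a minimal sketch of this idea, assuming illustrative layer names and channel sizes rather than the released Prompt Depth Anything architecture: the sparse metric prompt is resized to the decoder scale, encoded by a shallow CNN, and injected through a zero-initialized 1×1 convolution so fine-tuning starts from the pretrained decoder's behavior.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptFusionBlock(nn.Module):
    """Sketch of a per-scale prompt fusion block (illustrative, not the released code)."""
    def __init__(self, feat_channels: int, hidden: int = 32):
        super().__init__()
        # Shallow CNN that lifts the 1-channel sparse depth prompt to decoder features.
        self.prompt_encoder = nn.Sequential(
            nn.Conv2d(1, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, feat_channels, kernel_size=3, padding=1),
        )
        # Zero-initialized 1x1 projection: the block is a no-op at the start of fine-tuning.
        self.zero_proj = nn.Conv2d(feat_channels, feat_channels, kernel_size=1)
        nn.init.zeros_(self.zero_proj.weight)
        nn.init.zeros_(self.zero_proj.bias)

    def forward(self, decoder_feat: torch.Tensor, sparse_depth: torch.Tensor) -> torch.Tensor:
        # Resize the sparse metric prompt to the current decoder scale.
        prompt = F.interpolate(sparse_depth, size=decoder_feat.shape[-2:],
                               mode="bilinear", align_corners=False)
        prompt_feat = self.prompt_encoder(prompt)
        return decoder_feat + self.zero_proj(prompt_feat)
```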

2. Coarse-to-Fine Prior Fusion and Uncertainty-Aware Enhancement

Real-world deployment frequently offers two complementary but incomplete depth cues: sparse, noisy metric priors from sensors (e.g. LiDAR, multi-view structure-from-motion) and dense, scale-ambiguous predictions from monocular models. Prior Depth Anything fuses these sources in a two-stage pipeline: deterministic metric alignment with distance-aware weighting to propagate the sparse metric prior while preserving structure, followed by a conditioned monocular depth head that refines and harmonizes the fused prior and prediction (Wang et al., 15 May 2025). Pixel-wise least-squares scaling and shift, weighted by inverse distance to valid prior points, ensure geometric consistency at boundaries.
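
A minimal sketch of the distance-weighted scale-and-shift fit, evaluated here at a single query pixel for clarity (the actual pipeline applies the alignment densely and far more efficiently); the function name and looping strategy are illustrative assumptions.

```python
import numpy as np

def local_scale_shift(rel_depth: np.ndarray, prior_yx: np.ndarray,
                      prior_depth: np.ndarray, query_yx: tuple, eps: float = 1e-6):
    """Fit metric ≈ s * rel + t around one query pixel, weighting every sparse
    prior sample by the inverse of its distance to that pixel.

    rel_depth:   HxW relative depth from the monocular model
    prior_yx:    Nx2 integer coordinates of valid sparse prior samples
    prior_depth: N metric depths at those coordinates
    query_yx:    (y, x) pixel at which the local alignment is evaluated
    """
    qy, qx = query_yx
    d = np.linalg.norm(prior_yx - np.array([qy, qx]), axis=1)
    w = 1.0 / (d + eps)                                  # distance-aware weights

    x = rel_depth[prior_yx[:, 0], prior_yx[:, 1]]        # relative depth at prior locations
    sw = np.sqrt(w)
    A = np.stack([x, np.ones_like(x)], axis=1) * sw[:, None]
    b = prior_depth * sw
    s, t = np.linalg.lstsq(A, b, rcond=None)[0]          # weighted least squares
    return s * rel_depth[qy, qx] + t                     # metrically aligned depth at the query pixel
```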

Perfecting Depth further advances this concept by using a stochastic diffusion model to estimate per-pixel uncertainty during sensor depth enhancement, followed by a deterministic, uncertainty-guided spatial propagation network that selectively refines unreliable regions (Jun et al., 5 Jun 2025). This combination of stochastic uncertainty modeling and masked refinement achieves robust accuracy and reliability improvements across standard inpainting, completion, and enhancement tasks.
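
The interplay of the two stages can be sketched as follows, with `sample_depth` and `refine_net` standing in for the stochastic diffusion sampler and the deterministic propagation network, and with the sample count and threshold chosen arbitrarily for illustration.

```python
import torch

def uncertainty_gated_refine(sensor_depth, sample_depth, refine_net,
                             n_samples: int = 8, tau: float = 0.05):
    # Stochastic stage: draw several enhanced-depth samples and measure their disagreement.
    samples = torch.stack([sample_depth(sensor_depth) for _ in range(n_samples)], dim=0)
    mean = samples.mean(dim=0)
    uncertainty = samples.std(dim=0)      # per-pixel uncertainty proxy

    # Deterministic stage: refine only the unreliable regions, keep trusted sensor depth elsewhere.
    unreliable = uncertainty > tau
    refined = refine_net(mean)
    return torch.where(unreliable, refined, sensor_depth)
```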

3. Test-Time Depth Refinement and Self-Supervision

To address the degradation of foundation depth model predictions on out-of-distribution scenes or specific error patterns, the Re-Depth Anything framework introduces test-time optimization that leverages self-supervised geometric cues (Bhattarai et al., 19 Dec 2025). The core innovation is generative score distillation using 2D diffusion priors on re-lighted renderings. The pipeline begins with a foundation model prediction, recovers surface normals via unprojection, synthesizes photorealistic re-lightings across randomized illumination using Blinn-Phong shading, and evaluates the plausibility of these synthetic images with a pretrained diffusion model (e.g., Stable Diffusion).
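
The geometry-to-image part of this pipeline can be sketched as below, assuming pinhole intrinsics, a finite-difference normal estimate, and illustrative Blinn-Phong material constants; none of these choices are taken from the released method.

```python
import torch
import torch.nn.functional as F

def depth_to_points(depth, fx, fy, cx, cy):
    # Unproject an HxW depth map into a camera-space point map with pinhole intrinsics.
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth
    x = (u - cx) / fx * z
    y = (v - cy) / fy * z
    return torch.stack([x, y, z], dim=-1)                # H x W x 3

def normals_from_points(points):
    dx = points[:, 1:, :] - points[:, :-1, :]            # horizontal finite differences
    dy = points[1:, :, :] - points[:-1, :, :]            # vertical finite differences
    n = torch.cross(dx[:-1], dy[:, :-1], dim=-1)         # (H-1) x (W-1) x 3
    return F.normalize(n, dim=-1)

def blinn_phong(normals, light_dir, view_dir, ka=0.2, kd=0.7, ks=0.1, shininess=32.0):
    l = F.normalize(light_dir, dim=-1)
    v = F.normalize(view_dir, dim=-1)
    h = F.normalize(l + v, dim=-1)                       # half vector
    diffuse = (normals * l).sum(-1).clamp(min=0)
    specular = (normals * h).sum(-1).clamp(min=0) ** shininess
    return ka + kd * diffuse + ks * specular             # grayscale re-lit shading
```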

Score Distillation Sampling (SDS) gradients are back-propagated only into decoder and intermediate embeddings of the foundation model (with the encoder frozen), yielding consistent improvements in fine detail, large-scale geometry, and domain robustness across multiple benchmarks. This diffusion-guided, label-free refinement establishes a new test-time self-supervised correction regime for monocular depth estimation.
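
A hedged skeleton of this parameter scoping, with `depth_model`, `render_relit`, and `sds_loss` as placeholders for the foundation model, the differentiable re-lighting renderer, and the diffusion-based score-distillation loss:

```python
import torch

def test_time_refine(depth_model, image, render_relit, sds_loss,
                     steps: int = 100, lr: float = 1e-5):
    # Freeze the encoder; expose only decoder parameters to the optimizer.
    for p in depth_model.encoder.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(depth_model.decoder.parameters(), lr=lr)

    for _ in range(steps):
        depth = depth_model(image)              # frozen encoder + trainable decoder
        relit = render_relit(depth, image)      # differentiable re-lighting under random illumination
        loss = sds_loss(relit)                  # score-distillation signal from a 2D diffusion prior
        opt.zero_grad()
        loss.backward()
        opt.step()
    return depth_model(image).detach()
```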

4. Generalization Across Views, Modalities, and Data Types

Re-Depth Anything approaches incorporate architectural designs and data curation strategies that facilitate generalization to challenging cases, including panoramic images, multi-view inputs, and diverse sensor modalities. DA² (Depth Anything in Any Direction) combines curated large-scale panoramic data (generated via perspective-to-equirectangular projection and out-painting) with the SphereViT backbone, explicitly encoding spherical geometry through cross-attention with learnable spherical positional embeddings (Li et al., 30 Sep 2025). This results in state-of-the-art, efficient, and distortion-aware zero-shot panoramic depth estimation, outperforming even in-domain methods.
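
One way to realize such spherical conditioning, sketched under the assumption of a simple learnable projection of per-pixel (latitude, longitude) angles and standard multi-head cross-attention (not the released SphereViT implementation):

```python
import torch
import torch.nn as nn

class SphericalCrossAttention(nn.Module):
    """Illustrative cross-attention between image tokens and spherical positional embeddings."""
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.angle_proj = nn.Linear(2, dim)   # learnable map from (lat, lon) to embeddings
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # tokens: B x (H*W) x dim feature tokens from an equirectangular image.
        lat = torch.linspace(-torch.pi / 2, torch.pi / 2, H, device=tokens.device)
        lon = torch.linspace(-torch.pi, torch.pi, W, device=tokens.device)
        grid = torch.stack(torch.meshgrid(lat, lon, indexing="ij"), dim=-1)     # H x W x 2
        sphere = self.angle_proj(grid.reshape(1, H * W, 2)).expand(tokens.size(0), -1, -1)
        # Image tokens attend to the spherical positional embeddings.
        out, _ = self.attn(query=tokens, key=sphere, value=sphere)
        return tokens + out
```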

Depth Anything 3 generalizes single- or multi-view geometry prediction by operating on arbitrary numbers of unposed images, producing spatially consistent depth and ray parameterizations, and enabling multi-view fusion and pose estimation using a plain transformer backbone and a dual-branch DPT prediction head (Lin et al., 13 Nov 2025).
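
An illustrative tensor-shape sketch of such a dual-branch prediction head; the channel counts (1 for depth, 6 for an origin-plus-direction ray parameterization) are assumptions for exposition, not the published design.

```python
import torch
import torch.nn as nn

class DualBranchHead(nn.Module):
    """Sketch of a dual-branch head over features from N unposed views."""
    def __init__(self, feat_dim: int):
        super().__init__()
        self.depth_branch = nn.Conv2d(feat_dim, 1, kernel_size=3, padding=1)
        self.ray_branch = nn.Conv2d(feat_dim, 6, kernel_size=3, padding=1)

    def forward(self, feats: torch.Tensor):
        # feats: (B*N) x C x h x w fused multi-view features from the transformer backbone.
        depth = self.depth_branch(feats)   # per-view depth map
        rays = self.ray_branch(feats)      # per-view ray parameterization (origin, direction)
        return depth, rays
```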

5. Compositional and Domain-Adapted Usage

Re-Depth Anything methodologies support integration with downstream reasoning and segmentation pipelines. For instance, combining DAM-derived depth with SAM segmentation and GPT-4V for vision-language compositional reasoning (e.g., instance-aware depth for robust VQA) demonstrates the utility of instance-level fusion in enhancing multimodal tasks (Huo et al., 7 Jun 2024). In medical imaging, adaptation techniques such as LoRA with random vectors (RVLoRA) and depthwise separable convolution residuals enable parameter-efficient domain transfer of foundation models to endoscopic imagery, elevating both accuracy and inference speed (Li et al., 12 Sep 2024). Similarly, polyp segmentation benefits from incorporating DAM depth priors into multi-scale, global-local segmentation networks, yielding boosts in mean Dice and IoU across varied datasets (Zheng et al., 3 Feb 2024).
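
As a concrete illustration of instance-level fusion, the sketch below aggregates a depth map over segmentation masks into per-instance summaries that a vision-language model can reason over; the mask format and the choice of the median statistic are assumptions.

```python
import numpy as np

def instance_depths(depth: np.ndarray, masks: list) -> list:
    """depth: HxW depth map; masks: list of HxW boolean instance masks."""
    summaries = []
    for i, m in enumerate(masks):
        vals = depth[m]
        if vals.size == 0:
            continue
        summaries.append({
            "instance": i,
            "median_depth": float(np.median(vals)),   # robust to segmentation bleed at boundaries
            "area_px": int(m.sum()),
        })
    # Near-to-far ordering is a convenient prompt format for downstream VQA.
    return sorted(summaries, key=lambda s: s["median_depth"])
```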

6. Benchmarks, Evaluation, and Limitations

State-of-the-art Re-Depth Anything pipelines are evaluated on a wide spectrum of publicly available and curated benchmarks. Key metrics include AbsRel, RMSE, threshold accuracy (δ), and TSDF-based 3D reconstruction F-scores. Ablation studies consistently underscore the superiority of prompt-/prior-based fusion, targeted refinement, and zero-shot generalization over traditional monocular or even post-hoc alignment methods. Reported limitations include performance degradation with increasing sensor noise or range, insufficient handling of temporal instability or multi-modal prompts, and open questions in fully recovering absolute metric scale in highly unconstrained domains (Lin et al., 18 Dec 2024, Wang et al., 15 May 2025).
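
For reference, a minimal implementation of the scalar metrics listed above, using a valid-pixel mask and the conventional thresholds 1.25, 1.25², and 1.25³:

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray, valid: np.ndarray) -> dict:
    """pred, gt: HxW depth maps; valid: HxW boolean mask of evaluated pixels."""
    p, g = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(p - g) / g)              # absolute relative error
    rmse = np.sqrt(np.mean((p - g) ** 2))             # root mean squared error
    ratio = np.maximum(p / g, g / p)                  # symmetric depth ratio
    return {
        "AbsRel": abs_rel,
        "RMSE": rmse,
        "delta1": np.mean(ratio < 1.25),
        "delta2": np.mean(ratio < 1.25 ** 2),
        "delta3": np.mean(ratio < 1.25 ** 3),
    }
```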

7. Future Directions in the Re-Depth Paradigm

Speculative extensions identified in the literature include meta-prompt layers that dynamically re-weight fusion scales based on prompt density, adversarial or uncertainty-aware prompt dropout for robustness to missing data, and the fusion of temporally consistent priors for video stabilization (Lin et al., 18 Dec 2024). Integration with semantic scene graphs, combined semantic and geometric foundation models, and more efficient attention mechanisms is also under consideration. The overall paradigm is positioned as modular and self-improving: as monocular, multi-view, and sensor-conditioned depth models advance, their integration points in the Re-Depth framework preserve favorable accuracy/efficiency trade-offs and broad application reach.
