
Perfecting Depth: Uncertainty-Aware Enhancement of Metric Depth (2506.04612v1)

Published 5 Jun 2025 in cs.CV

Abstract: We propose a novel two-stage framework for sensor depth enhancement, called Perfecting Depth. This framework leverages the stochastic nature of diffusion models to automatically detect unreliable depth regions while preserving geometric cues. In the first stage (stochastic estimation), the method identifies unreliable measurements and infers geometric structure by leveraging a training-inference domain gap. In the second stage (deterministic refinement), it enforces structural consistency and pixel-level accuracy using the uncertainty map derived from the first stage. By combining stochastic uncertainty modeling with deterministic refinement, our method yields dense, artifact-free depth maps with improved reliability. Experimental results demonstrate its effectiveness across diverse real-world scenarios. Furthermore, theoretical analysis, various experiments, and qualitative visualizations validate its robustness and scalability. Our framework sets a new baseline for sensor depth enhancement, with potential applications in autonomous driving, robotics, and immersive technologies.

Summary

  • The paper introduces a two-stage framework that first uses stochastic diffusion for uncertainty estimation and then applies deterministic refinement for metric depth enhancement.
  • It employs pixel-wise variance from multiple diffusion samples to identify unreliable regions and correct noise and artifacts in sensor depth maps.
  • Evaluations on synthetic and real-world datasets demonstrate significant improvements in depth accuracy and robustness for diverse 3D applications.

Depth maps, which provide pixel-wise distances, are fundamental for many 3D applications in areas like autonomous driving, robotics, and augmented reality. While depth sensors are increasingly common, their outputs often suffer from noise, missing data, and artifacts due to hardware limitations and environmental factors. Existing methods for depth enhancement, such as depth completion or super-resolution, often assume clean input or may amplify these imperfections. The paper "Perfecting Depth: Uncertainty-Aware Enhancement of Metric Depth" (2506.04612) proposes a novel two-stage framework to address these challenges by enhancing sensor depth maps to be dense and artifact-free.

The proposed framework, called Perfecting Depth, avoids the need for manually defining priors for sensor artifacts. Its core innovation lies in leveraging the stochastic nature of diffusion models to automatically identify unreliable depth regions and estimate geometric structure. The framework consists of two main stages: stochastic estimation and deterministic refinement.

1. Stochastic Estimation Stage

This stage uses a diffusion probabilistic model trained on clean, synthetic ground-truth depth data from the Hypersim dataset. During training, the model learns the conditional distribution of clean depth given an RGB image and a masked version of the clean depth. Scale-shift normalization is applied to depth values, mapping them to $[-1, 1]$ to prevent the model from overfitting to a specific metric range.
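
A minimal sketch of this normalization step, assuming a simple per-image min/max affine mapping over valid pixels (the paper's exact normalization statistics are not detailed here):

```python
import numpy as np

def normalize_depth(depth: np.ndarray, valid_mask: np.ndarray):
    """Scale-shift normalize valid depth values to [-1, 1].

    Assumes a per-image min/max affine mapping; the paper may use more
    robust statistics (e.g., percentiles), which are not specified here.
    """
    d_valid = depth[valid_mask]
    d_min, d_max = d_valid.min(), d_valid.max()
    scale = 2.0 / max(d_max - d_min, 1e-6)  # guard against constant depth
    normalized = (depth - d_min) * scale - 1.0
    return normalized, (d_min, scale)
```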

The key idea for uncertainty estimation is a deliberate training-inference domain gap. While trained on clean data, the model performs inference on raw, potentially noisy and incomplete, real-world sensor depth maps. The diffusion process starts from different random noise realizations and iteratively denoises to generate multiple possible depth reconstructions ($\tilde{D}_i$).

For a given pixel $(x, y)$, the variability across these $N$ samples, quantified by the pixel-wise variance $\hat{\sigma}^2(x,y)$, serves as an indicator of its reliability. A high variance suggests that the input value at that pixel was inconsistent with the clean data distribution the model learned, thus identifying it as potentially unreliable. The pixel-wise mean $\hat{\mu}(x,y)$ of these samples captures the estimated geometric structure. In practice, $N=10$ samples are used, as analysis shows diminishing returns beyond this number at the cost of increased inference time.
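
A sketch of this sampling-and-aggregation step; the `diffusion_model.sample(...)` interface is a hypothetical stand-in for the paper's conditional sampler:

```python
import torch

@torch.no_grad()
def stochastic_estimation(diffusion_model, rgb, raw_depth, valid_mask, n_samples=10):
    """Sample N depth reconstructions from distinct noise realizations and
    reduce them to a per-pixel mean (structure) and variance (uncertainty)."""
    samples = []
    for _ in range(n_samples):
        noise = torch.randn_like(raw_depth)  # fresh noise realization per sample
        samples.append(diffusion_model.sample(rgb, raw_depth, valid_mask, noise))
    stack = torch.stack(samples, dim=0)      # (N, H, W)
    mu = stack.mean(dim=0)                   # estimated geometric structure
    var = stack.var(dim=0, unbiased=False)   # per-pixel reliability signal
    return mu, var
```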

The relationship between variance and reliability can be understood through a Bayesian perspective. The learned diffusion model approximates the posterior distribution $p(D_{\text{true}} \mid \text{RGB}, D_{\text{cond}})$, where $D_{\text{cond}}$ is $D_{\text{true}}$ with a random mask during training, but $D_{\text{raw}}$ with its valid mask during inference. Via Bayes' theorem, this posterior is proportional to $p(D_{\text{cond}} \mid D_{\text{true}})\, p(D_{\text{true}} \mid \text{RGB}, \text{mask})$. The likelihood term $p(D_{\text{cond}} \mid D_{\text{true}})$ is strong and peaked when $D_{\text{raw}}$ is reliable (aligning with clean data), leading to low posterior variance. If $D_{\text{raw}}$ is unreliable, this likelihood is weak, and the prior $p(D_{\text{true}} \mid \text{RGB}, \text{mask})$ dominates. If this prior is not sharp enough to resolve the ambiguity, the posterior is broad, resulting in high variance $\hat{\sigma}^2$.
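
Written out, the factorization behind this argument is:

```latex
% Bayes' theorem applied to the conditional diffusion posterior:
p(D_{\text{true}} \mid \text{RGB}, D_{\text{cond}})
  \;\propto\;
  \underbrace{p(D_{\text{cond}} \mid D_{\text{true}})}_{\text{likelihood: peaked if } D_{\text{raw}} \text{ is reliable}}
  \cdot
  \underbrace{p(D_{\text{true}} \mid \text{RGB}, \text{mask})}_{\text{image-conditioned prior}}
```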

2. Deterministic Refinement Stage

The stochastic estimation provides pixel-wise reliability (via $\hat{\sigma}^2$) and a global geometric prior (via $\hat{\mu}$). However, the diffusion process, being global and stochastic, is less suited for precise pixel-level correction, and the metric scale is lost due to normalization. The second stage uses a deterministic refinement network to recover the metric range and refine depth pixel-wise.

First, pixels identified as unreliable by the stochastic stage are filtered out using a simple threshold $\epsilon$ on the variance $\hat{\sigma}^2$. A certainty mask $\mathbf{M}_{\hat{\sigma}^2}$ is created, where pixels with $\hat{\sigma}^2 > \epsilon$ are masked out (value 0) and the rest are kept (value 1). A small $\epsilon = 0.01$ is found to be effective. Morphological opening is applied to $\mathbf{M}_{\hat{\sigma}^2}$ to smooth small artifacts.
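
A sketch of the masking step, using the reported $\epsilon = 0.01$; the size of the morphological structuring element is an assumption:

```python
import numpy as np
from scipy import ndimage

def certainty_mask(variance: np.ndarray, eps: float = 0.01, opening_size: int = 3):
    """Threshold the per-pixel variance and clean the result with
    morphological opening (removes small isolated 'reliable' speckles)."""
    mask = variance <= eps  # True (1) = reliable, False (0) = unreliable
    structure = np.ones((opening_size, opening_size), dtype=bool)
    return ndimage.binary_opening(mask, structure=structure)
```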

A reliable depth map $\mathbf{D}_{\text{rel}}$ is formed by keeping only the reliable values from the input conditioned depth $\mathbf{D}_{\text{cond}}$ (which is $D_{\text{raw}}$ with its original valid mask). The scaled mean depth $\mathbf{D}_{\hat{\mu}}$ is obtained by applying a scale $a$ and shift $b$ to $\hat{\mu}$. These parameters are found by minimizing the squared error between the reliable points in $\mathbf{D}_{\text{rel}}$ and $\hat{\mu}$ using a least-squares fit. This step restores the metric scale based on the trustworthy sensor measurements.
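
The least-squares fit has a closed form; a minimal sketch (variable names are illustrative):

```python
import numpy as np

def fit_scale_shift(d_rel: np.ndarray, mu: np.ndarray, reliable: np.ndarray):
    """Fit a, b minimizing ||a * mu + b - d_rel||^2 over reliable pixels,
    then apply them to the full mean map to restore metric scale."""
    x, y = mu[reliable].ravel(), d_rel[reliable].ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)  # design matrix [mu, 1]
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a * mu + b, (a, b)
```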

A guidance feature is then extracted using a feature network (implemented with a MaxViT encoder), taking RGB, $\mathbf{D}_{\text{rel}}$, $\mathbf{D}_{\hat{\mu}}$, and $\hat{\sigma}^2$ as input. These inputs are complementary: $\mathbf{D}_{\text{rel}}$ provides accurate sparse anchors, $\mathbf{D}_{\hat{\mu}}$ provides a dense, globally consistent structure with metric scale, and $\hat{\sigma}^2$ indicates where refinement is needed.

This guidance feature is fed into a masked spatial propagation network (MSPN, similar to the one in (Melev et al., 4 Jun 2025)), which iteratively refines the depth map. The process starts from $\mathbf{D}_{\text{rel}}$ and its corresponding mask. The MSPN propagates reliable depth information and refines uncertain regions under the guidance feature, ensuring structural consistency and pixel-level accuracy. The implementation uses six iterative steps with two MSPN layers.
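
One propagation step might look like the sketch below; the affinity parameterization is an assumption, since the summary does not specify the exact MSPN layer:

```python
import torch
import torch.nn.functional as F

def masked_propagation_step(depth, mask, affinity):
    """One masked spatial propagation step over 3x3 neighborhoods.

    depth, mask: (B, 1, H, W); affinity: (B, 9, H, W) weights assumed to be
    predicted from the guidance feature. Unreliable neighbors are zeroed out
    so that only trusted depth propagates into uncertain regions.
    """
    b, _, h, w = depth.shape
    d_nb = F.unfold(depth, 3, padding=1).view(b, 9, h, w)   # neighbor depths
    m_nb = F.unfold(mask, 3, padding=1).view(b, 9, h, w)    # neighbor validity
    wgt = affinity * m_nb
    wgt = wgt / wgt.sum(dim=1, keepdim=True).clamp_min(1e-6)
    new_depth = (wgt * d_nb).sum(dim=1, keepdim=True)       # weighted average
    new_mask = (m_nb.sum(dim=1, keepdim=True) > 0).float()  # coverage grows
    return new_depth, new_mask
```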

The deterministic refinement network is trained on synthetic data. Its inputs are derived from synthetic ground truth processed through the trained diffusion model to obtain $\hat{\mu}$ and $\hat{\sigma}^2$. The network is optimized against the synthetic ground truth using L1 and L2 losses. To improve generalization to the varying metric ranges of real-world data, random scale and shift augmentations are applied to the depth data during this stage's training.
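
A sketch of the objective and augmentation; the equal loss weighting and the augmentation ranges are assumptions (the summary only states that L1, L2, and random scale/shift are used):

```python
import torch

def refinement_loss(pred, gt, valid):
    """Combined L1 + L2 loss over valid ground-truth pixels
    (equal weighting assumed)."""
    diff = (pred - gt)[valid]
    return diff.abs().mean() + (diff ** 2).mean()

def random_scale_shift(depth, scale=(0.5, 2.0), shift=(-1.0, 1.0)):
    """Random affine augmentation of metric depth; ranges are placeholders."""
    a = torch.empty(1).uniform_(*scale).item()
    b = torch.empty(1).uniform_(*shift).item()
    return depth * a + b
```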

Implementation and Evaluation

The framework is trained on 54K RGB-D images from the Hypersim synthetic dataset. For evaluation, it is tested on real-world indoor datasets: DIODE-Indoor (dense, noisy), NYUv2 (clean with missing regions), and ScanNet (clean with missing regions). Performance is measured using RMSE, $\delta_k$ accuracy, and Kendall's $\tau$ (for relative depth tasks).
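
For reference, the three metrics can be computed as in the sketch below (the Kendall's $\tau$ subsampling size is an implementation choice, not from the paper):

```python
import numpy as np
from scipy import stats

def depth_metrics(pred, gt, valid, k=1, tau_samples=5000):
    """RMSE, delta_k accuracy (threshold 1.25**k), and Kendall's tau
    (on a random subsample to keep the rank correlation affordable)."""
    p, g = pred[valid], gt[valid]
    rmse = np.sqrt(np.mean((p - g) ** 2))
    delta = np.mean(np.maximum(p / g, g / p) < 1.25 ** k)
    idx = np.random.choice(p.size, size=min(p.size, tau_samples), replace=False)
    tau, _ = stats.kendalltau(p[idx], g[idx])
    return rmse, delta, tau
```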

Experiments demonstrate the framework's effectiveness:

  • Sensor Depth Enhancement: Fine-tuning relative depth estimators (Depth Anything V1/V2) on DIODE-Indoor data enhanced by Perfecting Depth consistently improves performance on both the noisy DIODE test set and the cleaner NYUv2 zero-shot relative depth test set compared to using raw data or diffusion-only enhancement. This shows the method effectively reduces noise and artifacts while preserving details important for downstream tasks.
  • Noisy Depth Completion: The framework is evaluated by adding controlled Gaussian noise to sparse ground-truth depth maps and refining the resulting noisy inputs with different methods (see the sketch after this list). Perfecting Depth significantly improves the performance of standard depth completion methods (CFormer (Kumar et al., 2020), MSPN (Melev et al., 4 Jun 2025)) under varying noise levels, highlighting its robustness to real-world sensor-like artifacts.
  • Depth Inpainting: When tested on synthetic depth maps with large masked regions (simulating missing data), Perfecting Depth outperforms state-of-the-art monocular relative depth estimators (Depth Anything V1/V2 (Yang et al., 13 Jun 2024, Du et al., 18 Jan 2024), Marigold (Fan et al., 18 Mar 2024)) across most hole-to-image ratios. It effectively utilizes available depth priors to reconstruct missing areas with better geometric detail and consistency.
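
The noise-injection setup referenced above can be sketched as follows; the noise level `sigma` is a placeholder for the levels swept in the paper:

```python
import numpy as np

def add_gaussian_noise(sparse_depth, valid_mask, sigma=0.05, rng=None):
    """Perturb sparse ground-truth depth with zero-mean Gaussian noise,
    leaving invalid (empty) pixels untouched."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = sparse_depth.copy()
    noisy[valid_mask] += rng.normal(0.0, sigma, size=int(valid_mask.sum()))
    return noisy
```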

Limitations

While effective, the framework has limitations. The normalization process requires handling infinite-depth regions (like the sky) with a separate mask to avoid compressing valid depth ranges. The paper suggests that incorporating techniques from adaptive ordinal regression (like AdaBins (Mohammed et al., 2021)) could further improve accuracy and texture details.

In summary, Perfecting Depth provides a practical, data-driven approach for enhancing raw sensor depth. By using diffusion models to identify uncertainty via a training-inference gap and then applying a deterministic refinement network guided by this uncertainty and derived geometric cues, it effectively removes noise, fills missing areas, and improves metric accuracy. Trained on synthetic data, it shows strong generalization to diverse real-world sensor data, making it a valuable tool for applications requiring high-quality depth maps.
