- The paper introduces a two-stage framework that first uses stochastic diffusion for uncertainty estimation and then applies deterministic refinement for metric depth enhancement.
- It employs pixel-wise variance from multiple diffusion samples to identify unreliable regions and correct noise and artifacts in sensor depth maps.
- Evaluations on synthetic and real-world datasets demonstrate significant improvements in depth accuracy and robustness for diverse 3D applications.
Depth maps, which provide pixel-wise distances, are fundamental for many 3D applications in areas like autonomous driving, robotics, and augmented reality. While depth sensors are increasingly common, their outputs often suffer from noise, missing data, and artifacts due to hardware limitations and environmental factors. Existing methods for depth enhancement, such as depth completion or super-resolution, often assume clean input or may amplify these imperfections. The paper "Perfecting Depth: Uncertainty-Aware Enhancement of Metric Depth" (2506.04612) proposes a novel two-stage framework to address these challenges by enhancing sensor depth maps to be dense and artifact-free.
The proposed framework, called Perfecting Depth, avoids the need for manually defining priors for sensor artifacts. Its core innovation lies in leveraging the stochastic nature of diffusion models to automatically identify unreliable depth regions and estimate geometric structure. The framework consists of two main stages: stochastic estimation and deterministic refinement.
1. Stochastic Estimation Stage
This stage uses a diffusion probabilistic model trained on clean, synthetic ground-truth depth data from the Hypersim dataset. During training, the model learns the conditional distribution of clean depth given an RGB image and a masked version of the clean depth. Scale-shift normalization is applied to depth values, mapping them to $[-1, 1]$ to prevent the model from overfitting to a specific metric range.
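A minimal sketch of this scale-shift normalization, assuming a simple per-image affine mapping of valid depths onto $[-1, 1]$; the paper's exact normalization (e.g., robust percentiles) may differ, and all names below are illustrative.

```python
import numpy as np

def normalize_depth(depth, valid_mask):
    """Affinely map valid depth values to [-1, 1] (illustrative sketch).

    depth:      (H, W) metric depth map
    valid_mask: (H, W) boolean mask of valid (finite, measured) pixels
    Returns the normalized map plus the (scale, shift) needed to invert it.
    """
    d = depth[valid_mask]
    d_min, d_max = d.min(), d.max()
    scale = 2.0 / max(d_max - d_min, 1e-6)     # guard against a flat depth range
    shift = -1.0 - scale * d_min               # d_min -> -1, d_max -> +1
    normalized = np.where(valid_mask, scale * depth + shift, 0.0)
    return normalized, (scale, shift)
```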
The key idea for uncertainty estimation is a deliberate training-inference domain gap. Although trained only on clean data, the model performs inference on raw, real-world sensor depth maps that may be noisy and incomplete. The diffusion process starts from different random noise realizations and iteratively denoises to generate multiple possible depth reconstructions $\tilde{D}_i$.
For a given pixel $(x, y)$, the variability across these $N$ samples, quantified by the pixel-wise variance $\sigma^2(x, y)$, serves as an indicator of its reliability. A high variance suggests that the input value at that pixel was inconsistent with the clean data distribution the model learned, thus identifying it as potentially unreliable. The pixel-wise mean $\hat{\mu}(x, y)$ of these samples captures the estimated geometric structure. In practice, $N = 10$ samples are used, as analysis shows diminishing returns beyond this number at the cost of increased inference time.
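The uncertainty estimate itself reduces to per-pixel statistics over the $N$ sampled reconstructions. A hedged sketch, assuming a `sample_depth` callable that runs one reverse-diffusion pass conditioned on the RGB image and raw sensor depth (the name and signature are placeholders, not the paper's API):

```python
import numpy as np

def stochastic_estimation(sample_depth, rgb, raw_depth, valid_mask, n_samples=10):
    """Run the diffusion model N times and collect pixel-wise mean and variance.

    sample_depth: callable returning one (H, W) depth reconstruction from a
                  fresh noise realization (placeholder for the trained model).
    """
    samples = np.stack(
        [sample_depth(rgb, raw_depth, valid_mask) for _ in range(n_samples)],
        axis=0,
    )                                   # (N, H, W)
    mu_hat = samples.mean(axis=0)       # estimated geometric structure
    sigma2 = samples.var(axis=0)        # per-pixel reliability indicator
    return mu_hat, sigma2
```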
The relationship between variance and reliability can be understood through a Bayesian perspective. The learned diffusion model approximates the posterior distribution $p(D_{\text{true}} \mid \text{RGB}, D_{\text{cond}})$, where $D_{\text{cond}}$ is $D_{\text{true}}$ with a random mask during training, but $D_{\text{raw}}$ with its valid mask during inference. Via Bayes' theorem, this posterior is proportional to $p(D_{\text{cond}} \mid D_{\text{true}})\, p(D_{\text{true}} \mid \text{RGB}, \text{mask})$. The likelihood term $p(D_{\text{cond}} \mid D_{\text{true}})$ is strong and peaked when $D_{\text{raw}}$ is reliable (aligning with clean data), leading to low posterior variance. If $D_{\text{raw}}$ is unreliable, this likelihood is weak and the prior $p(D_{\text{true}} \mid \text{RGB}, \text{mask})$ dominates. If this prior is not sharp enough to resolve the ambiguity, the posterior is broad, resulting in high variance $\sigma^2$.
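In display form, the factorization described above reads

$$
p(D_{\text{true}} \mid \text{RGB}, D_{\text{cond}}) \;\propto\; \underbrace{p(D_{\text{cond}} \mid D_{\text{true}})}_{\text{likelihood of the conditioning depth}} \;\; \underbrace{p(D_{\text{true}} \mid \text{RGB}, \text{mask})}_{\text{image-conditioned prior}},
$$

so a flat likelihood (an unreliable $D_{\text{raw}}$) leaves the posterior roughly as broad as the prior, which is exactly the spread the sample variance $\sigma^2$ detects.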
2. Deterministic Refinement Stage
The stochastic estimation provides pixel-wise reliability (via $\sigma^2$) and a global geometric prior (via $\hat{\mu}$). However, the diffusion process, being global and stochastic, is less suited for precise pixel-level correction, and the metric scale is lost due to normalization. The second stage uses a deterministic refinement network to recover the metric range and refine depth pixel-wise.
First, pixels identified as unreliable by the stochastic stage are filtered out using a simple threshold $\epsilon$ on the variance $\sigma^2$. A certainty mask $M_{\sigma^2}$ is created, where pixels with $\sigma^2 > \epsilon$ are masked out (value 0) and all others are kept (value 1). A small $\epsilon = 0.01$ is found to be effective. Morphological opening is applied to $M_{\sigma^2}$ to smooth out small artifacts.
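A sketch of this masking step using SciPy's `binary_opening`; the $\epsilon = 0.01$ threshold comes from the paper, while the structuring-element size is an illustrative assumption.

```python
import numpy as np
from scipy.ndimage import binary_opening

def certainty_mask(sigma2, eps=0.01, opening_size=3):
    """Keep pixels whose sample variance is below eps, then smooth the mask.

    sigma2: (H, W) pixel-wise variance from the stochastic stage.
    Returns a boolean mask: True = reliable, False = masked out.
    """
    mask = sigma2 <= eps                                   # 1 = reliable pixel
    structure = np.ones((opening_size, opening_size), dtype=bool)
    return binary_opening(mask, structure=structure)       # remove small speckles
```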
A reliable depth map $D_{\text{rel}}$ is formed by keeping only the reliable values from the input conditioned depth $D_{\text{cond}}$ (which is $D_{\text{raw}}$ with its original valid mask). The scaled mean depth $D_{\hat{\mu}}$ is obtained by applying a scale $a$ and shift $b$ to $\hat{\mu}$. These parameters $a$ and $b$ are found by minimizing the squared error between the reliable points in $D_{\text{rel}}$ and $\hat{\mu}$ using a least-squares fit. This step restores the metric scale based on the trustworthy sensor measurements.
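Recovering the metric scale thus reduces to a one-dimensional linear least-squares fit between $\hat{\mu}$ and the reliable sensor depths. A minimal sketch using `numpy.linalg.lstsq` (function and variable names are placeholders, not the paper's code):

```python
import numpy as np

def fit_scale_shift(mu_hat, d_rel, reliable_mask):
    """Find a, b minimizing || a * mu_hat + b - d_rel ||^2 over reliable pixels."""
    x = mu_hat[reliable_mask]                   # normalized mean depth
    y = d_rel[reliable_mask]                    # trusted metric sensor depth
    A = np.stack([x, np.ones_like(x)], axis=1)  # design matrix [x, 1]
    (a, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return a * mu_hat + b, (a, b)               # scaled mean depth D_mu_hat and params
```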
A guidance feature is then extracted using a feature network (implemented with MaxViT as encoder), taking RGB, $D_{\text{rel}}$, $D_{\hat{\mu}}$, and $\sigma^2$ as input. This combination provides complementary information: $D_{\text{rel}}$ provides accurate sparse anchors, $D_{\hat{\mu}}$ provides a dense, globally consistent structure with metric scale, and $\sigma^2$ indicates where refinement is needed.
This guidance feature is fed into a masked spatial propagation network (MSPN, similar to the one in (Melev et al., 4 Jun 2025)), which iteratively refines the depth map. The process starts with $D_{\text{rel}}$ and its corresponding mask. The MSPN propagates reliable depth information and refines uncertain regions guided by the feature, ensuring structural consistency and pixel-level accuracy. Six iterative steps with two MSPN layers are used in the implementation.
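The following is a deliberately simplified sketch of the masked-propagation idea, not the paper's MSPN architecture: each step blends every uncertain pixel with a guidance-weighted average of its 3x3 neighbourhood while reliable sensor pixels stay anchored. All names, tensor shapes, and the source of the affinity weights are hypothetical.

```python
import torch
import torch.nn.functional as F

def masked_propagation_step(depth, mask, affinity):
    """One hypothetical propagation step.

    depth:    (B, 1, H, W) current depth estimate
    mask:     (B, 1, H, W) 1 = reliable anchor, 0 = to be refined
    affinity: (B, 9, H, W) guidance-derived weights over a 3x3 neighbourhood
    """
    # Gather the 3x3 neighbourhood of every pixel.
    neighbours = F.unfold(depth, kernel_size=3, padding=1)        # (B, 9, H*W)
    neighbours = neighbours.view(affinity.shape)                  # (B, 9, H, W)
    weights = torch.softmax(affinity, dim=1)                      # normalize per pixel
    propagated = (weights * neighbours).sum(dim=1, keepdim=True)  # weighted blend
    # Keep reliable sensor anchors fixed; refine only uncertain pixels.
    return mask * depth + (1.0 - mask) * propagated

# Six refinement iterations, matching the paper's reported configuration:
# for _ in range(6):
#     depth = masked_propagation_step(depth, mask, affinity)
```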
The deterministic refinement network is trained on synthetic data. The inputs are derived from synthetic ground truth processed through the trained diffusion model to obtain $\hat{\mu}$ and $\sigma^2$. The network is optimized against the synthetic ground truth using $L_1$ and $L_2$ losses. To improve generalization to the varying metric ranges of real-world data, random scale and shift augmentations are applied to the depth data during training of this stage.
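A compact sketch of this training setup, assuming PyTorch; the refinement network itself is not shown, and the augmentation ranges are illustrative assumptions rather than the paper's values.

```python
import torch

def refinement_loss(pred, gt, valid):
    """L1 + L2 loss over valid ground-truth pixels."""
    diff = (pred - gt)[valid]
    return diff.abs().mean() + (diff ** 2).mean()

def random_scale_shift(depth, valid):
    """Augmentation: random affine change of the metric range (ranges are illustrative)."""
    a = torch.empty(1).uniform_(0.5, 2.0)   # random scale
    b = torch.empty(1).uniform_(-1.0, 1.0)  # random shift
    out = depth.clone()
    out[valid] = a * depth[valid] + b
    return out
```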
Implementation and Evaluation
The framework is trained on 54K RGB-D images from the Hypersim synthetic dataset. For evaluation, it is tested on real-world indoor datasets: DIODE-Indoor (dense, noisy), NYUv2 (clean with missing regions), and ScanNet (clean with missing regions). Performance is measured using RMSE, $\delta_k$ accuracy, and Kendall's $\tau$ (for relative depth tasks).
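For reference, the RMSE and $\delta_k$ metrics mentioned above are typically computed as below; this sketch follows the common convention of thresholds $1.25^k$, which is an assumption about the exact evaluation protocol rather than a detail stated here.

```python
import numpy as np

def depth_metrics(pred, gt, valid):
    """RMSE and delta_k accuracy over valid ground-truth pixels (common convention)."""
    p, g = pred[valid], gt[valid]
    rmse = np.sqrt(np.mean((p - g) ** 2))
    ratio = np.maximum(p / g, g / p)
    delta = {k: float(np.mean(ratio < 1.25 ** k)) for k in (1, 2, 3)}
    return rmse, delta
```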
Experiments demonstrate the framework's effectiveness:
- Sensor Depth Enhancement: Fine-tuning relative depth estimators (Depth Anything V1/V2) on DIODE-Indoor data enhanced by Perfecting Depth consistently improves performance on both the noisy DIODE test set and the cleaner NYUv2 zero-shot relative depth test set compared to using raw data or diffusion-only enhancement. This shows the method effectively reduces noise and artifacts while preserving details important for downstream tasks.
- Noisy Depth Completion: The framework is evaluated by adding controlled Gaussian noise to sparse ground-truth depth maps and refining the resulting noisy inputs using different methods. Perfecting Depth significantly improves the performance of standard depth completion methods (CFormer (Kumar et al., 2020), MSPN (Melev et al., 4 Jun 2025)) under varying noise levels, highlighting its robustness to real-world sensor-like artifacts.
- Depth Inpainting: When tested on synthetic depth maps with large masked regions (simulating missing data), Perfecting Depth outperforms state-of-the-art monocular relative depth estimators (Depth Anything V1/V2 (Yang et al., 13 Jun 2024, Du et al., 18 Jan 2024), Marigold (Fan et al., 18 Mar 2024)) across most hole-to-image ratios. It effectively utilizes available depth priors to reconstruct missing areas with better geometric detail and consistency.
Limitations
While effective, the framework has limitations. The normalization process requires handling infinite-depth regions (like the sky) with a separate mask to avoid compressing valid depth ranges. The paper suggests that incorporating techniques from adaptive ordinal regression (like AdaBins (Mohammed et al., 2021)) could further improve accuracy and texture details.
In summary, Perfecting Depth provides a practical, data-driven approach for enhancing raw sensor depth. By using diffusion models to identify uncertainty via a training-inference gap and then applying a deterministic refinement network guided by this uncertainty and derived geometric cues, it effectively removes noise, fills missing areas, and improves metric accuracy. Trained on synthetic data, it shows strong generalization to diverse real-world sensor data, making it a valuable tool for applications requiring high-quality depth maps.