Marigold-DC Depth Completion Model
- The paper introduces a diffusion-based depth completion model that redefines completion as image-conditional sampling in the reverse diffusion process, enabling robust zero-shot generalization.
- The model architecture adapts the Stable Diffusion U-Net with a multi-channel input design that fuses image features and sparse depth cues via channel concatenation and cross-attention, steered by gradient-based optimization at inference.
- The approach achieves state-of-the-art performance on datasets such as NYUv2 and KITTI by combining geometry-aware sparse sampling with test-time optimization to generate dense, metrically valid depth maps.
Marigold-DC is a diffusion-based depth completion framework that leverages pre-trained monocular depth priors and sparse depth guidance to produce dense, metrically accurate depth maps from RGB images and irregular, highly sparse depth input. Developed in response to the limitations of conventional depth completion approaches, particularly their brittleness under domain shift and under uneven or missing sensor measurements, Marigold-DC redefines completion as image-conditional sampling in a generative latent diffusion process, constraining the output with sparse sensor data at inference time (Viola et al., 18 Dec 2024, Gregorek et al., 16 Sep 2024, Salloom et al., 9 Dec 2025, Liu et al., 17 Apr 2024, Talegaonkar et al., 23 May 2025).
1. Problem Formulation and Key Observations
Marigold-DC conceptualizes depth completion as posterior sampling
$$\mathbf{d} \sim p(\mathbf{d} \mid \mathbf{I}, \mathbf{d}_s),$$
where $\mathbf{d}$ is the dense depth map, $\mathbf{I}$ is the input image, and $\mathbf{d}_s$ contains valid depths only at a small irregular set of pixels $\Omega$. Rather than directly regressing $\mathbf{d}$ from $(\mathbf{I}, \mathbf{d}_s)$ via supervised learning as in most prior approaches, Marigold-DC leverages a powerful diffusion-trained monocular depth prior inherited from large-scale synthetic data (Viola et al., 18 Dec 2024). The sparse depth $\mathbf{d}_s$ anchors the completion, without overfitting, via optimization embedded inside the reverse diffusion inference chain.
This formulation yields strong zero-shot generalization, consistent metric accuracy even with sparse and spatially nonuniform guidance, and excellent robustness to missing regions or domain shift. These properties have enabled state-of-the-art results on datasets such as NYUv2, ScanNet, KITTI DC, and others (Viola et al., 18 Dec 2024, Gregorek et al., 16 Sep 2024, Salloom et al., 9 Dec 2025).
2. Model Architecture and Diffusion Framework
The architecture is based on the Stable Diffusion U-Net backbone adapted for depth. It comprises the following components, detailed in (Viola et al., 18 Dec 2024, Liu et al., 17 Apr 2024):
- Encoder–Decoder Backbone: A VAE encoder and decoder operate on images and depth maps, producing latent representations at downsampled spatial resolution.
- U-Net Denoiser $\epsilon_\theta$: Receives a noisy depth latent $\mathbf{z}_t$ at each timestep $t$, concatenated with the encoded image features and a mask or sparse depth latent. Cross-attention blocks fuse image and depth features.
- Latent Channel Structure: The first U-Net convolution is expanded to 13 channels: 4 (noisy depth), 4 (masked clean depth), 4 (image), 1 (mask).
- Conditioning Mechanism: At inference, sparse depth samples are embedded via a small CNN and concatenated with intermediate U-Net activations.
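The 13-channel input assembly can be illustrated with a minimal NumPy sketch; the latent shapes are assumptions based on the standard Stable Diffusion VAE (4 latent channels at 1/8 spatial resolution), not values stated in the paper:

```python
import numpy as np

# Assumed latent shapes for a 512x512 input: the Stable Diffusion VAE
# yields 4-channel latents at 1/8 spatial resolution (64x64).
H, W = 64, 64
z_noisy = np.random.randn(4, H, W)   # noisy depth latent
z_clean = np.random.randn(4, H, W)   # masked clean-depth latent
z_image = np.random.randn(4, H, W)   # encoded image latent
mask    = np.zeros((1, H, W))        # sparse-validity mask

# The first U-Net convolution is widened to accept all parts at once:
# 4 + 4 + 4 + 1 = 13 input channels.
unet_input = np.concatenate([z_noisy, z_clean, z_image, mask], axis=0)
print(unet_input.shape)  # (13, 64, 64)
```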
The diffusion process consists of the standard forward noising Markov chain,
$$q(\mathbf{z}_t \mid \mathbf{z}_{t-1}) = \mathcal{N}\!\left(\sqrt{1-\beta_t}\,\mathbf{z}_{t-1},\; \beta_t \mathbf{I}\right),$$
and a learned reverse step
$$p_\theta(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \mathbf{c}),$$
where $\mathbf{c}$ represents the encoded conditioning features (image and optionally sparse depth).
Training uses a simple score-matching loss on the predicted noise residual,
$$\mathcal{L} = \mathbb{E}_{\mathbf{z}_0,\, \boldsymbol{\epsilon},\, t}\!\left[\left\| \boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \mathbf{c}) \right\|_2^2\right].$$
No auxiliary or regularization losses are added during training (Liu et al., 17 Apr 2024). Fine-tuning typically runs for 200 epochs on large synthetic datasets, using random masking and percentile normalization as described in (Liu et al., 17 Apr 2024).
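The forward chain and the noise-prediction objective can be sketched in NumPy; the linear beta schedule is an illustrative assumption, and a randomly perturbed copy of the noise stands in for the network $\epsilon_\theta$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear beta schedule (an assumption, not the paper's).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_bar = np.cumprod(1.0 - betas)

def q_sample(z0, t, eps):
    """Forward noising: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alphas_bar[t]) * z0 + np.sqrt(1.0 - alphas_bar[t]) * eps

z0 = rng.standard_normal((4, 8, 8))    # clean depth latent
eps = rng.standard_normal(z0.shape)    # Gaussian noise
zt = q_sample(z0, 500, eps)

# Training objective: plain MSE on the predicted noise residual, with no
# auxiliary or regularization terms. A perturbed copy of eps stands in
# for the network prediction eps_theta(z_t, t, c).
eps_pred = eps + 0.1 * rng.standard_normal(z0.shape)
loss = float(np.mean((eps - eps_pred) ** 2))
```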
3. Test-Time Sparse Depth Guidance and Optimization
Inference is characterized by a tightly coupled diffusion–optimization loop steering the output to match sparse measurements, first introduced in (Viola et al., 18 Dec 2024), with alternative “steering” variants in (Gregorek et al., 16 Sep 2024).
- Preview Computation: At each reverse step, compute the preview of the clean latent via Tweedie’s formula,
$$\hat{\mathbf{z}}_0 = \frac{\mathbf{z}_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}_\theta(\mathbf{z}_t, t, \mathbf{c})}{\sqrt{\bar{\alpha}_t}},$$
and decode it to the pixel-space depth map $\hat{d}$.
- Affine Scale/Shift Fit: Convert $\hat{d}$ to metric depth via $\hat{d}_m = a\,\hat{d} + b$, with $a$ and $b$ optimized to minimize the error to the sparse measurements $\mathbf{d}_s$.
- Sparse Loss: A mixed $L_1$–$L_2$ loss restricted to the observed pixels,
$$\mathcal{L}_s = \frac{1}{|\Omega|} \sum_{p \in \Omega} \left( \left|\hat{d}_m(p) - d_s(p)\right| + \left(\hat{d}_m(p) - d_s(p)\right)^2 \right).$$
- Gradient-Based Optimization: Backpropagate the sparse loss through the decoder and update the latent and affine parameters with Adam. Typical learning rates are $0.05$ (latent) and $0.005$ (affine) (Viola et al., 18 Dec 2024). The update is repeated for a fixed number of inner steps per denoising iteration.
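The inner loop above can be condensed into a toy NumPy sketch. To keep it self-contained, the decoder is taken as the identity (so the decoded preview is the latent itself) and the affine parameters are fit once in closed form rather than optimized jointly; both are simplifications of the actual method:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: identity decoder, so the decoded preview *is* the latent and
# gradients are analytic. In the real method the preview comes from
# Tweedie's formula and passes through the VAE decoder.
H, W = 16, 16
d_true = rng.uniform(1.0, 5.0, (H, W))     # unknown dense depth
obs = rng.random((H, W)) < 0.10            # ~10% sparse observations
d_sparse = np.where(obs, d_true, 0.0)
z = rng.standard_normal((H, W))            # clean-latent preview (toy)

# Affine scale/shift fitted once in closed form (the paper optimizes
# a, b jointly with the latent using Adam).
a, b = np.polyfit(z[obs], d_sparse[obs], 1)

def sparse_loss(d_hat):
    r = (d_hat - d_sparse)[obs]
    return float(np.mean(np.abs(r) + r ** 2))  # mixed L1-L2 on observed pixels

lr = 0.05                                  # latent learning rate from the paper
loss_before = sparse_loss(a * z + b)
for _ in range(100):
    r = a * z + b - d_sparse
    grad = (np.sign(r) + 2.0 * r) * a * obs / obs.sum()
    z -= lr * grad                         # gradient step on the latent
loss_after = sparse_loss(a * z + b)
```

In the real pipeline the gradient flows through the VAE decoder, so the update is computed by automatic differentiation rather than by hand.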
Alternative steering mechanisms (e.g., SteeredMarigold (Gregorek et al., 16 Sep 2024)) blend decoded previews with the sparse values, then re-encode the result and shift the latent accordingly. The steering strength parameter controls adherence to the sparse depths, with values in the range $0.1$–$0.4$ yielding the best trade-off between global realism and local metric fidelity.
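A hedged sketch of such a steering blend (variable names are hypothetical; the real method re-encodes the blended map and shifts the latent):

```python
import numpy as np

rng = np.random.default_rng(2)

lam = 0.3                                  # steering strength (0.1-0.4 range)
H, W = 8, 8
preview = rng.uniform(1.0, 5.0, (H, W))    # decoded depth preview
obs = rng.random((H, W)) < 0.2             # sparse-measurement mask
d_sparse = rng.uniform(1.0, 5.0, (H, W))   # measured depths (valid where obs)

# Pull the preview toward the measurements only where they exist; larger
# lam means stronger adherence to the sparse depths.
steered = np.where(obs, (1.0 - lam) * preview + lam * d_sparse, preview)
```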
4. Sparse Depth Sampling Strategies
Marigold-DC’s robustness to high sparsity and spatial irregularity is further enhanced by geometry-aware sampling (Salloom et al., 9 Dec 2025):
- PCA-Based Reliability Sampling: Ground-truth depth is back-projected to 3D, and surface normals and curvature are estimated via local PCA. Per-pixel reliability is computed from the alignment of the surface normal $\mathbf{n}$ with the viewing direction $\mathbf{v}$ (e.g., $|\mathbf{n} \cdot \mathbf{v}|$), which penalizes grazing angles.
- Sampling Distribution: The sparse pixels for a given budget are drawn with probability proportional to the per-pixel reliability, yielding realistic sensor-like patterns.
- Empirical Results: Geometry-aware sampling reduced RMSE by up to 12% at the lowest sparse budgets (e.g., 100 points on NYUv2), and produced more faithful edge representation and error heatmaps (Salloom et al., 9 Dec 2025).
This suggests that domain-matched, physically motivated sparsity patterns are crucial for realistic evaluation and deployment in robotic or autonomous driving systems.
Table: NYUv2, Geometry-Aware vs Random Sampling (Salloom et al., 9 Dec 2025)
| # Sparse pixels | Geometry-Aware MAE / RMSE (m) | Uniform Random MAE / RMSE (m) |
|---|---|---|
| 100 | 0.086 / 0.186 | 0.089 / 0.211 |
| 200 | 0.079 / 0.189 | 0.086 / 0.207 |
| 300 | 0.058 / 0.147 | 0.062 / 0.159 |
| 500 | 0.039 / 0.099 | 0.039 / 0.105 |
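The sampling recipe of this section can be approximated in a few lines of NumPy. For brevity, normals here come from depth gradients under an assumed orthographic camera with view direction $(0,0,1)$, instead of the paper’s full 3D back-projection and local PCA:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy depth map: a tilted plane, so slanted regions get lower
# reliability and correspondingly fewer samples.
H, W = 32, 32
depth = np.fromfunction(lambda y, x: 2.0 + 0.05 * x + 0.2 * y, (H, W))

# Normals from depth gradients (simplification of local 3D PCA).
gy, gx = np.gradient(depth)
n = np.stack([-gx, -gy, np.ones_like(depth)], axis=-1)
n /= np.linalg.norm(n, axis=-1, keepdims=True)

# Reliability |n . v| with view direction v = (0, 0, 1): surfaces seen
# at grazing angles (normal nearly perpendicular to v) are penalized.
r = np.abs(n[..., 2])

# Draw the sparse budget with probability proportional to reliability.
K = 100
p = r.ravel() / r.sum()
idx = rng.choice(H * W, size=K, replace=False, p=p)
sparse_mask = np.zeros(H * W, dtype=bool)
sparse_mask[idx] = True
sparse_mask = sparse_mask.reshape(H, W)
```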
5. Practical Implementation and Dataset Coverage
- Data Sources: Marigold-DC is pretrained on synthetic data (Hypersim, Virtual KITTI, SceneFlow), but is evaluated zero-shot on diverse real-world datasets: NYUv2, ScanNet, iBims-1 (indoor), KITTI DC, DDAD (outdoor) (Viola et al., 18 Dec 2024).
- Sparse Depth Patterns: Both random uniform and geometry-aware patterns are supported, with robust performance documented down to budgets of 100 points.
- Optimization Hyperparameters: DDIM with 50 steps suffices for stabilization; ensembling several independent runs boosts median accuracy. Min–max initialization for the affine scaling consistently outperforms naive least squares.
- Computational Complexity: Inference requires iterative diffusion steps plus inner optimization, leading to multi-second runtimes per image. No further fine-tuning is needed for domain adaptation or additional sensor types.
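One plausible reading of the min–max initialization is to map the prediction’s range on observed pixels onto the range of the sparse measurements; the helper below is an assumption for illustration, not the paper’s exact procedure:

```python
import numpy as np

rng = np.random.default_rng(4)

def minmax_affine_init(pred, sparse, mask):
    """Map pred's min/max on observed pixels onto the sparse depths'
    min/max (hypothetical reading of 'min-max initialization')."""
    x, y = pred[mask], sparse[mask]
    a = (y.max() - y.min()) / max(x.max() - x.min(), 1e-8)
    b = y.min() - a * x.min()
    return a, b

pred = rng.random((16, 16))            # relative (affine-invariant) depth
mask = rng.random((16, 16)) < 0.10     # sparse observation pattern
d_metric = 3.0 * pred + 1.0            # toy metric depth, affine in pred
a, b = minmax_affine_init(pred, d_metric, mask)
print(round(a, 6), round(b, 6))  # 3.0 1.0
```

Because the toy ground truth is exactly affine in the prediction, the recovered scale and shift are exact here; with noisy sparse depths they only initialize the subsequent optimization.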
6. Quantitative Performance and Qualitative Behavior
Marigold-DC achieves state-of-the-art accuracy in zero-shot depth completion settings:
- NYUv2/ScanNet (500 points): RMSE on ScanNet is 0.057 m (vs 0.076 m for previous SOTA), NYUv2 0.124 m (vs 0.247 m) (Viola et al., 18 Dec 2024).
- Degraded Sampling Settings: SteeredMarigold remains robust to large missing regions where conventional networks fail or degrade catastrophically (Gregorek et al., 16 Sep 2024).
- Qualitative Properties: Results feature sharp edge alignment, correct far-plane placement, continuous surfaces, and reduced boundary error amplification (Liu et al., 17 Apr 2024).
- Metric Depth Recovery: Joint optimization over latent variables and affine scalars yields metrically valid depth maps, not just relative geometry, matching or surpassing task-specific metric depth estimators (Talegaonkar et al., 23 May 2025).
Common failure modes include diminished texture realism under strong steering and increased inference time. Performance saturates beyond a moderate number of denoising steps; ablation studies consistently highlight the impact of the optimization mechanism, learning rate, and initialization strategy (Viola et al., 18 Dec 2024, Gregorek et al., 16 Sep 2024).
7. Extensions and Integration in Broader 3D Pipelines
The Marigold-DC approach is readily extensible:
- Defocus Blur Metric Recovery: Incorporating physical defocus cues and optimizing latent depths against blur renders enables zero-shot metric depth estimation from dual-aperture captures, outperforming classical depth-from-defocus (Talegaonkar et al., 23 May 2025).
- 3D Gaussian Inpainting: Marigold-DC serves as a high-fidelity inpainting prior for 3D Gaussian representations in view synthesis, enabling seamless geometric and photometric completion and substantially accelerating downstream fine-tuning (Liu et al., 17 Apr 2024).
- Robotic Vision Deployment: When paired with geometry-aware sampling, Marigold-DC provides robust perception for manipulation, navigation, and inspection tasks, directly reflecting real sensor reliability.
- Plug-and-Play Steering: The steering/inference mechanism can adopt further regularizers or loss terms (e.g., explicit constraints on known LiDAR depths (Talegaonkar et al., 23 May 2025)).
A plausible implication is that diffusion-depth priors, when constrained by sparse but physically realistic measurements, fundamentally improve completion accuracy and reliability under the variable, incomplete sensing common in practical robotics, autonomous driving, and scene reconstruction.
Principal References: (Viola et al., 18 Dec 2024, Gregorek et al., 16 Sep 2024, Salloom et al., 9 Dec 2025, Liu et al., 17 Apr 2024, Talegaonkar et al., 23 May 2025)