SteeredMarigold: Steering Diffusion Towards Depth Completion of Largely Incomplete Depth Maps
(2409.10202v2)
Published 16 Sep 2024 in cs.RO and cs.CV
Abstract: Even though the depth maps captured by RGB-D sensors deployed in real environments are often characterized by large areas missing valid depth measurements, the vast majority of depth completion methods still assume depth values covering all areas of the scene. To address this limitation, we introduce SteeredMarigold, a training-free, zero-shot depth completion method capable of producing metric dense depth even for largely incomplete depth maps. SteeredMarigold achieves this by using the available sparse depth points as conditions to steer a denoising diffusion probabilistic model. Our method outperforms relevant top-performing methods on the NYUv2 dataset in tests where no depth was provided for a large area, achieving state-of-the-art performance and exhibiting remarkable robustness against depth map incompleteness. Our source code is publicly available at https://steeredmarigold.github.io.
The paper introduces a novel training-free method, SteeredMarigold, that steers a diffusion process with sparse depth cues to complete large missing regions.
It integrates a pre-trained Marigold diffusion model with iterative VAE encode-decode steps to effectively harness both RGB images and sparse depth measurements.
Experimental results on NYUv2 show that SteeredMarigold outperforms traditional methods in large missing area scenarios despite higher computational costs.
This paper introduces SteeredMarigold, a novel method for depth completion, particularly designed to handle depth maps with large missing regions, a common issue with real-world RGB-D sensors used in robotics (2409.10202). Unlike traditional depth completion methods that assume relatively uniform sparse depth input, or monocular depth estimation methods that ignore available depth data, SteeredMarigold aims to fill large gaps by leveraging both the RGB image and the available sparse depth points.
The core idea is to use a pre-trained, diffusion-based monocular depth estimator, specifically Marigold (2312.02145), and "steer" its denoising process using the sparse, metric depth measurements $c$ as a condition. This approach is training-free, inheriting the zero-shot generalization capabilities of the underlying Marigold model.
The overall process can be summarized by the equation:

$$d = M\big(D(\mathrm{diff}(E(m), c)), c\big)$$

where $m$ is the RGB image, $c$ is the sparse metric depth input, $E$ and $D$ are the VAE encoder and decoder from Marigold, $\mathrm{diff}$ represents the steered diffusion process, and $M$ is a transformation (least-squares fit) that converts the relative depth output $d^{*}$ of the diffusion model to metric depth $d$ using $c$ as a reference.
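For concreteness, $M$ amounts to an ordinary least-squares fit of a scale and shift against the sparse points. Below is a minimal sketch in NumPy; the function and variable names are illustrative, not taken from the authors' code:

```python
import numpy as np

def fit_metric(d_rel, c, mask):
    """Fit scale s and shift b so that s * d_rel + b matches the sparse
    metric depth c at the pixels marked valid by `mask` (sketch of M)."""
    x = d_rel[mask].ravel()
    y = c[mask].ravel()
    A = np.stack([x, np.ones_like(x)], axis=1)   # design matrix [d_rel, 1]
    (s, b), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * d_rel + b                          # metric depth d
```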
Steering Mechanism:
The steering happens within the reverse diffusion process of the DDPM. At each timestep t (from T down to 1):
The diffusion model predicts the noise (or velocity $v_\theta$) for the current latent state $x_t$.
An estimate of the clean latent sample $\tilde{x}_0$ is computed using:

$$\tilde{x}_0 = \sqrt{\bar{\alpha}_t}\, x_t - \sqrt{1 - \bar{\alpha}_t}\, v_\theta(x_t)$$

The standard reverse diffusion step calculates the candidate $x_{t-1}$.
Steering Adjustment: Before proceeding to the next step, $x_{t-1}$ is adjusted based on the sparse condition $c$. This adjustment uses the estimated clean sample decoded to the image space, $\tilde{x}_0^D = D(\tilde{x}_0)$. The core steering equation is (see the sketches after this list):

$$x_{t-1} \leftarrow x_{t-1} + \lambda \cdot E\big(\phi_2(\tilde{x}_0^D, c, P) - \phi_1(\tilde{x}_0^D, P)\big)$$
(Note: The paper's Eq. 9 seems slightly rearranged/simplified here for clarity based on the description and similar methods).
$\lambda$ is the steering strength factor, often scaled by $\sqrt{1 - \bar{\alpha}_t}$.
$P$ is a set of pixel locations including all known depth points from $c$ and additional randomly sampled points in areas far (distance $> \zeta$) from any known points.
$\phi_1$ interpolates depth values at positions $P$ using only the predicted depth $\tilde{x}_0^D$.
$\phi_2$ interpolates depth values at positions $P$ using the known depth from $c$ where available, and $\tilde{x}_0^D$ otherwise. The metric depth values from $c$ must be scaled and shifted to match the relative depth $\tilde{x}_0^D$ before interpolation within $\phi_2$.
The difference $\phi_2 - \phi_1$ represents the desired change in the estimated clean depth space, which is then encoded back via $E$ and added to the latent state $x_{t-1}$.
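The interpolation operators $\phi_1$ and $\phi_2$ could be realized with a scattered-data interpolator such as SciPy's `griddata`; the paper does not prescribe a specific interpolator, and the `align` helper (the scale/shift fit of $c$ to $\tilde{x}_0^D$) is assumed:

```python
import numpy as np
from scipy.interpolate import griddata

def dense_from_points(P, values, shape):
    """Interpolate a dense map from values given at pixel positions P
    (rows, cols); pixels outside the convex hull of P would need a
    fallback such as nearest-neighbor interpolation."""
    rows, cols = np.mgrid[0:shape[0], 0:shape[1]]
    return griddata(P, values, (rows, cols), method="linear")

def phi1(d_pred, P):
    # values sampled from the predicted depth only
    return dense_from_points(P, d_pred[P[:, 0], P[:, 1]], d_pred.shape)

def phi2(d_pred, c, P, known_mask, align):
    # known (aligned) depth where available, predicted depth otherwise
    vals = d_pred[P[:, 0], P[:, 1]].copy()
    has_c = known_mask[P[:, 0], P[:, 1]]
    vals[has_c] = align(c[P[:, 0], P[:, 1]][has_c])
    return dense_from_points(P, vals, d_pred.shape)
```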
The architecture involves the Marigold diffusion model ($\mathrm{diff}$), a VAE (encoder $E$, decoder $D$), and a plug-and-play steering module that invokes $E$ and $D$ within each denoising step.
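Putting the pieces together, one steered reverse step could look like the sketch below, assuming a Diffusers-style DDPM scheduler (with `alphas_cumprod` and a `step(...).prev_sample` interface); `model` stands in for Marigold's v-prediction U-Net, `vae` for its autoencoder, and the NumPy/torch conversions around `phi1`/`phi2` are omitted for brevity:

```python
import torch

@torch.no_grad()
def steered_step(x_t, t, model, scheduler, vae, c, P, known_mask, align, lam):
    """One steered reverse-diffusion step (illustrative sketch)."""
    # v-prediction and clean-sample estimate:
    # x0_hat = sqrt(a_bar) * x_t - sqrt(1 - a_bar) * v_theta(x_t)
    v = model(x_t, t)
    a_bar = scheduler.alphas_cumprod[t]
    x0_hat = a_bar.sqrt() * x_t - (1.0 - a_bar).sqrt() * v

    # standard DDPM reverse step yields the candidate x_{t-1}
    x_prev = scheduler.step(v, t, x_t).prev_sample

    # steering in image space: decode x0_hat, form the correction, encode it
    d_hat = vae.decode(x0_hat)                                   # \tilde{x}_0^D
    delta = phi2(d_hat, c, P, known_mask, align) - phi1(d_hat, P)
    return x_prev + lam * vae.encode(delta)                      # nudge toward c
```

Decoding and re-encoding inside every step is what lets the steering operate in image space; it is also the main source of the computational cost discussed below.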
Implementation Details:
Built using PyTorch and the Hugging Face Diffusers library.
Uses pre-trained Marigold weights (e.g., prs-eth/marigold-depth-v1-0 from Hugging Face).
Requires no model training.
Diffusion run typically uses 50 steps (DDPM).
The steering factor $\lambda$ is a crucial hyperparameter; values from $0.1\sqrt{1 - \bar{\alpha}_t}$ to $0.4\sqrt{1 - \bar{\alpha}_t}$ were explored. Higher values enforce the condition more strongly but can degrade detail.
The neighborhood distance threshold $\zeta$ was set to 13 pixels in experiments.
A significant computational cost arises because the VAE decoder and encoder (D and E) must be run within each of the 50 diffusion steps to perform the steering in image space.
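One way to assemble the point set $P$ with the $\zeta$ threshold is via a Euclidean distance transform over the known-depth mask; the sketch below makes this concrete (`n_extra` is an illustrative parameter, not from the paper):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def build_steering_points(known_mask, zeta=13, n_extra=500, rng=None):
    """Pixel set P: all known depth points plus random samples from regions
    farther than `zeta` pixels from any known point (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    # distance of every pixel to the nearest known depth measurement
    dist = distance_transform_edt(~known_mask)
    far = np.argwhere(dist > zeta)
    pick = rng.choice(len(far), size=min(n_extra, len(far)), replace=False)
    return np.concatenate([np.argwhere(known_mask), far[pick]], axis=0)
```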
Evaluation:
Dataset: NYUv2 test set (654 images).
Resolution: 608×448 (higher than standard NYUv2 evaluation).
Metrics: RMSE, MAE, REL (lower is better), $\delta_1$ (higher is better); a generic implementation is sketched after this list.
Protocol: Compared against BP-Net (2403.11270) and CompletionFormer (2306.09710) (strong depth completion baselines), and the original Marigold model (using sparse points only for final scaling via M).
Scenario 1 (Uniform Sparsity): Evaluated with 500, 2000, and 13620 randomly sampled depth points across the image.
Scenario 2 (Large Missing Area): The central 408×248 area had no depth points provided; points outside this area were sampled as before. Performance was measured on the full image, the central (erased) area, and a smaller inner area.
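For reference, the four metrics have their standard definitions; a generic implementation (not the authors' evaluation code) is:

```python
import numpy as np

def depth_metrics(pred, gt, valid):
    """RMSE, MAE, REL, and delta_1 over the valid ground-truth pixels."""
    p, g = pred[valid], gt[valid]
    return {
        "RMSE": float(np.sqrt(np.mean((p - g) ** 2))),
        "MAE": float(np.mean(np.abs(p - g))),
        "REL": float(np.mean(np.abs(p - g) / g)),
        "delta1": float(np.mean(np.maximum(p / g, g / p) < 1.25)),
    }
```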
Results:
In the uniform sparsity scenario (Table 1), SteeredMarigold becomes competitive with BP-Net and CompletionFormer when using ~13k points and a sufficiently high steering factor ($\lambda = 0.4\sqrt{1 - \bar{\alpha}_t}$).
In the large missing area scenario (Table 2, Figure 5), SteeredMarigold significantly outperforms the baselines. BP-Net and CompletionFormer fail to produce reasonable depth in the erased region, while SteeredMarigold provides plausible completion, demonstrating robustness to largely incomplete inputs.
SteeredMarigold also outperforms the base Marigold model even within the erased regions where no direct steering occurs. This suggests the diffusion process effectively propagates the constraints from the steered regions, harmonizing the overall depth map (Figure 4).
Limitations:
Computational Cost: The method is slow due to the iterative nature of diffusion and the need for VAE encode/decode within each step, making it unsuitable for real-time applications.
Efficiency: Eliminating the per-step encode/decode could significantly speed up inference.
Constraint Adherence: The method does not strictly guarantee that the output depth at the known locations will exactly match the input sparse depth values.
Generalization: Although built on a zero-shot model (Marigold), SteeredMarigold itself was primarily evaluated on NYUv2. Further testing on datasets like KITTI is needed to fully validate its zero-shot completion capabilities.
In summary, SteeredMarigold offers a promising training-free approach for depth completion, especially effective when dealing with the challenging scenario of large missing depth regions common in practical robotics applications. Its main drawback is the computational expense inherent in the guided diffusion process.