- The paper introduces Marigold, a diffusion-based method that adapts Stable Diffusion’s denoising U-Net for monocular depth estimation.
- It employs synthetic RGB-D data and annealed multi-resolution noise to enhance consistency and robustness across diverse environments.
- Test-time ensembling and rich visual priors yield over a 20% improvement in zero-shot transfer on major benchmarks.
Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation
The paper "Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation" presents Marigold, a novel approach that tackles monocular depth estimation with diffusion models. It leverages the extensive and diverse visual knowledge encapsulated in pre-trained generative diffusion models, such as Stable Diffusion, to improve the generalization of depth estimation across domains. This research is particularly significant for scenarios where direct range or stereo measurements are unavailable, leaving a geometrically ill-posed estimation problem.
Context and Problem Definition
Monocular depth estimation is a fundamental task in computer vision, aiming to discern the 3D structure of a scene from a single 2D image by producing a depth map for each pixel. Traditional methods often rely on convolutional networks or vision transformers trained on extensive labeled datasets, which restrict their performance to the domain of the training data. The ability to generalize across diverse environments remains a challenge, partially due to unfamiliar visual content or configurations in new datasets. This paper posits that leveraging image diffusion models, which encode rich visual priors from vast image corpora, can address these issues.
Methodology
The proposed method, Marigold, is an affine-invariant monocular depth estimator built on the latent diffusion model framework. It fine-tunes a pre-trained Stable Diffusion model for depth estimation using only synthetic RGB-D data, which avoids the noise and incompleteness common in real depth datasets. The latent space of the pre-trained model is kept intact: the VAE encoder and decoder remain frozen, and only the denoising U-Net component is fine-tuned.
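Affine invariance means the model predicts depth only up to an unknown scale and shift, so each ground-truth depth map is mapped to a fixed range by a robust affine transform before training. The sketch below illustrates the idea in NumPy; the 2%/98% percentile cutoffs and the exact normalization details are illustrative assumptions, not the paper's verbatim recipe.

```python
import numpy as np

def normalize_depth(depth, q_lo=0.02, q_hi=0.98):
    """Map a metric depth map to [-1, 1] with a robust affine transform.

    The percentile cutoffs are assumed values; the key property is that
    the result is invariant to the scale and shift of the raw depths,
    matching Marigold's affine-invariant formulation.
    """
    d_lo, d_hi = np.quantile(depth, [q_lo, q_hi])
    d = (depth - d_lo) / max(d_hi - d_lo, 1e-8)   # roughly [0, 1]
    return np.clip(d, 0.0, 1.0) * 2.0 - 1.0       # [-1, 1]

# Two depth maps differing only by scale and shift normalize identically.
rng = np.random.default_rng(0)
depth = rng.uniform(0.5, 10.0, size=(64, 64))
assert np.allclose(normalize_depth(depth), normalize_depth(3.0 * depth + 1.0))
```

Because the normalized map lives in the same value range as the VAE's image inputs, it can be encoded into the frozen latent space alongside the RGB image.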
A key innovation of Marigold is its use of annealed multi-resolution noise during training, which improves consistency and robustness. At inference time, Marigold employs a test-time ensembling strategy that aggregates multiple stochastic inference passes into a single refined prediction.
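Both ideas can be sketched compactly. Below, one helper samples multi-resolution noise by summing upsampled coarse Gaussian noise with decaying weights, and another ensembles affine-invariant predictions by aligning them with a least-squares scale and shift before taking a per-pixel median. Both are simplified illustrations: the nearest-neighbour upsampling, the decay factor, and aligning to the first prediction (rather than a joint optimization over all passes) are assumptions that the paper may handle differently.

```python
import numpy as np

def multi_res_noise(h, w, strength=0.9, rng=None):
    """Gaussian noise plus upsampled coarse noise with decaying weights.

    `strength` (an assumed value) controls how much low-frequency noise
    is mixed in; annealing would reduce it over the course of training.
    """
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal((h, w))
    scale, weight = 2, strength
    while h // scale >= 1 and w // scale >= 1:
        ch, cw = -(-h // scale), -(-w // scale)          # ceil division
        coarse = rng.standard_normal((ch, cw))
        up = np.repeat(np.repeat(coarse, scale, 0), scale, 1)[:h, :w]
        noise += weight * up
        scale *= 2
        weight *= strength
    return noise / noise.std()                           # unit variance

def ensemble_depths(preds):
    """Median-aggregate affine-invariant depth predictions.

    Each prediction is aligned to the first via a least-squares scale
    and shift; the paper instead optimizes the alignments jointly.
    """
    ref = preds[0].ravel()
    aligned = [preds[0]]
    for p in preds[1:]:
        A = np.stack([p.ravel(), np.ones(p.size)], axis=1)
        (s, t), *_ = np.linalg.lstsq(A, ref, rcond=None)
        aligned.append(s * p + t)
    return np.median(np.stack(aligned), axis=0)
```

The median aggregation exploits the stochasticity of the diffusion sampler: individual passes differ, but their aligned consensus is more stable than any single prediction.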
Empirical Evaluation
The authors conduct extensive empirical evaluations across real-world datasets, including NYUv2, KITTI, ETH3D, ScanNet, and DIODE. Marigold achieves state-of-the-art performance, with more than a 20% relative gain on specific zero-shot benchmarks, outperforming existing depth estimators that were trained directly on larger, more diverse real-world datasets.
Implications and Future Directions
The findings underscore the potential for using pre-trained diffusion models as powerful priors for depth estimation, capable of zero-shot generalization across numerous domains. This capability could obviate the need for large labeled datasets in diverse environments, saving significant time and resources. Furthermore, the paper sets the stage for future explorations into optimizing inference efficiency and enhancing inter-image prediction consistency.
Overall, the Marigold framework represents a promising step forward in leveraging generative models for supervised learning tasks. Future work could explore the interplay between generative priors and discriminative tasks to discover more efficient ways to exploit the expansive visual knowledge embedded in diffusion models for other computer vision applications.