- The paper introduces Marigold, a diffusion-based method that adapts Stable Diffusion’s denoising U-Net for monocular depth estimation.
- It employs synthetic RGB-D data and annealed multi-resolution noise to enhance consistency and robustness across diverse environments.
- Test-time ensembling and rich visual priors yield over a 20% improvement in zero-shot transfer on major benchmarks.
Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation
The paper "Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation" presents Marigold, a novel approach that tackles monocular depth estimation with diffusion models. It leverages the extensive and diverse visual knowledge encapsulated in pre-trained generative diffusion models, such as Stable Diffusion, to improve the generalization of depth estimation across domains. This research is particularly significant for scenarios where direct range or stereo measurements are unavailable, leaving a geometrically ill-posed estimation problem.
Context and Problem Definition
Monocular depth estimation is a fundamental task in computer vision, aiming to discern the 3D structure of a scene from a single 2D image by producing a depth map for each pixel. Traditional methods often rely on convolutional networks or vision transformers trained on extensive labeled datasets, which restrict their performance to the domain of the training data. The ability to generalize across diverse environments remains a challenge, partially due to unfamiliar visual content or configurations in new datasets. This paper posits that leveraging image diffusion models, which encode rich visual priors from vast image corpora, can address these issues.
Methodology
The proposed method, Marigold, is an affine-invariant monocular depth estimator built on the latent diffusion model framework. It fine-tunes a pre-trained Stable Diffusion model for depth estimation using only synthetic RGB-D data, which avoids the noise and incompleteness common in real depth datasets. The latent space of the pre-trained model is kept intact: the VAE encoder and decoder remain frozen, and only the denoising U-Net component is fine-tuned.
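Affine invariance means the model predicts depth only up to an unknown scale and shift, so each ground-truth depth map is mapped to a fixed range by a robust affine transform before training. The sketch below illustrates the idea in NumPy; the 2%/98% percentile cutoffs and the exact normalization details are illustrative assumptions, not the paper's verbatim recipe.

```python
import numpy as np

def normalize_depth(depth, q_lo=0.02, q_hi=0.98):
    """Map a metric depth map to [-1, 1] with a robust affine transform.

    The percentile cutoffs are assumed values; the key property is that
    the result is invariant to the scale and shift of the raw depths,
    matching Marigold's affine-invariant formulation.
    """
    d_lo, d_hi = np.quantile(depth, [q_lo, q_hi])
    d = (depth - d_lo) / max(d_hi - d_lo, 1e-8)   # roughly [0, 1]
    return np.clip(d, 0.0, 1.0) * 2.0 - 1.0       # [-1, 1]

# Two depth maps differing only by scale and shift normalize identically.
rng = np.random.default_rng(0)
depth = rng.uniform(0.5, 10.0, size=(64, 64))
assert np.allclose(normalize_depth(depth), normalize_depth(3.0 * depth + 1.0))
```

Because the normalized map lives in the same value range as the VAE's image inputs, it can be encoded into the frozen latent space alongside the RGB image.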
A key innovation of Marigold is its use of annealed multi-resolution noise during training, which improves consistency and robustness. At inference time, Marigold employs a test-time ensembling strategy that aggregates multiple stochastic inference passes into a single refined prediction.
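Both ideas can be sketched compactly. Below, one helper samples multi-resolution noise by summing upsampled coarse Gaussian noise with decaying weights, and another ensembles affine-invariant predictions by aligning them with a least-squares scale and shift before taking a per-pixel median. Both are simplified illustrations: the nearest-neighbour upsampling, the decay factor, and aligning to the first prediction (rather than a joint optimization over all passes) are assumptions that the paper may handle differently.

```python
import numpy as np

def multi_res_noise(h, w, strength=0.9, rng=None):
    """Gaussian noise plus upsampled coarse noise with decaying weights.

    `strength` (an assumed value) controls how much low-frequency noise
    is mixed in; annealing would reduce it over the course of training.
    """
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal((h, w))
    scale, weight = 2, strength
    while h // scale >= 1 and w // scale >= 1:
        ch, cw = -(-h // scale), -(-w // scale)          # ceil division
        coarse = rng.standard_normal((ch, cw))
        up = np.repeat(np.repeat(coarse, scale, 0), scale, 1)[:h, :w]
        noise += weight * up
        scale *= 2
        weight *= strength
    return noise / noise.std()                           # unit variance

def ensemble_depths(preds):
    """Median-aggregate affine-invariant depth predictions.

    Each prediction is aligned to the first via a least-squares scale
    and shift; the paper instead optimizes the alignments jointly.
    """
    ref = preds[0].ravel()
    aligned = [preds[0]]
    for p in preds[1:]:
        A = np.stack([p.ravel(), np.ones(p.size)], axis=1)
        (s, t), *_ = np.linalg.lstsq(A, ref, rcond=None)
        aligned.append(s * p + t)
    return np.median(np.stack(aligned), axis=0)
```

The median aggregation exploits the stochasticity of the diffusion sampler: individual passes differ, but their aligned consensus is more stable than any single prediction.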
Empirical Evaluation
The authors conduct extensive empirical evaluations across real-world datasets, including NYUv2, KITTI, ETH3D, ScanNet, and DIODE. Marigold achieves state-of-the-art performance, with more than a 20% relative gain on specific zero-shot benchmarks, outperforming existing depth estimators that were trained directly on larger, more diverse real-world datasets.
Implications and Future Directions
The findings underscore the potential for using pre-trained diffusion models as powerful priors for depth estimation, capable of zero-shot generalization across numerous domains. This capability could obviate the need for large labeled datasets in diverse environments, saving significant time and resources. Furthermore, the paper sets the stage for future explorations into optimizing inference efficiency and enhancing inter-image prediction consistency.
Overall, the Marigold framework represents a promising step forward in leveraging generative models for supervised learning tasks. Future work could explore the interplay between generative priors and discriminative tasks to discover more efficient ways to exploit the expansive visual knowledge embedded in diffusion models for other computer vision applications.