- The paper introduces a single-step diffusion method that extracts a preimage from a Stable Diffusion model to achieve efficient zero-shot monocular depth estimation.
- The proposed refiner network leverages multi-scale features and attention maps to integrate rich representations, enhancing training efficiency and depth prediction quality.
- Experimental results across datasets show PrimeDepth is more than 100 times faster than the diffusion-based Marigold while producing detailed depth maps and remaining robust in challenging conditions such as nighttime scenes.
PrimeDepth: Efficient Monocular Depth Estimation with a Stable Diffusion Preimage
The paper "PrimeDepth: Efficient Monocular Depth Estimation with a Stable Diffusion Preimage," authored by Denis Zavadski, Damjan Kalšan, and Carsten Rother from the Computer Vision and Learning Lab at Heidelberg University, proposes a method for zero-shot monocular depth estimation built on Stable Diffusion (SD). The work targets the main weakness of diffusion-based depth estimators, their computational cost, while retaining their strengths.
The Core Proposition
The primary contribution of this paper is the introduction of a method termed "PrimeDepth," which utilizes a single-step diffusion process to estimate depth from monocular images. Unlike previous diffusion-based approaches such as Marigold, which are noted for their computational burden due to iterative denoising steps, PrimeDepth streamlines this process. It extracts a "preimage" during a single denoising step, leveraging the rich, inherent representations encoded by the SD model. This preimage is then processed through a refiner network designed with an architectural inductive bias to yield high-quality depth predictions.
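The single-step idea can be illustrated with a toy sketch. Note that this is purely conceptual: the real method hooks into Stable Diffusion's frozen UNet and also collects attention maps, whereas `toy_stage` and the array shapes below are hypothetical stand-ins.

```python
import numpy as np

def toy_stage(x):
    """Toy 'denoising' stage: 2x average pooling stands in for a UNet block."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def extract_preimage(latent, num_stages=3):
    """Run ONE pass through the stages and keep every intermediate
    feature map -- the 'preimage'. In PrimeDepth this replaces the
    iterative denoising loop of earlier diffusion-based estimators."""
    features = []
    x = latent
    for _ in range(num_stages):
        x = toy_stage(x)
        features.append(x)
    return features

preimage = extract_preimage(np.random.rand(64, 64))
print([f.shape for f in preimage])  # → [(32, 32), (16, 16), (8, 8)]
```

The point of the sketch is that all multi-scale representations fall out of a single forward pass, which is why the method's cost is that of one denoising step rather than many.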
Key Methodological Innovations
- Preimage Extraction: The preimage used in PrimeDepth encompasses multi-scale feature maps along with cross- and self-attention maps derived from the final denoising step of SD. Collecting these maps requires only a single forward pass, avoiding the iterative denoising of previous methods.
- Refiner Network: This network is architecturally biased to integrate the preimage features at different stages, mirroring the hierarchical structure of the SD model's intermediate representations. This design lets each scale of the preimage be exploited where it carries the most information, preserving fine detail in the predicted depth.
- Loss Functions in Pixel Domain: Computing the loss directly in the pixel domain, rather than in the latent domain used by Marigold, allows the depth prediction to be supervised end-to-end at image resolution, improving both training efficiency and prediction quality.
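The summary above does not spell out the exact loss, but affine-invariant depth losses computed directly on the predicted depth map are standard in this line of work. As a hedged illustration (the function `ssi_loss` is my own sketch, not the paper's formulation), a minimal scale-and-shift-invariant loss in the pixel domain looks like this:

```python
import numpy as np

def ssi_loss(pred, gt):
    """Scale-and-shift-invariant L1 loss in the pixel domain.

    Aligns pred to gt with a least-squares scale s and shift t,
    then averages the absolute residual. Illustrative only; the
    paper's exact loss may differ.
    """
    p, g = pred.ravel(), gt.ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return np.mean(np.abs(s * p + t - g))

pred = np.array([[0.2, 0.4], [0.6, 0.8]])
gt = 2.0 * pred + 1.0          # gt is an affine transform of pred
print(ssi_loss(pred, gt))      # → ~0: loss ignores global scale and shift
```

Such losses supervise relative depth while leaving the global scale and shift free, which matches the affine-invariant evaluation protocol commonly used in zero-shot depth benchmarks.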
Experimental Evaluation
The experimental section thoroughly validates the proposed approach across multiple datasets, including KITTI, NYUv2, ETH3D, rabbitai, and a curated subset of nuScenes termed nuScenes-C. These evaluations reveal two major findings:
- Efficiency: PrimeDepth is shown to be over 100 times faster than Marigold, making it highly suitable for real-time applications.
- Performance: While Depth Anything remains quantitatively superior, PrimeDepth produces more detailed depth maps and is notably more robust, especially in challenging scenarios like nighttime scenes.
Comparative Analysis
PrimeDepth is contrasted with state-of-the-art methods such as Depth Anything and Marigold. Depth Anything, while superior in terms of absolute performance metrics, relies on an extensive training dataset of 1.5 million labeled images. PrimeDepth, in contrast, achieves comparable performance using only 74,000 synthetic training images.
Implications and Future Work
The results underscore the potential for integrating preimage representations from generative models into other downstream tasks, suggesting that future methods could benefit from a similar approach. Additionally, the complementary nature of data-driven and diffusion-based approaches opens up promising avenues for future research, especially in tasks requiring robust generalization across diverse domains and conditions.
Conclusion
PrimeDepth demonstrates that the rich representations of a pre-trained SD model can be harnessed effectively for zero-shot monocular depth estimation. Its single-step diffusion process, coupled with an architecturally biased refiner network, narrows the gap between computational efficiency and high-quality depth estimation, marking a significant advancement in the field.
In summary, the PrimeDepth approach not only advances zero-shot monocular depth estimation but also illustrates the broader potential of leveraging generative models in solving intricate vision tasks efficiently. This work prompts further exploration into the integration of preimage representations into other model architectures and downstream applications, paving the way for more versatile and robust AI solutions.