
Zero-Shot Metric Depth with a Field-of-View Conditioned Diffusion Model (2312.13252v1)

Published 20 Dec 2023 in cs.CV

Abstract: While methods for monocular depth estimation have made significant strides on standard benchmarks, zero-shot metric depth estimation remains unsolved. Challenges include the joint modeling of indoor and outdoor scenes, which often exhibit significantly different distributions of RGB and depth, and the depth-scale ambiguity due to unknown camera intrinsics. Recent work has proposed specialized multi-head architectures for jointly modeling indoor and outdoor scenes. In contrast, we advocate a generic, task-agnostic diffusion model, with several advancements such as log-scale depth parameterization to enable joint modeling of indoor and outdoor scenes, conditioning on the field-of-view (FOV) to handle scale ambiguity and synthetically augmenting FOV during training to generalize beyond the limited camera intrinsics in training datasets. Furthermore, by employing a more diverse training mixture than is common, and an efficient diffusion parameterization, our method, DMD (Diffusion for Metric Depth), achieves a 25% reduction in relative error (REL) on zero-shot indoor and 33% reduction on zero-shot outdoor datasets over the current SOTA using only a small number of denoising steps. For an overview see https://diffusion-vision.github.io/dmd

Authors (5)
  1. Saurabh Saxena (15 papers)
  2. Junhwa Hur (20 papers)
  3. Charles Herrmann (33 papers)
  4. Deqing Sun (68 papers)
  5. David J. Fleet (47 papers)
Citations (20)

Summary

  • The paper introduces DMD, a diffusion-based model that leverages log-scale depth parameterization and FOV conditioning to address zero-shot metric depth estimation.
  • It employs synthetic FOV augmentation and diverse indoor-outdoor datasets to handle unknown camera intrinsics and mitigate scale ambiguity.
  • The model achieves significant reductions in relative depth error on benchmarks, outperforming current state-of-the-art methods.

Background

Monocular depth estimation is a crucial task in computer vision, with applications ranging from mobile robotics to autonomous driving. The task involves predicting the distance from the camera to each point in the scene from a single image, a challenging problem because of the inherent ambiguity in inferring 3D structure from 2D data. Past research has developed models that specialize in either indoor or outdoor environments but struggle to accommodate both simultaneously. Moreover, these models falter when camera intrinsics, which are essential for metric depth estimation, are unknown.

The Diffusion Model Approach

A novel diffusion model, termed Diffusion for Metric Depth (DMD), is introduced to address the challenge of zero-shot metric depth estimation across varied settings. A diffusion model is a type of generative model that has shown promising results across various tasks in computer vision. To create a more universally applicable depth estimator, several advancements have been integrated into DMD:

  • Log-scale depth parameterization is employed, allowing the model to represent more effectively both the near distances typical of indoor scenes and the far distances typical of outdoor scenes (see the sketch after this list).
  • The model is conditioned on the field-of-view (FOV) to address scale ambiguity, which arises when the camera intrinsics are unknown.
  • To enhance the ability of the model to generalize beyond the specific cameras used in training datasets, synthetic FOV augmentation is conducted during training.
  • By utilizing a diverse mixture of training data and efficient diffusion parameterization, DMD outperforms the state-of-the-art models on zero-shot benchmarks, achieving significant reductions in relative depth error.
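
To make the log-scale parameterization concrete, the following is a minimal sketch of how metric depth could be mapped into the normalized range a diffusion model typically predicts over. The depth bounds `D_MIN`/`D_MAX` and the [-1, 1] target interval are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

# Assumed depth bounds spanning indoor and outdoor scenes (illustrative only).
D_MIN, D_MAX = 0.5, 80.0  # metres

def encode_log_depth(depth_m: np.ndarray) -> np.ndarray:
    """Map metric depth to [-1, 1] in log space."""
    d = np.clip(depth_m, D_MIN, D_MAX)
    t = (np.log(d) - np.log(D_MIN)) / (np.log(D_MAX) - np.log(D_MIN))  # in [0, 1]
    return 2.0 * t - 1.0                                                # in [-1, 1]

def decode_log_depth(x: np.ndarray) -> np.ndarray:
    """Invert the parameterization back to metric depth."""
    t = (np.clip(x, -1.0, 1.0) + 1.0) / 2.0
    return np.exp(t * (np.log(D_MAX) - np.log(D_MIN)) + np.log(D_MIN))
```

Relative to a linear mapping, the log transform allocates comparable representational resolution to nearby indoor depths and distant outdoor depths, which is the motivation the paper gives for this parameterization.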

Model Training and Performance

DMD's training uses a mixture of indoor and outdoor datasets, with care taken to expose the model to diverse camera perspectives. Conditioning the model on the vertical field-of-view allows it to infer scale appropriately and to handle unknown camera intrinsics more robustly. The paper illustrates the effect of these design choices, reporting quantitative reductions in relative depth error over the current best models across several zero-shot datasets.
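
As a rough illustration of the relationship between cropping and vertical field-of-view that synthetic FOV augmentation exploits, here is a hedged sketch assuming a simple pinhole camera; the crop range, the use of central crops, and the exact way the FOV value is supplied to the network are assumptions for illustration, not details from the paper.

```python
import numpy as np

def vertical_fov_after_crop(fov_v_deg: float, crop_fraction: float) -> float:
    """New vertical FOV after a central crop keeping `crop_fraction` of the
    image height, assuming a pinhole camera with fixed focal length."""
    half = np.deg2rad(fov_v_deg) / 2.0
    return float(np.rad2deg(2.0 * np.arctan(crop_fraction * np.tan(half))))

def augment_fov(image: np.ndarray, depth: np.ndarray, fov_v_deg: float,
                rng: np.random.Generator):
    """Synthetic FOV augmentation via a random central crop (illustrative only).
    Returns the cropped image/depth and the FOV value the model would be
    conditioned on."""
    s = rng.uniform(0.6, 1.0)  # assumed crop range
    h, w = depth.shape[:2]
    new_h, new_w = int(round(s * h)), int(round(s * w))
    top, left = (h - new_h) // 2, (w - new_w) // 2
    image = image[top:top + new_h, left:left + new_w]
    depth = depth[top:top + new_h, left:left + new_w]
    return image, depth, vertical_fov_after_crop(fov_v_deg, s)
```

The key point is that the metric depth values are unchanged by the crop, so the same scene content becomes paired with a narrower FOV; conditioning on that FOV is what lets the model resolve the resulting scale ambiguity.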

Conclusions and Contributions

The DMD model provides a generalized approach to zero-shot metric depth estimation. Its log-scale depth representation, FOV conditioning, and synthetic FOV augmentation make better use of the model's capacity, improve robustness to a range of camera intrinsics, and yield more accurate depth estimates overall. The paper demonstrates how these contributions set a new standard in zero-shot metric depth estimation, with significant performance gains over contemporary work.
