- The paper introduces DMD, a diffusion-based model that leverages log-scale depth parameterization and FOV conditioning to address zero-shot metric depth estimation.
- It employs synthetic FOV augmentation and diverse indoor-outdoor datasets to handle unknown camera intrinsics and mitigate scale ambiguity.
- The model achieves significant reductions in relative depth error on benchmarks, outperforming current state-of-the-art methods.
Background
Monocular depth estimation is a crucial task in computer vision with applications ranging from mobile robotics to autonomous driving. The task involves predicting the distance from the camera to each point in the scene using just one image, a challenging problem due to the inherent ambiguity of inferring 3D information from 2D data. Prior work has produced models that specialize in either indoor or outdoor environments but struggle to handle both at once. These models also have difficulty when the camera intrinsics, which are vital for depth estimation, are unknown.
The Diffusion Model Approach
A novel diffusion model, termed Diffusion for Metric Depth (DMD), is introduced to address the challenge of zero-shot metric depth estimation across varied settings. A diffusion model is a type of generative model that has shown promising results across various tasks in computer vision. To create a more universally applicable depth estimator, several advancements have been integrated into DMD:
- Log-scale depth parameterization is employed, allowing the model to more effectively represent both near and far distances common to indoor and outdoor scenes, respectively.
- The model is conditioned on the field-of-view (FOV) to address scale ambiguity, which arises when camera intrinsics are unknown.
- To enhance the ability of the model to generalize beyond the specific cameras used in training datasets, synthetic FOV augmentation is conducted during training.
- By utilizing a diverse mixture of training data and efficient diffusion parameterization, DMD outperforms the state-of-the-art models on zero-shot benchmarks, achieving significant reductions in relative depth error.
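To make the log-scale parameterization in the list above concrete, here is a minimal sketch of one way to map metric depth into a bounded range for a diffusion model and back again. The depth bounds `D_MIN` and `D_MAX`, the `[-1, 1]` target range, and the function names are assumptions chosen for illustration; the summary does not state the exact formula the paper uses.

```python
import math

# Hypothetical depth range in metres. These bounds are illustrative
# assumptions, not values taken from the paper.
D_MIN = 0.5
D_MAX = 80.0


def encode_log_depth(depth_m, d_min=D_MIN, d_max=D_MAX):
    """Map a metric depth to [-1, 1] on a logarithmic scale."""
    clipped = min(max(depth_m, d_min), d_max)
    # Fraction of the log-range covered by this depth, in [0, 1].
    unit = math.log(clipped / d_min) / math.log(d_max / d_min)
    return 2.0 * unit - 1.0


def decode_log_depth(value, d_min=D_MIN, d_max=D_MAX):
    """Invert encode_log_depth: map a value in [-1, 1] back to metres."""
    unit = (min(max(value, -1.0), 1.0) + 1.0) / 2.0
    return d_min * (d_max / d_min) ** unit


# The same 0.5 m step occupies far more of the encoded range near the
# camera than far away, which is the motivation for the log scale.
near_step = encode_log_depth(1.0) - encode_log_depth(0.5)
far_step = encode_log_depth(80.0) - encode_log_depth(79.5)
```

Under a linear encoding over the same range, both steps would have equal size, so nearby indoor depths would be squeezed into a small fraction of the representation.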
Model Training and Performance
DMD is trained on a mixture of indoor and outdoor datasets, with a focus on exposing the model to diverse camera perspectives. Conditioning the model on the vertical field-of-view enables it to infer scale appropriately and to handle unknown camera intrinsics more robustly. The paper includes a visual overview of the effect of these features, reporting DMD's gains over the current best models as quantitative reductions in relative depth error across several datasets.
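The relationship between vertical FOV and camera intrinsics can be sketched with standard pinhole-camera geometry. The example below computes the vertical FOV from a focal length in pixels and shows how the FOV changes when an image is center-cropped or padded vertically, which is one plausible way to simulate different cameras during training. The function names and the crop-or-pad scheme are assumptions for illustration; the summary does not specify how the paper implements its synthetic FOV augmentation.

```python
import math


def vertical_fov(image_height_px, focal_length_px):
    """Vertical field of view (radians) of an ideal pinhole camera."""
    return 2.0 * math.atan(image_height_px / (2.0 * focal_length_px))


def fov_after_vertical_rescale(vfov_rad, scale):
    """Vertical FOV after changing the image height by `scale`.

    scale < 1 models a center crop (narrower FOV); scale > 1 models
    padding the image (wider FOV). The focal length is unchanged.
    """
    return 2.0 * math.atan(scale * math.tan(vfov_rad / 2.0))


# A 480-pixel-tall image with a 240-pixel focal length has a 90 degree
# vertical FOV; cropping it to half its height narrows the FOV.
original = vertical_fov(480, 240)
cropped = fov_after_vertical_rescale(original, 0.5)
```

The cropped FOV matches what `vertical_fov(240, 240)` returns directly, so the conditioning value fed to the model can be kept consistent with the augmented image.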
Conclusions and Contributions
DMD provides a general approach to zero-shot metric depth estimation. Its reliance on log-scale depth representation, FOV augmentation, and FOV conditioning leads to better use of the model's capacity, robustness to a range of camera intrinsics, and more accurate depth estimates overall. The paper shows that these contributions set a new standard in depth estimation, with significant performance gains over contemporary works.