Depth Pro: Sharp Monocular Metric Depth in Less Than a Second (2410.02073v2)

Published 2 Oct 2024 in cs.CV and cs.LG

Abstract: We present a foundation model for zero-shot metric monocular depth estimation. Our model, Depth Pro, synthesizes high-resolution depth maps with unparalleled sharpness and high-frequency details. The predictions are metric, with absolute scale, without relying on the availability of metadata such as camera intrinsics. And the model is fast, producing a 2.25-megapixel depth map in 0.3 seconds on a standard GPU. These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction, a training protocol that combines real and synthetic datasets to achieve high metric accuracy alongside fine boundary tracing, dedicated evaluation metrics for boundary accuracy in estimated depth maps, and state-of-the-art focal length estimation from a single image. Extensive experiments analyze specific design choices and demonstrate that Depth Pro outperforms prior work along multiple dimensions. We release code and weights at https://github.com/apple/ml-depth-pro

Citations (15)

Summary

  • The paper introduces Depth Pro, a model that produces high-resolution, zero-shot metric depth maps with unmatched sharpness in just 0.3 seconds.
  • It employs an efficient multi-scale Vision Transformer with a hybrid training protocol to capture global context and fine boundary details.
  • The model achieves state-of-the-art cross-domain performance, making it ideal for applications in VR, AR, and advanced image editing.

Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

The paper presents Depth Pro, a model for zero-shot metric monocular depth estimation. Depth Pro synthesizes high-resolution depth maps with sharp boundaries and fine high-frequency detail, predicting absolute scale without relying on metadata such as camera intrinsics. The model is also fast, producing a 2.25-megapixel depth map in roughly 0.3 seconds on a standard GPU.
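
Code and weights are released at the repository linked in the abstract. The snippet below is a minimal usage sketch following the pattern in that repository's README; the function names (`create_model_and_transforms`, `load_rgb`, `infer`), the returned dictionary keys, and the example image path are assumptions that should be verified against the released code.

```python
import depth_pro

image_path = "example.jpg"  # hypothetical input path for illustration

# Load the pretrained model and its preprocessing transform.
model, transform = depth_pro.create_model_and_transforms()
model.eval()

# Load the image; f_px is the focal length in pixels when EXIF metadata provides it.
image, _, f_px = depth_pro.load_rgb(image_path)
image = transform(image)

# Run inference; if f_px is None, the model falls back to its own focal length estimate.
prediction = model.infer(image, f_px=f_px)
depth = prediction["depth"]                    # metric depth in meters
focallength_px = prediction["focallength_px"]  # estimated focal length in pixels
```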

Key Contributions

The development of Depth Pro is supported by several technical innovations:

  1. Efficient Multi-Scale Vision Transformer: The model uses a multi-scale ViT architecture for dense prediction, allowing it to capture both global context and fine image details (see the sketch after this list).
  2. Hybrid Training Protocol: The training process combines real and synthetic datasets, ensuring high metric accuracy along with detailed boundary tracing.
  3. Boundary Accuracy Evaluation Metrics: New metrics are introduced to assess the boundary accuracy in depth maps, a critical factor for applications like novel view synthesis.
  4. Advanced Focal Length Estimation: The model includes a state-of-the-art method for estimating focal length from a single image, outperforming previous approaches in cross-domain evaluation tasks.
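
To make the multi-scale encoder from item 1 concrete, the toy sketch below illustrates the general idea rather than the actual Depth Pro architecture: the image is downsampled to several scales, each scale is split into fixed-size tiles, a shared encoder processes all tiles, and the per-scale feature maps are reassembled and fused at a common resolution. The tile size, scale factors, and the small placeholder transformer are illustrative assumptions; the paper uses a pretrained ViT and a dense-prediction decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyPatchEncoder(nn.Module):
    """Stand-in for a shared ViT: embeds fixed-size tiles and returns per-tile feature maps."""

    def __init__(self, patch=32, dim=64):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patchify
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, tiles):                                # tiles: (N, 3, T, T)
        tok = self.embed(tiles)                              # (N, dim, h, w)
        n, d, h, w = tok.shape
        tok = self.encoder(tok.flatten(2).transpose(1, 2))   # (N, h*w, dim)
        return tok.transpose(1, 2).reshape(n, d, h, w)


def split_into_tiles(x, tile):
    """Split (B, C, H, W) into non-overlapping tiles; H and W must be multiples of `tile`."""
    b, c, h, w = x.shape
    x = x.unfold(2, tile, tile).unfold(3, tile, tile)        # (B, C, H//T, W//T, T, T)
    x = x.permute(0, 2, 3, 1, 4, 5).reshape(-1, c, tile, tile)
    return x, h // tile, w // tile


def merge_tiles(feats, rows, cols, batch):
    """Reassemble per-tile feature maps into a full-image feature map."""
    n, d, h, w = feats.shape
    feats = feats.reshape(batch, rows, cols, d, h, w)
    return feats.permute(0, 3, 1, 4, 2, 5).reshape(batch, d, rows * h, cols * w)


def multiscale_features(image, encoder, scales=(1.0, 0.5, 0.25), tile=256):
    """Encode fixed-size tiles at several image scales and fuse the resulting feature maps."""
    b = image.shape[0]
    outputs = []
    for s in scales:
        scaled = image if s == 1.0 else F.interpolate(image, scale_factor=s, mode="bilinear")
        tiles, rows, cols = split_into_tiles(scaled, tile)
        feats = merge_tiles(encoder(tiles), rows, cols, b)
        if outputs:  # upsample coarser scales to the finest scale's resolution
            feats = F.interpolate(feats, size=outputs[0].shape[-2:], mode="bilinear")
        outputs.append(feats)
    return torch.cat(outputs, dim=1)  # simple channel-wise fusion; the paper uses a dense decoder


if __name__ == "__main__":
    img = torch.randn(1, 3, 1024, 1024)  # sized so every scale is a multiple of the tile size
    print(multiscale_features(img, ToyPatchEncoder()).shape)  # torch.Size([1, 192, 32, 32])
```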

Results and Analysis

Depth Pro is evaluated against other state-of-the-art systems on multiple datasets, including Booster, Middlebury, Sun-RGBD, and nuScenes, demonstrating superior zero-shot metric depth accuracy. Notably, it achieves the best average rank across datasets, highlighting its strong generalization.
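
For reference, the sketch below shows two standard accuracy measures commonly reported in zero-shot metric depth evaluations, absolute relative error and the δ < 1.25 inlier ratio. It is illustrative only; the paper's exact per-dataset protocol (validity masks, depth ranges, and the ranking procedure) is not reproduced here.

```python
import numpy as np


def abs_rel(pred, gt, mask=None):
    """Mean absolute relative error: mean(|pred - gt| / gt) over valid pixels."""
    if mask is None:
        mask = gt > 0
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))


def delta1(pred, gt, mask=None, thresh=1.25):
    """Fraction of valid pixels where max(pred / gt, gt / pred) is below the threshold."""
    if mask is None:
        mask = gt > 0
    ratio = np.maximum(pred[mask] / gt[mask], gt[mask] / pred[mask])
    return float(np.mean(ratio < thresh))


# Synthetic example: ground truth in meters with ~10% multiplicative noise on the prediction.
gt = np.random.uniform(0.5, 10.0, size=(480, 640))
pred = gt * np.random.uniform(0.9, 1.1, size=gt.shape)
print(f"AbsRel: {abs_rel(pred, gt):.3f}, delta1: {delta1(pred, gt):.3f}")
```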

In boundary accuracy tests, Depth Pro outperforms both metric and relative depth estimation methods. The evaluation uses datasets that emphasize high-frequency structures such as hair and vegetation, where Depth Pro consistently achieves higher boundary recall. This capability is vital for applications that require precise occlusion boundaries.
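
The snippet below sketches one simplified boundary-recall style measure in the spirit of these metrics, assuming that occluding boundaries are marked wherever the depth ratio between horizontally or vertically adjacent pixels exceeds a threshold. The paper's actual definitions, including precision/F1 variants and the handling of matting-style ground truth, should be taken from the paper itself.

```python
import numpy as np


def occlusion_boundaries(depth, t=1.05):
    """Binary map of pixels whose depth ratio to a right or down neighbor exceeds t (depth > 0)."""
    edges = np.zeros(depth.shape, dtype=bool)
    ratio_x = np.maximum(depth[:, 1:] / depth[:, :-1], depth[:, :-1] / depth[:, 1:])
    ratio_y = np.maximum(depth[1:, :] / depth[:-1, :], depth[:-1, :] / depth[1:, :])
    edges[:, :-1] |= ratio_x > t
    edges[:-1, :] |= ratio_y > t
    return edges


def boundary_recall(pred_depth, gt_depth, t=1.05):
    """Fraction of ground-truth occlusion-boundary pixels that are also boundaries in the prediction."""
    gt_edges = occlusion_boundaries(gt_depth, t)
    pred_edges = occlusion_boundaries(pred_depth, t)
    if not gt_edges.any():
        return float("nan")
    return float((pred_edges & gt_edges).sum() / gt_edges.sum())
```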

Runtime is another strength. Despite producing substantially higher native-resolution output than many competitors, Depth Pro keeps execution time low, which supports practical use in real-world pipelines.
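
As a rough pattern for checking latency on one's own hardware, the sketch below times repeated forward passes with explicit GPU synchronization. It assumes `model`, `image`, and `f_px` from the earlier usage sketch, already on a CUDA device, and is not the paper's benchmarking protocol.

```python
import time
import torch

with torch.no_grad():
    for _ in range(3):                      # warm-up passes (lazy initialization, autotuning)
        model.infer(image, f_px=f_px)
    torch.cuda.synchronize()                # ensure warm-up kernels have finished
    runs = 10
    start = time.perf_counter()
    for _ in range(runs):
        model.infer(image, f_px=f_px)
    torch.cuda.synchronize()                # wait for all timed kernels before stopping the clock
    print(f"average latency: {(time.perf_counter() - start) / runs:.3f} s")
```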

Implications and Future Directions

Depth Pro's advancements in zero-shot metric depth estimation have significant implications for fields requiring high-precision depth maps, such as virtual reality and augmented reality. The accurate and sharp depth maps support enhanced image editing, improved view synthesis, and better conditional image generation.

Future research could address current limitations, such as handling translucent surfaces and volumetric phenomena, and further optimize the model. Integrating Depth Pro into real-time systems and exploring additional applications are also promising directions.

Overall, Depth Pro represents a notable step forward in monocular depth estimation, with the potential to advance both research and practical applications that depend on accurate, sharp depth maps.
