- The paper introduces Depth Pro, a model that produces high-resolution, zero-shot metric depth maps with unmatched sharpness in just 0.3 seconds.
- It employs an efficient multi-scale Vision Transformer with a hybrid training protocol to capture global context and fine boundary details.
- The model achieves state-of-the-art cross-domain performance, making it ideal for applications in VR, AR, and advanced image editing.
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
The paper presents Depth Pro, a model for zero-shot metric monocular depth estimation. Depth Pro synthesizes high-resolution depth maps with unmatched sharpness and high-frequency detail, and it recovers absolute (metric) scale without relying on metadata such as camera intrinsics. It is also fast, producing a 2.25-megapixel depth map in roughly 0.3 seconds on a standard GPU.
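For readers who want to try the model, a minimal inference sketch is shown below. It assumes the interface published with the official apple/ml-depth-pro release (`create_model_and_transforms`, `load_rgb`, `model.infer`); treat the exact names and return keys as assumptions and check the repository README for the current API.

```python
# Minimal inference sketch, assuming the apple/ml-depth-pro package interface.
import depth_pro

# Load the pretrained model and its matching preprocessing transform.
model, transform = depth_pro.create_model_and_transforms()
model.eval()

# load_rgb returns the image and, when present in EXIF, the focal length in pixels;
# Depth Pro can also estimate the focal length itself when f_px is None.
image, _, f_px = depth_pro.load_rgb("example.jpg")
prediction = model.infer(transform(image), f_px=f_px)

depth_m = prediction["depth"]            # metric depth map in meters
focal_px = prediction["focallength_px"]  # estimated focal length in pixels
```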
Key Contributions
The development of Depth Pro is supported by several technical innovations:
- Efficient Multi-Scale Vision Transformer: The model applies a plain ViT backbone to image tiles extracted at multiple scales and fuses the results into a single dense prediction, capturing global context while preserving fine image detail (see the tiling sketch after this list).
- Hybrid Training Protocol: The training curriculum combines real and synthetic datasets, yielding high metric accuracy together with sharp boundary tracing.
- Boundary Accuracy Evaluation Metrics: New metrics are introduced to assess the boundary accuracy in depth maps, a critical factor for applications like novel view synthesis.
- Advanced Focal Length Estimation: The model includes a state-of-the-art method for estimating focal length from a single image, outperforming previous approaches in cross-domain evaluation tasks.
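The tiling idea behind the multi-scale architecture can be illustrated with a short sketch. The snippet below is an assumption-laden illustration, not Depth Pro's actual code: it resizes a high-resolution input to several scales and cuts each scale into fixed-size tiles so that one shared, fixed-input-size ViT encoder can process all of them. Depth Pro additionally overlaps the tiles and fuses the per-scale features into a single dense depth prediction; the specific scale and patch sizes here are placeholders.

```python
import torch
import torch.nn.functional as F

def multiscale_patches(image, patch=384, scales=(1536, 768, 384)):
    """Illustrative tiling only: resize the input to several scales and split each
    into fixed-size tiles so a plain ViT with a fixed input size can encode them.
    Depth Pro uses overlapping tiles and different scale choices."""
    levels = []
    for s in scales:
        resized = F.interpolate(image, size=(s, s), mode="bilinear", align_corners=False)
        # Unfold into non-overlapping patch x patch tiles: (b, c, nh, nw, patch, patch).
        tiles = resized.unfold(2, patch, patch).unfold(3, patch, patch)
        b, c, nh, nw, ph, pw = tiles.shape
        # Collect tiles as a batch dimension: (b, num_tiles, c, patch, patch).
        levels.append(tiles.reshape(b, c, nh * nw, ph, pw).permute(0, 2, 1, 3, 4))
    return levels

img = torch.rand(1, 3, 1536, 1536)
for level, tiles in enumerate(multiscale_patches(img)):
    print(f"scale level {level}: {tiles.shape[1]} tiles of shape {tuple(tiles.shape[2:])}")
```

Because every tile has the same resolution, the cost of attention stays fixed per tile while the effective input resolution grows, which is what keeps the high-resolution prediction fast.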
Results and Analysis
Depth Pro is evaluated against other state-of-the-art systems on multiple datasets, including Booster, Middlebury, Sun-RGBD, and nuScenes, demonstrating superior zero-shot metric depth accuracy. Notably, it achieves the best average rank across these datasets, highlighting strong generalization.
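For context, zero-shot metric accuracy in this literature is typically reported with measures such as the absolute relative error (AbsRel) and the δ < 1.25 inlier ratio. The sketch below computes these standard quantities over valid ground-truth pixels; the paper's exact per-dataset protocol may differ.

```python
import numpy as np

def metric_depth_errors(pred, gt):
    """Standard monocular-depth error measures over valid ground-truth pixels.
    pred, gt: metric depth in meters; gt <= 0 marks invalid pixels."""
    valid = gt > 0
    p, g = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(p - g) / g)       # absolute relative error
    ratio = np.maximum(p / g, g / p)
    delta1 = np.mean(ratio < 1.25)             # fraction within 25% of ground truth
    return {"AbsRel": float(abs_rel), "delta1": float(delta1)}

# Toy example; -1 marks a missing ground-truth measurement.
pred = np.array([[1.0, 2.1], [3.3, 0.9]])
gt = np.array([[1.0, 2.0], [3.0, -1.0]])
print(metric_depth_errors(pred, gt))
```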
In boundary accuracy tests, Depth Pro outperforms both metric and relative (scale-invariant) depth estimation methods. The evaluation involves datasets with fine, high-frequency structures such as hair and vegetation, where Depth Pro consistently achieves higher boundary recall. This capability is vital for applications that need precise occlusion boundaries.
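The boundary evaluation follows the same spirit as the sketch below: occluding edges are derived from the ratio of neighboring depth values, and a prediction is scored by how many ground-truth edges it reproduces. The snippet is a simplified, assumption-laden illustration (one threshold, horizontal neighbors only, no foreground/background ordering), not the paper's full metric.

```python
import numpy as np

def occlusion_edges(depth, t=1.05):
    """Mark an occluding edge between horizontally adjacent pixels when their
    depth ratio exceeds t (closer/farther ordering is ignored in this sketch)."""
    left, right = depth[:, :-1], depth[:, 1:]
    ratio = np.maximum(left, right) / np.minimum(left, right)
    return ratio > t

def boundary_recall(pred_depth, gt_depth, t=1.05):
    """Fraction of ground-truth occlusion edges reproduced by the prediction."""
    gt_e = occlusion_edges(gt_depth, t)
    pred_e = occlusion_edges(pred_depth, t)
    return float((gt_e & pred_e).sum() / max(gt_e.sum(), 1))

# Toy example: a 1 m foreground strip in front of a 3 m background.
gt = np.full((4, 6), 3.0); gt[:, 2:4] = 1.0
pred_sharp = gt.copy()
pred_blurry = np.full((4, 6), 3.0); pred_blurry[:, 1:5] = 2.0  # smeared boundary
print(boundary_recall(pred_sharp, gt), boundary_recall(pred_blurry, gt))
```

The toy case shows why the metric rewards sharpness: the exact prediction recovers every ground-truth edge, while the smeared prediction places its transitions in the wrong columns and recalls none of them.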
Runtime is another strength. Despite producing native outputs at a much higher resolution than many competitors, Depth Pro keeps inference time low, which makes it practical for real-world and interactive use.
Implications and Future Directions
Depth Pro's advancements in zero-shot metric depth estimation have significant implications for fields requiring high-precision depth maps, such as virtual reality and augmented reality. The accurate and sharp depth maps support enhanced image editing, improved view synthesis, and better conditional image generation.
Future research could address current limitations, such as translucent surfaces and volumetric scattering, where a single depth value per pixel is ill-defined, and could explore integration into real-time systems and further downstream applications.
Overall, Depth Pro represents a substantial step forward in monocular depth estimation, combining metric accuracy, boundary sharpness, and speed in a single zero-shot model.