Zero-Shot Metric Depth Estimation from Any Camera
The paper "Depth Any Camera: Zero-Shot Metric Depth Estimation from Any Camera" presents Depth Any Camera (DAC), a framework for zero-shot metric depth estimation across diverse camera types, particularly those with large fields of view (FoV) such as fisheye and 360-degree cameras. Existing approaches have demonstrated strong zero-shot generalization, but primarily on perspective images. DAC addresses the challenge of maintaining metric accuracy across varying camera FoVs without requiring training on fisheye or 360-degree images.
Methodological Contributions
DAC introduces a robust zero-shot metric depth estimation method that extends a model trained solely on perspective images to large-FoV cameras, without requiring specialized training data for non-standard camera types. The core innovations of DAC can be summarized as follows:
- Unified Image Representation with ERP: DAC uses Equirectangular Projection (ERP) as a universal image representation, accommodating varying FoVs under a single, consistent transformation framework.
- Pitch-aware Image-to-ERP Conversion: This technique efficiently converts perspective images into ERP patches, incorporating the camera pitch to improve generalization.
- FoV Alignment: This process normalizes training across different FoVs by resizing training data to a standard ERP patch size, thereby reducing redundant computation.
- Multi-Resolution Training: By addressing resolution mismatches between training and testing, this mechanism allows scale-equivariant feature learning, promoting adaptability to diverse testing conditions.
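The pitch-aware perspective-to-ERP conversion above can be sketched as follows. This is a minimal illustration, assuming a simple pinhole camera model and nearest-neighbour sampling; the function name, axis conventions, and parameters are illustrative, not the paper's actual implementation:

```python
import numpy as np

def perspective_to_erp_patch(img, fx, fy, cx, cy, pitch,
                             patch_h, patch_w, fov_h, fov_v):
    """Resample a pinhole (perspective) image onto an ERP patch.

    pitch is the camera pitch in radians; fov_h / fov_v are the angular
    extents (radians) of the ERP patch. Illustrative sketch only.
    """
    # Longitude/latitude grid of the ERP patch, centered on the optical axis.
    lon = np.linspace(-fov_h / 2, fov_h / 2, patch_w)
    lat = np.linspace(fov_v / 2, -fov_v / 2, patch_h)
    lon, lat = np.meshgrid(lon, lat)

    # Unit ray directions on the sphere (x right, y down, z forward).
    x = np.cos(lat) * np.sin(lon)
    y = -np.sin(lat)
    z = np.cos(lat) * np.cos(lon)

    # Rotate rays about the x-axis to account for camera pitch.
    cp, sp = np.cos(pitch), np.sin(pitch)
    y_rot = cp * y - sp * z
    z_rot = sp * y + cp * z

    # Pinhole projection of each ray into the source image plane.
    u = fx * x / z_rot + cx
    v = fy * y_rot / z_rot + cy

    # Nearest-neighbour sampling with a validity mask for out-of-frame rays.
    valid = (z_rot > 0) & (u >= 0) & (u < img.shape[1]) \
            & (v >= 0) & (v < img.shape[0])
    ui = np.clip(np.rint(u).astype(int), 0, img.shape[1] - 1)
    vi = np.clip(np.rint(v).astype(int), 0, img.shape[0] - 1)
    out = np.zeros((patch_h, patch_w) + img.shape[2:], dtype=img.dtype)
    out[valid] = img[vi[valid], ui[valid]]
    return out, valid
```

At inference on an actual fisheye or 360-degree input, the same idea runs in reverse (large-FoV image to ERP), and the FoV-alignment step above standardizes the resulting patch size across training sources.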
Experimental Evaluation
DAC was rigorously evaluated against state-of-the-art methods, such as Metric3Dv2 and UniDepth, on both indoor and outdoor datasets spanning multiple camera types. DAC consistently demonstrated superior zero-shot performance, improving delta-1 accuracy by up to 50% on several fisheye and 360-degree datasets, including Matterport3D and Pano3D-GV2. These results underline DAC's ability to handle diverse camera configurations, surpassing prior methods despite never training on large-FoV imagery.
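For reference, delta-1 (δ1) is the standard depth-estimation accuracy measure: the fraction of pixels whose ratio between predicted and ground-truth depth stays below 1.25. A minimal implementation (the function name and masking convention are illustrative):

```python
import numpy as np

def delta1_accuracy(pred, gt):
    """Fraction of pixels with max(pred/gt, gt/pred) < 1.25.

    Only pixels with valid (positive) ground-truth depth are scored.
    """
    mask = gt > 0
    ratio = np.maximum(pred[mask] / gt[mask], gt[mask] / pred[mask])
    return float((ratio < 1.25).mean())
```

Higher-order variants (δ2, δ3) use thresholds of 1.25² and 1.25³ respectively.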
Practical and Theoretical Implications
Practically, DAC offers a comprehensive solution where existing 3D datasets remain valuable across new applications irrespective of the camera type employed, which is crucial for sectors like autonomous driving and augmented reality. Theoretically, DAC's ERP-based image representation provides new avenues for exploring zero-shot capabilities by bridging distinct FoV domains into a cohesive understanding, potentially informing future architectures in contrastive and self-supervised learning settings.
Future Directions
The DAC framework positions itself as a foundation for vision systems requiring cross-domain adaptability. Future work may focus on refining the ERP conversion process for higher precision, adapting DAC to stricter real-time processing constraints, and exploring the integration of DAC with more complex network architectures, such as transformers, to further improve scale-equivariant learning.
In conclusion, DAC makes a significant stride in depth estimation by demonstrating robust zero-shot generalization across camera types, a crucial advancement for the continued integration and evolution of visual AI systems.