Zero-Shot Depth Estimation Combining Relative and Metric Approaches: ZoeDepth Framework
The paper presents ZoeDepth, a novel approach to single-image depth estimation that bridges the gap between relative and metric depth. The primary innovation is a two-stage training protocol: pre-training on a large collection of relative depth datasets, followed by fine-tuning on specific metric depth datasets, which significantly enhances both generalization and metric accuracy. This dual-stage process leverages the complementary strengths of relative and metric depth estimation, enabling robust performance across varied domains.
Overview of Previous Work
Single-Image Depth Estimation (SIDE) has traditionally been split into two branches: Metric Depth Estimation (MDE) and Relative Depth Estimation (RDE). MDE predicts depth in absolute units (e.g., meters), which is essential for applications requiring precise spatial measurements; however, MDE models often generalize poorly across datasets because of differences in depth scale and scene context. RDE, in contrast, produces depth estimates that are consistent relative to one another but carry no fixed metric meaning, permitting better cross-dataset generalization at the cost of limited practical use. Prior work typically improves one of these axes at the expense of the other, motivating a unified solution that achieves both high generalization and metric accuracy.
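The relative/metric distinction can be made concrete: a relative prediction is considered correct if it matches ground truth up to an unknown scale and shift, which is typically recovered by least-squares alignment before evaluation (as in the MiDaS protocol). A minimal sketch of that alignment, not taken from the ZoeDepth codebase:

```python
import numpy as np

def align_scale_shift(pred, gt):
    """Find scale s and shift t minimizing sum((s*pred + t - gt)**2).

    A relative depth map that is a perfect affine transform of the
    ground truth has zero error after this alignment, even though its
    raw values have no metric meaning.
    """
    A = np.stack([pred, np.ones_like(pred)], axis=1)  # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, gt, rcond=None)
    return s, t

# A relative prediction with correct ordering but wrong scale/shift:
gt = np.array([1.0, 2.0, 3.0, 4.0])       # metric depths (meters)
pred = 0.5 * gt + 0.25                    # same structure, arbitrary units
s, t = align_scale_shift(pred, gt)
assert np.allclose(s * pred + t, gt)      # perfect after alignment
```

A metric model would be evaluated on `pred` directly, with no alignment step, which is exactly why it must learn the true scale of each domain.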
Key Contributions
- Dual-Stage Training Framework: ZoeDepth employs a two-stage training approach. Initially, an encoder-decoder architecture (specifically, MiDaS) is pre-trained using multiple relative depth datasets (M12), which enhances generalization capabilities. This is followed by attaching domain-specific metric heads and fine-tuning on metric depth datasets—specifically NYU Depth v2 and KITTI.
- Metric Bins Module: Introduced to replace traditional regression heads, this module adopts an adaptive binning approach for metric depth estimation. Inspired by the LocalBins design, the approach involves predicting depth distributions at each pixel, followed by bin refinement through novel attractor layers, which adjust bin centers based on multi-scale features.
- Log-Binomial Probability Distribution: Instead of the traditional softmax, ZoeDepth employs a log-binomial distribution for computing probabilistic depth estimates. This considers the ordinal relationship between depth bins, addressing over-smoothing issues encountered in prior methods.
- Flexible Training Configurations and Routing: ZoeDepth supports multiple configurations for fine-tuning, allowing single or multiple metric heads for different datasets. An automatic routing mechanism directs input images to the relevant metric head based on learned latent features.
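The attractor refinement in the Metric Bins Module can be illustrated with a small sketch. The update below follows the "inverse attractor" formula reported in the paper, delta_c = sum_k (a_k - c) / (1 + alpha * |a_k - c|^gamma); the function name, the 1-D per-image shapes, and the default hyperparameter values are illustrative simplifications, not the authors' implementation:

```python
import numpy as np

def attractor_adjust(bin_centers, attractors, alpha=300.0, gamma=2.0):
    """Pull bin centers toward attractor points predicted from features.

    Nearby attractors pull a bin strongly; distant attractors barely
    move it, so the bin partition contracts around likely depth values
    at each decoder scale.

    bin_centers: (n_bins,) current bin centers for one image
    attractors:  (k,) attractor points for one image
    """
    diff = attractors[:, None] - bin_centers[None, :]   # (k, n_bins)
    delta = (diff / (1.0 + alpha * np.abs(diff) ** gamma)).sum(axis=0)
    return bin_centers + delta
```

In the full model this adjustment is applied per pixel and repeated across decoder scales, with the attractor points themselves predicted from multi-scale features.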
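The log-binomial head can likewise be sketched. The idea is to predict, per pixel, a mode parameter q and a temperature, and place a binomial distribution over the ordered bin indices instead of an unordered softmax, so neighboring bins receive similar probability mass. The helper below is a simplified single-pixel illustration under that reading of the paper, not the authors' code:

```python
import math
import numpy as np

def log_binomial_probs(q, n_bins, temperature=1.0):
    """Ordinal probabilities over depth bins k = 0..n_bins-1.

    q in (0, 1) controls where the mode lands among the ordered bins;
    temperature sharpens (small) or flattens (large) the distribution.
    Computed in log space for numerical stability.
    """
    k = np.arange(n_bins)
    n = n_bins - 1
    # log of the binomial coefficient C(n, k) via log-gamma
    log_coef = np.array([
        math.lgamma(n + 1) - math.lgamma(kk + 1) - math.lgamma(n - kk + 1)
        for kk in range(n_bins)
    ])
    log_pmf = log_coef + k * np.log(q) + (n - k) * np.log(1.0 - q)
    z = log_pmf / temperature          # temperature-scaled softmax
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()
```

Because the mass decays smoothly away from the mode, nearby bins stay correlated, which is the ordinal structure a plain softmax over independent logits ignores.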
Experimental Results
Performance on NYU Depth v2: The model ZoeD-M12-N, trained in the proposed framework, outperforms state-of-the-art models such as NeWCRFs by approximately 21% in absolute relative error (REL) on the NYU Depth v2 dataset, demonstrating the efficacy of the training strategy and model architecture. Even without the extensive relative-depth pre-training, the variant ZoeD-X-N still improves by 13.7%, showing that significant gains come from the architecture alone.
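For reference, REL here is the mean absolute relative error, so a "21% improvement" means the error value dropped by 21% relative to the baseline. A minimal implementation, on hypothetical depth arrays:

```python
import numpy as np

def rel_error(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt). Lower is better."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mask = gt > 0                      # skip invalid/missing ground truth
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

# Toy example with depths in meters; 0.0 marks a missing pixel:
gt = np.array([2.0, 4.0, 0.0])
pred = np.array([2.2, 3.8, 1.0])
# per-pixel errors: 0.1 and 0.05, so REL = 0.075
```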
Handling Multiple Datasets: ZoeDepth (ZoeD-M12-NK), trained jointly on indoor (NYU Depth v2) and outdoor (KITTI) domains with two separate metric heads, shows unprecedented zero-shot performance. It outperforms state-of-the-art models not only on the training datasets but also on multiple unseen datasets such as iBims-1 and DIML Outdoor, with relative improvements in REL of up to 976.4% (on DIML Outdoor), demonstrating robust generalization across highly variable scenes.
Ablation Studies: Various ablation studies reveal that each architectural and methodological enhancement—including the metric bins module, attractor layers, and log-binomial probability distribution—contributes significantly to the model's overall performance. The flexibility in switching encoder backbones without hampering performance ensures that ZoeDepth can leverage advances in backbone architectures, such as larger or more efficient transformer models.
Implications and Future Work
ZoeDepth sets a new benchmark in the domain of single-image depth estimation by successfully combining the benefits of relative and metric depth estimation. The implications are substantial for applications requiring high generalization across diverse environments and precise depth measurements, such as autonomous driving, robotics, and augmented reality.
Future research directions include scaling the approach to more granular domain-specific training beyond just indoor-outdoor distinctions, potentially leading to even higher performance and better generalization. Additionally, extending this framework to stereo-image depth estimation presents an exciting avenue for further research.
In summary, ZoeDepth provides a comprehensive and flexible solution for depth estimation, setting a new standard for balancing generalization and metric accuracy across varied datasets and application domains.