Zero-Shot Depth Estimation Combining Relative and Metric Approaches: ZoeDepth Framework
The paper presents ZoeDepth, a novel approach to single-image depth estimation that bridges the gap between relative and metric depth. The primary innovation is a two-stage training protocol: pre-training on a large collection of relative depth datasets, followed by fine-tuning on specific metric depth datasets, which significantly enhances both generalization and metric accuracy. This dual-stage process leverages the complementary strengths of relative and metric depth estimation, enabling robust performance across varied domains.
Overview of Previous Work
Single-Image Depth Estimation (SIDE) has traditionally been split into two branches: Metric Depth Estimation (MDE) and Relative Depth Estimation (RDE). MDE predicts depth in absolute units (e.g., meters), which is essential for applications requiring precise spatial measurements; however, MDE models often generalize poorly across datasets because of differences in depth scale and scene context. RDE, in contrast, produces depth estimates that are consistent relative to one another but carry no fixed metric meaning, permitting better cross-dataset generalization at the cost of limited practical use. Prior work typically improves one of these axes at the expense of the other, motivating a unified solution that achieves both high generalization and metric accuracy.
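The relative/metric distinction can be made concrete: a relative prediction is considered correct if it matches ground truth up to an unknown scale and shift, which is typically recovered by least-squares alignment before evaluation (as in the MiDaS protocol). A minimal sketch of that alignment, not taken from the ZoeDepth codebase:

```python
import numpy as np

def align_scale_shift(pred, gt):
    """Find scale s and shift t minimizing sum((s*pred + t - gt)**2).

    A relative depth map that is a perfect affine transform of the
    ground truth has zero error after this alignment, even though its
    raw values have no metric meaning.
    """
    A = np.stack([pred, np.ones_like(pred)], axis=1)  # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, gt, rcond=None)
    return s, t

# A relative prediction with correct ordering but wrong scale/shift:
gt = np.array([1.0, 2.0, 3.0, 4.0])       # metric depths (meters)
pred = 0.5 * gt + 0.25                    # same structure, arbitrary units
s, t = align_scale_shift(pred, gt)
assert np.allclose(s * pred + t, gt)      # perfect after alignment
```

A metric model would be evaluated on `pred` directly, with no alignment step, which is exactly why it must learn the true scale of each domain.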
Key Contributions
- Dual-Stage Training Framework: ZoeDepth employs a two-stage training approach. Initially, an encoder-decoder architecture (specifically, MiDaS) is pre-trained using multiple relative depth datasets (M12), which enhances generalization capabilities. This is followed by attaching domain-specific metric heads and fine-tuning on metric depth datasets—specifically NYU Depth v2 and KITTI.
- Metric Bins Module: Introduced to replace traditional regression heads, this module adopts an adaptive binning approach for metric depth estimation. Inspired by the LocalBins design, the approach involves predicting depth distributions at each pixel, followed by bin refinement through novel attractor layers, which adjust bin centers based on multi-scale features.
- Log-Binomial Probability Distribution: Instead of the traditional softmax, ZoeDepth employs a log-binomial distribution for computing probabilistic depth estimates. This considers the ordinal relationship between depth bins, addressing over-smoothing issues encountered in prior methods.
- Flexible Training Configurations and Routing: ZoeDepth supports multiple configurations for fine-tuning, allowing single or multiple metric heads for different datasets. An automatic routing mechanism directs input images to the relevant metric head based on learned latent features.
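The attractor refinement in the Metric Bins Module can be illustrated with a small sketch. The update below follows the "inverse attractor" formula reported in the paper, delta_c = sum_k (a_k - c) / (1 + alpha * |a_k - c|^gamma); the function name, the 1-D per-image shapes, and the default hyperparameter values are illustrative simplifications, not the authors' implementation:

```python
import numpy as np

def attractor_adjust(bin_centers, attractors, alpha=300.0, gamma=2.0):
    """Pull bin centers toward attractor points predicted from features.

    Nearby attractors pull a bin strongly; distant attractors barely
    move it, so the bin partition contracts around likely depth values
    at each decoder scale.

    bin_centers: (n_bins,) current bin centers for one image
    attractors:  (k,) attractor points for one image
    """
    diff = attractors[:, None] - bin_centers[None, :]   # (k, n_bins)
    delta = (diff / (1.0 + alpha * np.abs(diff) ** gamma)).sum(axis=0)
    return bin_centers + delta
```

In the full model this adjustment is applied per pixel and repeated across decoder scales, with the attractor points themselves predicted from multi-scale features.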
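The log-binomial head can likewise be sketched. The idea is to predict, per pixel, a mode parameter q and a temperature, and place a binomial distribution over the ordered bin indices instead of an unordered softmax, so neighboring bins receive similar probability mass. The helper below is a simplified single-pixel illustration under that reading of the paper, not the authors' code:

```python
import math
import numpy as np

def log_binomial_probs(q, n_bins, temperature=1.0):
    """Ordinal probabilities over depth bins k = 0..n_bins-1.

    q in (0, 1) controls where the mode lands among the ordered bins;
    temperature sharpens (small) or flattens (large) the distribution.
    Computed in log space for numerical stability.
    """
    k = np.arange(n_bins)
    n = n_bins - 1
    # log of the binomial coefficient C(n, k) via log-gamma
    log_coef = np.array([
        math.lgamma(n + 1) - math.lgamma(kk + 1) - math.lgamma(n - kk + 1)
        for kk in range(n_bins)
    ])
    log_pmf = log_coef + k * np.log(q) + (n - k) * np.log(1.0 - q)
    z = log_pmf / temperature          # temperature-scaled softmax
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()
```

Because the mass decays smoothly away from the mode, nearby bins stay correlated, which is the ordinal structure a plain softmax over independent logits ignores.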
Experimental Results
Performance on NYU Depth v2: The model ZoeD-M12-N, trained in the proposed framework, outperforms state-of-the-art models such as NeWCRFs by approximately 21% in absolute relative error (REL) on the NYU Depth v2 dataset, demonstrating the efficacy of the training strategy and model architecture. Even without the extensive relative-depth pre-training, the variant ZoeD-X-N still improves by 13.7%, showing that significant gains come from the architecture alone.
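For reference, REL here is the mean absolute relative error, so a "21% improvement" means the error value dropped by 21% relative to the baseline. A minimal implementation, on hypothetical depth arrays:

```python
import numpy as np

def rel_error(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt). Lower is better."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    mask = gt > 0                      # skip invalid/missing ground truth
    return float(np.mean(np.abs(pred[mask] - gt[mask]) / gt[mask]))

# Toy example with depths in meters; 0.0 marks a missing pixel:
gt = np.array([2.0, 4.0, 0.0])
pred = np.array([2.2, 3.8, 1.0])
# per-pixel errors: 0.1 and 0.05, so REL = 0.075
```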
Handling Multiple Datasets: ZoeDepth (ZoeD-M12-NK), trained jointly on indoor (NYU Depth v2) and outdoor (KITTI) domains with two separate metric heads, shows unprecedented zero-shot performance. It outperforms state-of-the-art models not only on the training datasets but also on multiple unseen datasets such as iBims-1 and DIML Outdoor, with relative improvements in REL of up to 976.4% (on DIML Outdoor), demonstrating robust generalization across highly variable scenes.
Ablation Studies: Various ablation studies reveal that each architectural and methodological enhancement—including the metric bins module, attractor layers, and log-binomial probability distribution—contributes significantly to the model's overall performance. The flexibility in switching encoder backbones without hampering performance ensures that ZoeDepth can leverage advances in backbone architectures, such as larger or more efficient transformer models.
Implications and Future Work
ZoeDepth sets a new benchmark in the domain of single-image depth estimation by successfully combining the benefits of relative and metric depth estimation. The implications are substantial for applications requiring high generalization across diverse environments and precise depth measurements, such as autonomous driving, robotics, and augmented reality.
Future research directions include scaling the approach to more granular domain-specific training beyond just indoor-outdoor distinctions, potentially leading to even higher performance and better generalization. Additionally, extending this framework to stereo-image depth estimation presents an exciting avenue for further research.
In summary, ZoeDepth provides a comprehensive and flexible solution for depth estimation, setting a new standard for balancing generalization and metric accuracy across varied datasets and application domains.