ZoeDepth: Unified Monocular Depth Estimation
- ZoeDepth is a monocular depth estimation framework that unifies scale-invariant pre-training with domain-specific metric fine-tuning to achieve both robust zero-shot transfer and precise metric predictions.
- It couples a transformer-based encoder–decoder, trained in two stages, with multi-scale bin refinement and attractor layer mechanisms for accurate metric depth estimation.
- Its modular design with automatic domain routing and binomial probability modeling enables reliable performance across diverse applications such as autonomous navigation and augmented reality.
ZoeDepth is a monocular depth estimation framework designed to bridge the gap between relative depth generalization and metric depth precision. It introduces a unified approach that combines scale-invariant pre-training on diverse datasets with domain-specific metric fine-tuning, delivering both robust zero-shot transfer capabilities and high-fidelity metric depth predictions. Key technical innovations include a modular architecture featuring a shared encoder–decoder (based on transformer backbones), novel multi-scale bin-based metric heads, attractor layer refinement, and an automatic domain routing mechanism. ZoeDepth achieves competitive or state-of-the-art results on varied benchmarks, including NYU Depth v2 and KITTI, and demonstrates unprecedented zero-shot generalization to unseen indoor and outdoor datasets.
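For orientation, the reference implementation is distributed through PyTorch Hub. The usage sketch below assumes the public `isl-org/ZoeDepth` entry points (`ZoeD_N`, `ZoeD_K`, `ZoeD_NK`) and the `infer_pil` convenience method documented in that repository; exact names may change across releases.

```python
import torch
from PIL import Image

# Load the NYU-fine-tuned model; "ZoeD_K" (KITTI) and "ZoeD_NK" (joint) also exist.
zoe = torch.hub.load("isl-org/ZoeDepth", "ZoeD_N", pretrained=True)
zoe = zoe.to("cuda" if torch.cuda.is_available() else "cpu").eval()

image = Image.open("example.jpg").convert("RGB")   # any RGB photo
depth = zoe.infer_pil(image)                       # metric depth in meters, (H, W) array
print(depth.shape, float(depth.min()), float(depth.max()))
```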
1. Design Principles and Architecture
ZoeDepth trains an encoder–decoder in two stages. The pre-training stage uses a MiDaS-style network with a transformer backbone (e.g., BEiT-L, Swin Transformer) trained with scale-invariant losses across twelve heterogeneous datasets. The decoder aggregates multi-scale image features, enabling the extraction of robust geometric representations agnostic to scene type.
Domain-specific metric heads—each realized via a metric bins module—are attached during fine-tuning. Metric heads are lightweight, facilitating efficient training and inference. Each head predicts a discrete set of depth bin centers $\{c_i\}_{i=1}^{N}$, which are subsequently refined using attractor points $\{a_k\}_{k=1}^{n_a}$ drawn from the decoder feature pyramids. The bin refinement process operates as follows (inverse attractor variant):

$$\Delta c_i = \sum_{k=1}^{n_a} \frac{a_k - c_i}{1 + \alpha\,|a_k - c_i|^{\gamma}}, \qquad c_i \leftarrow c_i + \Delta c_i,$$

where $\alpha$ and $\gamma$ control attraction strength. This multi-scale adjustment contracts bin centers towards higher accuracy without explicit hierarchical splitting.
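The following is a minimal sketch of this contraction step, assuming bin centers and attractor points have already been predicted per pixel; the tensor layout and the default values of `alpha` and `gamma` are illustrative, and the exponential branch mirrors the paper's alternative attractor variant.

```python
import torch

def attractor_update(c, a, alpha=300.0, gamma=2.0, kind="inverse"):
    """Contract bin centers toward attractor points.

    c: (B, N, H, W) current bin centers
    a: (B, n_a, H, W) attractor points from one decoder level
    alpha, gamma: attraction-strength hyperparameters (illustrative defaults)
    """
    # Pairwise differences a_k - c_i -> (B, n_a, N, H, W)
    diff = a.unsqueeze(2) - c.unsqueeze(1)
    dist = diff.abs().pow(gamma)
    if kind == "inverse":
        delta = (diff / (1.0 + alpha * dist)).sum(dim=1)
    else:  # exponential attractor variant
        delta = (diff * torch.exp(-alpha * dist)).sum(dim=1)
    return c + delta  # refined bin centers, same shape as c
```

Applied once per decoder level, this pulls every bin center toward nearby attractors while distant attractors exert little influence, which yields the contraction behavior described above.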
The network further predicts a unimodal, ordering-aware bin probability distribution using a binomial model:

$$p(k) = \binom{N-1}{k}\, q^{k}\,(1-q)^{N-1-k}, \qquad k = 0, \ldots, N-1,$$

where $q \in (0, 1)$, and optionally a temperature $t$ (applied as a softmax over $\log p(k)/t$), are learned for each pixel. The metric depth at pixel $(u, v)$ is computed as a weighted sum of the refined bin centers:

$$d(u, v) = \sum_{k=0}^{N-1} p(k)\, c_k(u, v).$$
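A sketch of the probability computation under these definitions follows; the per-pixel parameters `q` and `t` are assumed to come from a small prediction head, and `torch.lgamma` is used as a numerically stable stand-in for the Stirling-approximated log-factorials mentioned in Section 3.

```python
import math
import torch

def log_binomial_probs(q, t, n_bins):
    """Unimodal, order-preserving probabilities over n_bins ordered depth bins.

    q: (B, 1, H, W) per-pixel binomial parameter in (0, 1)
    t: (B, 1, H, W) per-pixel temperature
    """
    q = q.clamp(1e-6, 1.0 - 1e-6)  # keep the logs finite
    k = torch.arange(n_bins, dtype=q.dtype, device=q.device).view(1, -1, 1, 1)
    n = n_bins - 1
    # log C(n, k) from log-gamma: lgamma(n+1) - lgamma(k+1) - lgamma(n-k+1)
    log_coef = math.lgamma(n + 1.0) - torch.lgamma(k + 1.0) - torch.lgamma(n - k + 1.0)
    log_p = log_coef + k * torch.log(q) + (n - k) * torch.log(1.0 - q)
    return torch.softmax(log_p / t, dim=1)  # (B, n_bins, H, W)

def metric_depth(bin_centers, probs):
    """Final metric depth as the probability-weighted sum of bin centers."""
    return (bin_centers * probs).sum(dim=1, keepdim=True)  # (B, 1, H, W)
```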
Multi-domain deployments (e.g., NYU for indoor, KITTI for outdoor) are handled by associating separate metric heads with a latent classifier for automatic routing at inference.
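Routing can be pictured as a lightweight classifier over pooled bottleneck features that selects which metric head produces the output. The sketch below is illustrative: the layer sizes, pooling choice, and two-head (indoor/outdoor) setup are assumptions, not the repository's exact architecture.

```python
import torch
import torch.nn as nn

class RoutedMetricHeads(nn.Module):
    """Route each image to one of several domain-specific metric heads."""

    def __init__(self, feat_dim, heads):
        super().__init__()
        self.heads = nn.ModuleList(heads)        # e.g., [nyu_head, kitti_head]
        self.router = nn.Sequential(             # latent classifier (MLP)
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, len(heads)),
        )

    def forward(self, bottleneck):               # bottleneck: (B, C, H, W)
        pooled = bottleneck.mean(dim=(2, 3))     # global average pool -> (B, C)
        idx = self.router(pooled).argmax(dim=1)  # hard routing at inference
        # Run the selected head sample-by-sample for clarity.
        outs = [self.heads[int(i)](bottleneck[b:b + 1]).squeeze(0)
                for b, i in enumerate(idx)]
        return torch.stack(outs)
```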
2. Training Methodology
Training is organized in two stages:
- Stage 1: Relative depth pre-training on twelve datasets (indoor, outdoor, synthetic, real) using scale-invariant log losses, so that the encoder–decoder learns geometric abstractions invariant to global scale or dataset-specific biases (a loss sketch in code follows this list):

$$\mathcal{L} = \alpha \sqrt{\frac{1}{T}\sum_i g_i^{2} - \frac{\lambda}{T^{2}}\Big(\sum_i g_i\Big)^{2}}, \qquad g_i = \log \hat{d}_i - \log d_i,$$

where $T$ is the number of valid pixels and $\lambda$ trades off variance and mean terms (commonly $\lambda = 0.85$, $\alpha = 10$).
- Stage 2: Metric fine-tuning on datasets with available ground-truth metric depth (typically NYU Depth v2 and KITTI). Each domain-specific head is fine-tuned separately or jointly, with a router determining selection during inference.
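As referenced in Stage 1, here is a minimal sketch of the scale-invariant log loss; the masking convention and the commonly used constants are assumptions carried over from the equation above.

```python
import torch

def silog_loss(pred, target, mask, lam=0.85, alpha=10.0, eps=1e-7):
    """Scale-invariant log loss (SILog) over valid pixels.

    pred, target: (B, 1, H, W) depth maps; mask: same shape, True where valid.
    """
    g = torch.log(pred[mask] + eps) - torch.log(target[mask] + eps)
    n = g.numel()
    var_term = (g ** 2).sum() / n - lam * (g.sum() / n) ** 2
    return alpha * torch.sqrt(var_term)
```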
The combination of diverse pre-training and minimal metric fine-tuning enables accurate scale reintroduction on target domains, while preserving the generalization benefits of training on heterogeneous data.
3. The Metric Bins Module: Attractor Layers and Probability Formulation
The metric bins module predicts an initial set of candidate bin centers at the decoder bottleneck and refines them by integrating multi-scale attractor information. Rather than hierarchically splitting bins across decoder layers (the strategy of LocalBins, a successor to AdaBins), ZoeDepth keeps the number of bins fixed and contracts them through attractor layers at each decoder level.
The final depth prediction uses a binomial probability model instead of softmax, explicitly exploiting the ordinal structure of depth and generating sharply unimodal, order-preserving distributions. For numerical stability, log-probabilities are computed via Stirling’s approximation, and temperature scaling is used as necessary.
This module is central to ZoeDepth’s ability to reintroduce metric scale in a way that is efficient, robust against varying depth distributions across domains, and resistant to common binning artifacts.
4. Quantitative and Qualitative Performance
On the NYU Depth v2 dataset, ZoeDepth in configuration ZoeD-X-N (no relative pre-training) achieves REL = 0.082, improving by 13.7% over NeWCRFs (REL = 0.095). With full relative pre-training (ZoeD-M12-N), REL drops to 0.075 (a total improvement of nearly 21%). Threshold accuracies $\delta_1$, $\delta_2$, and $\delta_3$ are consistently higher (e.g., $\delta_1 = 0.955$) than prior baselines.
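For reference, REL and the $\delta_i$ threshold accuracies are the standard monocular-depth metrics; the snippet below is a generic implementation, not code from the ZoeDepth repository.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Absolute relative error (REL) and delta threshold accuracies."""
    pred, gt = pred.ravel(), gt.ravel()
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    rel = np.mean(np.abs(pred - gt) / gt)
    ratio = np.maximum(pred / gt, gt / pred)
    deltas = [np.mean(ratio < 1.25 ** i) for i in (1, 2, 3)]  # delta_1..delta_3
    return rel, deltas
```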
In multi-domain joint training (ZoeD-M12-NK), performance drop on NYU is modest, while zero-shot accuracy on eight unseen datasets is unprecedented, including up to 976% improvement over NeWCRFs on the DIML Outdoor dataset. ZoeDepth demonstrates robust cross-domain transfer, delivering reliable metric depth predictions without further fine-tuning.
5. Zero-Shot Generalization and Routing
The combination of relative depth pre-training and domain-specific metric heads allows ZoeDepth to generalize to entirely unseen data distributions, scene geometries, and camera parameters. Automatic routing via a latent classifier or simple multi-layer perceptron (MLP) determines the optimal metric head, avoiding the pitfalls of applying a generic model to both indoor and outdoor scenes.
Adaptive bin refinement and attractor-based contraction allow the model to accommodate dataset and domain shifts while maintaining reliable metric predictions and boundary delineation.
6. Technical Implementation and Mathematical Details
Key technical elements include:
- Transformer-based encoder
- Multi-scale feature aggregation in the decoder
- Attractor-based bin refinement (inverse and exponential variants)
- Binomial probability modeling for depth prediction
- Modular domain routing for zero-shot adaptation
The main equations governing predictions are the weighted-sum depth

$$d(u, v) = \sum_{k=0}^{N-1} p(k)\, c_k(u, v),$$

the attractor bin refinement (inverse variant)

$$\Delta c_i = \sum_{k=1}^{n_a} \frac{a_k - c_i}{1 + \alpha\,|a_k - c_i|^{\gamma}},$$

and the bin probabilities

$$p(k) = \binom{N-1}{k}\, q^{k}\,(1-q)^{N-1-k}.$$
7. Applications, Limitations, and Comparisons
ZoeDepth’s dense metric depth estimation supports diverse applications: autonomous navigation, augmented reality, virtual reality, scene reconstruction, and photo editing. Its robust zero-shot generalization makes it especially suitable for environments where domain-specific fine-tuning is infeasible.
Subsequent research identified competitive and superior approaches in specialized settings (e.g., ECoDepth’s ViT-based diffusion conditioning (Patni et al., 27 Mar 2024), PatchFusion’s high-resolution fusion (Li et al., 2023), and wildlife monitoring benchmarks (Niccoli et al., 6 Oct 2025)). While ZoeDepth offers rapid inference (0.17 s per image), its spatial accuracy can degrade in challenging outdoor environments—recording a mean absolute error of 3.087 m on wildlife camera trap images, compared to 0.454 m from Depth Anything V2 (Niccoli et al., 6 Oct 2025). Median-based depth aggregation is recommended for improved robustness in noisy conditions.
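Where per-pixel noise matters, as in the wildlife setting above, aggregating depth over a detected object by the median rather than the mean is a simple robustness measure. The snippet assumes a boolean object mask over the depth map; it illustrates the recommendation rather than reproducing any benchmark's exact protocol.

```python
import numpy as np

def object_distance(depth_map, mask):
    """Robust distance estimate for a detected object: median depth in its mask."""
    values = depth_map[mask]
    return float(np.median(values)) if values.size else float("nan")
```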
The metric bins module and latent routing are often adapted for downstream tasks such as 3D occupancy prediction (Zheng et al., 17 Jul 2024), NeRF-based scene editing (Guo et al., 1 May 2024), and scene enhancement in degraded domains (Huang et al., 27 Apr 2024).
Summary Table: ZoeDepth Configuration Performance
| Configuration | Relative Pre-training | Metric Fine-tuning | REL (NYU) | Zero-shot Generalization |
|---|---|---|---|---|
| ZoeD-X-N | No | NYU | 0.082 | Moderate |
| ZoeD-M12-N | M12 (12 datasets) | NYU | 0.075 | Excellent |
| ZoeD-M12-NK | M12 | NYU + KITTI | ~0.078 | Unprecedented |
ZoeDepth’s modular approach and technical innovations have established new standards for cross-domain metric depth estimation, with the caveat that accuracy may be limited in highly unstructured outdoor environments unless further domain adaptation is performed.