ZoeDepth: Unified Monocular Depth Estimation

Updated 13 October 2025
  • ZoeDepth is a monocular depth estimation framework that unifies scale-invariant pre-training with domain-specific metric fine-tuning to achieve both robust zero-shot transfer and precise metric predictions.
  • It employs a two-stage transformer-based encoder–decoder architecture that integrates innovative multi-scale bin refinement and attractor layer mechanisms for accurate depth estimation.
  • Its modular design with automatic domain routing and binomial probability modeling enables reliable performance across diverse applications such as autonomous navigation and augmented reality.

ZoeDepth is a monocular depth estimation framework designed to bridge the gap between relative depth generalization and metric depth precision. It introduces a unified approach that combines scale-invariant pre-training on diverse datasets with domain-specific metric fine-tuning, delivering both robust zero-shot transfer capabilities and high-fidelity metric depth predictions. Key technical innovations include a modular architecture featuring a shared encoder–decoder (based on transformer backbones), novel multi-scale bin-based metric heads, attractor layer refinement, and an automatic domain routing mechanism. ZoeDepth achieves competitive or state-of-the-art results on varied benchmarks, including NYU Depth v2 and KITTI, and demonstrates unprecedented zero-shot generalization to unseen indoor and outdoor datasets.

1. Design Principles and Architecture

ZoeDepth employs a two-stage encoder–decoder structure. The initial pre-training stage utilizes a MiDaS-style network with a transformer backbone (e.g., BEiT-L, Swin Transformer) trained on scale-invariant losses across twelve heterogeneous datasets. The decoder aggregates multi-scale image features, enabling the extraction of robust geometric representations agnostic to scene type.

Domain-specific metric heads, each realized via a metric bins module, are attached during fine-tuning. Metric heads are lightweight, facilitating efficient training and inference. Each head predicts a discrete set of depth bin centers $c_i(k)$, which are subsequently refined using several attractor points $a_k$ drawn from decoder feature pyramids. The bin refinement process operates as follows (inverse attractor variant):

$$\Delta c_i = \sum_{k=1}^{n_a} \frac{a_k - c_i}{1 + \alpha\,|a_k - c_i|^\gamma}$$

where $\alpha$ and $\gamma$ control attraction strength. This multi-scale adjustment contracts bin centers towards higher accuracy without explicit hierarchical splitting.
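The inverse attractor update is compact enough to express directly. The following PyTorch sketch assumes dense per-pixel bin centers and attractor points stored as 4-D tensors; the tensor layout and the default values of $\alpha$ and $\gamma$ are illustrative assumptions rather than values taken from the paper.

```python
import torch

def inverse_attractor_update(bin_centers, attractors, alpha=300.0, gamma=2.0):
    """Contract depth bin centers toward attractor points (inverse attractor variant).

    bin_centers: (B, N_bins, H, W) current bin centers c_i
    attractors:  (B, n_a, H, W)    attractor points a_k from decoder features
    alpha, gamma: attraction-strength hyperparameters (illustrative defaults)
    """
    # Pairwise differences a_k - c_i with shape (B, n_a, N_bins, H, W)
    diff = attractors.unsqueeze(2) - bin_centers.unsqueeze(1)
    # Delta c_i = sum_k (a_k - c_i) / (1 + alpha * |a_k - c_i|^gamma)
    delta = (diff / (1.0 + alpha * diff.abs() ** gamma)).sum(dim=1)
    return bin_centers + delta
```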

The network further predicts a unimodal, ordering-aware bin probability distribution using a binomial model:

$$p(k; N, q) = \binom{N}{k} q^{k} (1-q)^{N-k}$$

where $q$, and optionally a temperature $t$, are learned for each pixel. The metric depth at pixel $i$ is computed as a weighted sum:

$$d(i) = \sum_{k=1}^{N_{\mathrm{total}}} p_i(k)\, c_i(k)$$
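A minimal sketch of this prediction step in PyTorch, computing the binomial coefficient with exact log-factorials via `torch.lgamma` (the tensor layout, clamping, and omission of temperature scaling are assumptions; the Stirling-based stability trick described in Section 3 is shown separately there):

```python
import torch

def binomial_weighted_depth(bin_centers, q, eps=1e-6):
    """Metric depth per pixel as a binomial-weighted sum of bin centers.

    bin_centers: (B, N_total, H, W) refined bin centers c_i(k)
    q:           (B, 1, H, W)       per-pixel binomial parameter in (0, 1)
    """
    n_total = bin_centers.shape[1]
    N = n_total - 1                                    # binomial trials, k = 0..N
    q = q.clamp(eps, 1.0 - eps)
    k = torch.arange(n_total, dtype=q.dtype, device=q.device).view(1, -1, 1, 1)
    # log C(N, k) from exact log-factorials; lgamma(n + 1) = log(n!)
    log_coef = (torch.lgamma(torch.tensor(N + 1.0, device=q.device))
                - torch.lgamma(k + 1.0) - torch.lgamma(N - k + 1.0))
    log_p = log_coef + k * torch.log(q) + (N - k) * torch.log(1.0 - q)
    p = log_p.exp()                                    # sums to ~1 over the bin axis
    return (p * bin_centers).sum(dim=1, keepdim=True)  # d(i) = sum_k p_i(k) c_i(k)
```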

Multi-domain deployments (e.g., NYU for indoor, KITTI for outdoor) are handled by associating separate metric heads with a latent classifier for automatic routing at inference.

2. Training Methodology

Training is organized in two stages:

  • Stage 1: Relative depth pre-training on twelve datasets (indoor, outdoor, synthetic, real) using a scale-invariant log loss; a minimal implementation sketch of this loss follows the list below. This enables the encoder–decoder to learn geometric abstractions invariant to global scale or dataset-specific biases:

$$L_{\mathrm{rel}} = \frac{1}{n}\sum_i (g_i)^2 - \frac{1}{n^2}\Big(\sum_i g_i\Big)^2, \quad g_i = \log d_i - \log d^*_i$$

  • Stage 2: Metric fine-tuning on datasets with available ground-truth metric depth (typically NYU Depth v2 and KITTI). Each domain-specific head is fine-tuned separately or jointly, with a router determining selection during inference.
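As referenced above, a minimal PyTorch sketch of the Stage 1 scale-invariant log loss, assuming a boolean mask of valid ground-truth pixels (the masking convention is an assumption, and any additional variance-weighting coefficient used in practice is omitted):

```python
import torch

def scale_invariant_log_loss(pred, target, valid_mask):
    """L_rel = (1/n) * sum_i g_i^2 - (1/n^2) * (sum_i g_i)^2, with g_i = log d_i - log d*_i.

    pred, target: (B, 1, H, W) predicted and ground-truth depth
    valid_mask:   (B, 1, H, W) boolean mask of pixels with valid ground truth
    """
    g = torch.log(pred[valid_mask]) - torch.log(target[valid_mask])
    n = g.numel()
    return (g ** 2).sum() / n - (g.sum() ** 2) / (n ** 2)
```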

The combination of diverse pre-training and minimal metric fine-tuning enables accurate scale reintroduction on target domains, while preserving the generalization benefits of training on heterogeneous data.

3. The Metric Bins Module: Attractor Layers and Probability Formulation

The metric bins module refines a set of candidate bin centers $c_i(k)$ by integrating multi-scale attractor information. Rather than hierarchically splitting bins (as in AdaBins), ZoeDepth predicts all bins at the decoder bottleneck and contracts them through attractor layers at subsequent decoder scales.

The final depth prediction uses a binomial probability model instead of softmax, explicitly exploiting the ordinal structure of depth and generating sharply unimodal, order-preserving distributions. For numerical stability, log-probabilities are computed via Stirling’s approximation, and temperature scaling is used as necessary.
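For illustration, a sketch of the log-space binomial evaluation using Stirling's approximation together with temperature scaling; the clamping constants and the exact form in which the temperature enters are assumptions, not values taken from the paper:

```python
import math
import torch

def log_binomial_stirling(N, k, q, t=1.0, eps=1e-6):
    """Log binomial pmf via Stirling's approximation, with temperature scaling.

    N: number of trials (scalar); k: bin indices (tensor); q: per-pixel parameter;
    t: temperature used to sharpen or flatten the distribution (assumption).
    """
    def stirling_log_factorial(n):
        # Stirling: log n! ~= n log n - n + 0.5 log(2 pi n)
        return n * torch.log(n) - n + 0.5 * torch.log(2.0 * math.pi * n)

    k = k.clamp(eps, N - eps)            # avoid log(0) at the endpoints
    q = q.clamp(eps, 1.0 - eps)
    N_t = torch.as_tensor(float(N), dtype=q.dtype, device=q.device)
    log_coef = (stirling_log_factorial(N_t)
                - stirling_log_factorial(k) - stirling_log_factorial(N_t - k))
    log_p = log_coef + k * torch.log(q) + (N_t - k) * torch.log(1.0 - q)
    return log_p / t                     # normalize afterwards, e.g. with a softmax
```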

This module is central to ZoeDepth’s ability to reintroduce metric scale in a way that is efficient, robust against varying depth distributions across domains, and resistant to common binning artifacts.

4. Quantitative and Qualitative Performance

On the NYU Depth v2 dataset, ZoeDepth in configuration ZoeD-X-N (no relative pre-training) achieves REL = 0.082, improving by 13.7% over NeWCRFs (REL = 0.095). With full relative pre-training (ZoeD-M12-N), REL drops to 0.075 (a total improvement of nearly 21%). Threshold accuracies $\delta_1$, $\delta_2$, and $\delta_3$ are consistently higher (e.g., $\delta_1 = 0.955$) than prior baselines.

In multi-domain joint training (ZoeD-M12-NK), performance drop on NYU is modest, while zero-shot accuracy on eight unseen datasets is unprecedented, including up to 976% improvement over NeWCRFs on the DIML Outdoor dataset. ZoeDepth demonstrates robust cross-domain transfer, delivering reliable metric depth predictions without further fine-tuning.

5. Zero-Shot Generalization and Routing

The combination of relative depth pre-training and domain-specific metric heads allows ZoeDepth to generalize to entirely unseen data distributions, scene geometries, and camera parameters. Automatic routing via a latent classifier or simple multi-layer perceptron (MLP) determines the optimal metric head, avoiding the pitfalls of applying a generic model to both indoor and outdoor scenes.
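A minimal sketch of such a router, assuming a pooled bottleneck feature vector and two metric heads; the feature dimension, hidden width, and pooling strategy are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DomainRouter(nn.Module):
    """Latent classifier that selects a domain-specific metric head per image."""

    def __init__(self, feat_dim=1024, num_domains=2, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, num_domains),
        )

    def forward(self, bottleneck):
        # Global-average-pool the (B, C, H, W) bottleneck, then classify the domain
        pooled = bottleneck.mean(dim=(2, 3))
        return self.mlp(pooled).argmax(dim=-1)   # index of the metric head to apply
```

At inference, the predicted index simply selects which metric bins head (e.g., indoor vs. outdoor) produces the final depth map.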

Adaptive bin refinement and attractor-based contraction allow the model to accommodate dataset and domain shifts while maintaining reliable metric predictions and boundary delineation.

6. Technical Implementation and Mathematical Details

Key technical elements include:

  • Transformer-based encoder
  • Multi-scale feature aggregation in the decoder
  • Attractor-based bin refinement (inverse and exponential variants)
  • Binomial probability modeling for depth prediction
  • Modular domain routing for zero-shot adaptation

The main equations governing predictions are:

$$d(i) = \sum_{k=1}^{N_{\mathrm{total}}} p_i(k)\, c_i(k)$$

and attractor bin refinement:

$$\Delta c_i = \sum_{k=1}^{n_a} \frac{a_k - c_i}{1 + \alpha\,|a_k - c_i|^\gamma}$$

with bin probabilities given by:

$$p(k; N, q) = \binom{N}{k} q^{k} (1-q)^{N-k}$$

7. Applications, Limitations, and Comparisons

ZoeDepth’s dense metric depth estimation supports diverse applications: autonomous navigation, augmented reality, virtual reality, scene reconstruction, and photo editing. Its robust zero-shot generalization makes it especially suitable for environments where domain-specific fine-tuning is infeasible.
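In practice, pretrained ZoeDepth models are commonly loaded through the torch.hub entry points published with the reference repository. The snippet below is a sketch along those lines; entry-point and method names such as `ZoeD_N` and `infer_pil` follow the public repository but may vary across versions:

```python
import torch
from PIL import Image

# Load a pretrained indoor (NYU-tuned) model via torch.hub; ZoeD_K / ZoeD_NK
# select the KITTI-tuned and multi-domain variants respectively.
model = torch.hub.load("isl-org/ZoeDepth", "ZoeD_N", pretrained=True).eval()

image = Image.open("room.jpg").convert("RGB")
depth = model.infer_pil(image)   # dense metric depth in meters, as a numpy array

print(depth.shape, depth.min(), depth.max())
```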

Subsequent research identified competitive and superior approaches in specialized settings (e.g., ECoDepth’s ViT-based diffusion conditioning (Patni et al., 27 Mar 2024), PatchFusion’s high-resolution fusion (Li et al., 2023), and wildlife monitoring benchmarks (Niccoli et al., 6 Oct 2025)). While ZoeDepth offers rapid inference (0.17 s per image), its spatial accuracy can degrade in challenging outdoor environments—recording a mean absolute error of 3.087 m on wildlife camera trap images, compared to 0.454 m from Depth Anything V2 (Niccoli et al., 6 Oct 2025). Median-based depth aggregation is recommended for improved robustness in noisy conditions.
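Median aggregation of this kind is straightforward to apply on top of any dense depth map. A generic sketch follows; the function name and the source of the instance mask are assumptions, not tied to a specific benchmark:

```python
import numpy as np

def object_distance(depth_map: np.ndarray, instance_mask: np.ndarray) -> float:
    """Median depth inside an object mask; more robust than the mean to
    boundary bleeding and outlier pixels in noisy outdoor scenes."""
    values = depth_map[instance_mask]
    return float(np.median(values)) if values.size else float("nan")
```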

The metric bins module and latent routing are often adapted for downstream tasks such as 3D occupancy prediction (Zheng et al., 17 Jul 2024), NeRF-based scene editing (Guo et al., 1 May 2024), and scene enhancement in degraded domains (Huang et al., 27 Apr 2024).

Summary Table: ZoeDepth Configuration Performance

| Configuration | Relative Pre-training | Metric Fine-tuning | REL (NYU) | Zero-shot Generalization |
|---------------|-----------------------|--------------------|-----------|--------------------------|
| ZoeD-X-N      | No                    | NYU                | 0.082     | Moderate                 |
| ZoeD-M12-N    | M12 (12 datasets)     | NYU                | 0.075     | Excellent                |
| ZoeD-M12-NK   | M12 (12 datasets)     | NYU + KITTI        | ~0.078    | Unprecedented            |

ZoeDepth’s modular approach and technical innovations have established new standards for cross-domain metric depth estimation, with the caveat that accuracy may be limited in highly unstructured outdoor environments unless further domain adaptation is performed.
