
Zero-Shot Depth Model

Updated 12 July 2025
  • Zero-shot depth models are frameworks that estimate 3D structure from images without any target domain depth supervision.
  • They leverage meta-learning, geometric priors, diffusion models, and self-supervised techniques to generalize across diverse camera setups and environments.
  • These models enable robust applications in autonomous driving, augmented reality, and 3D reconstruction by eliminating the need for expensive ground-truth depth data.

A zero-shot depth model is an algorithmic framework capable of estimating depth information from images (or sets of images) in scenarios where no ground-truth depth supervision is available for the target domain, task, or camera setup. This paradigm addresses the central challenge of generalization: producing accurate, metric or relative (affine-invariant) depth maps on previously unseen environments, camera settings, or even novel application domains, without further supervised training or explicit domain adaptation. Contemporary zero-shot depth models build upon meta-learning, geometric priors, powerful vision-language pretraining, diffusion models, and self-supervised adaptation, empowering a new class of transferable and broadly deployable 3D perception systems.

1. Foundations and Motivation

Zero-shot depth estimation arises from the need to recover 3D geometric information in settings devoid of direct ground-truth depth annotations, sensor ground-truth, or in the presence of significant domain shifts across datasets, camera models, or scene types.

The concept traces back to meta-learning ideas, exemplified by TTNet (1903.01092), which regresses the parameters of a model for an unseen (zero-shot) task—such as depth estimation—solely by leveraging the learned parameters from related tasks with available supervision and a task correlation matrix. Here, the learning process is inspired by cognitive science principles: for instance, depth perception can be inferred by relating to experienced modalities (e.g., self-motion) even in the absence of explicit supervision.

Modern advances in zero-shot depth extend this idea to large neural models trained on heterogeneous or unlabeled data, domain-adaptive representation learning, foundation vision transformers, and probabilistic generative models capable of capturing diverse distributions over depth, normal, and 3D structure.

2. Key Methodological Advances

2.1 Meta-Manifold Regression and Task Correlation Transfer

TTNet formalized zero-shot depth as a meta-learning problem, introducing a meta-learner function $\mathcal{F}$ that predicts the model parameters of a zero-shot task (e.g., depth estimation) based on encoder parameters from a set of known tasks and a task correlation matrix $\Gamma$. The core loss combines fitting the predicted parameters to the meta-manifold with a data consistency loss ensuring the output encoder-decoder pair yields plausible predictions on real image data.

The architecture comprises per-task branches merged into a common block. The correlation matrix $\Gamma$, derived via crowd-sourced votes and the Dawid–Skene algorithm, injects human-perceived task relationships, guiding the meta-learner to meaningful parameter mixtures. During inference for a novel task (such as depth), the transfer mode uses the parameters and correlations from the known tasks to regress high-performing weights for the unseen domain.
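
The sketch below illustrates this idea under simplifying assumptions: it regresses zero-shot task parameters as a correlation-weighted mixture of known-task encoder parameters. All names, shapes, and the mixing rule are hypothetical, not the TTNet implementation.

```python
import torch
import torch.nn as nn

# Hypothetical illustration of meta-manifold regression in the spirit of TTNet:
# the meta-learner maps known-task encoder parameters and a task-correlation row
# gamma to parameters for the unseen (zero-shot) task.

class MetaLearner(nn.Module):
    def __init__(self, param_dim: int, num_known_tasks: int):
        super().__init__()
        # One branch per known task, merged into a common block (simplified).
        self.branches = nn.ModuleList(
            [nn.Linear(param_dim, param_dim) for _ in range(num_known_tasks)]
        )
        self.common = nn.Linear(param_dim, param_dim)

    def forward(self, known_params: torch.Tensor, gamma: torch.Tensor) -> torch.Tensor:
        # known_params: (num_known_tasks, param_dim) flattened encoder weights
        # gamma: (num_known_tasks,) correlations between known tasks and the zero-shot task
        mixed = sum(g * branch(p) for g, branch, p in zip(gamma, self.branches, known_params))
        return self.common(mixed)  # predicted parameters for the unseen task

# Training would combine a parameter-fitting loss on the meta-manifold with a
# data-consistency loss on real images; both are omitted in this sketch.
```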

2.2 Domain-Generalization via Large-Scale Pretraining and Architectural Innovations

Beginning with ZoeDepth (2302.12288), zero-shot models increasingly rely on two-stage pipelines: an encoder-decoder backbone is pretrained on relative depth over a diverse suite of datasets, then fine-tuned (via domain-specific lightweight modules) on metric depth for select domains. Domain-wise metric heads, coupled with an MLP-based latent classifier, enable automatic routing during inference, ensuring strong performance across both seen and unseen data. Key mathematical underpinnings involve ordinal regression over learned depth bins, with attractor-based adjustments to preserve monotonicity and domain-aligned scale calibration.
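
A minimal sketch of two of these ingredients, assuming illustrative shapes and omitting the attractor-based bin refinement: depth read out as a probability-weighted sum of learned bin centers, and a latent classifier that routes each image to a domain-specific metric head. This is not the ZoeDepth code, only the general idea.

```python
import torch
import torch.nn.functional as F

def depth_from_bins(bin_logits: torch.Tensor, bin_centers: torch.Tensor) -> torch.Tensor:
    # bin_logits: (B, K, H, W) per-pixel scores over K depth bins
    # bin_centers: (B, K) adaptive metric bin centers (kept monotone by attractor updates)
    probs = F.softmax(bin_logits, dim=1)                      # (B, K, H, W)
    return torch.einsum("bkhw,bk->bhw", probs, bin_centers)   # metric depth map

def route_domain(latent: torch.Tensor, classifier: torch.nn.Module) -> torch.Tensor:
    # latent: (B, D) bottleneck features; the classifier decides which metric head
    # (e.g., indoor vs. outdoor) processes the sample at inference time.
    return classifier(latent).argmax(dim=-1)
```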

Models like ZeroDepth (2306.17253) further introduce explicit geometric embeddings at the input level, encoding each pixel's physical 3D ray as a function of the camera intrinsics, and decoupling encoder and decoder via a variational latent representation with quantified uncertainty. This approach ensures that the network learns scale priors invariant to the appearance or specific calibration of the input dataset.
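
The following is a hedged sketch of such camera-aware geometric embeddings: each pixel is mapped to its normalized 3D viewing ray from the intrinsics $K$, so the network conditions on physical geometry rather than raw pixel coordinates. It mirrors the idea, not the ZeroDepth implementation.

```python
import torch

def pixel_rays(K: torch.Tensor, height: int, width: int) -> torch.Tensor:
    # K: (3, 3) camera intrinsics; returns (H, W, 3) unit viewing-ray directions.
    ys, xs = torch.meshgrid(
        torch.arange(height, dtype=torch.float32),
        torch.arange(width, dtype=torch.float32),
        indexing="ij",
    )
    ones = torch.ones_like(xs)
    pix = torch.stack([xs + 0.5, ys + 0.5, ones], dim=-1)  # homogeneous pixel centers
    rays = pix @ torch.inverse(K).T                         # back-project to camera space
    return rays / rays.norm(dim=-1, keepdim=True)           # normalize to unit length
```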

Field-of-View (FOV)–conditioned diffusion models (2312.13252) train generative denoisers in a log-scale depth space, conditioning on explicit FOV signals (e.g., $\tan(\theta/2)$, where $\theta$ is the vertical FOV) and leveraging aggressive FOV augmentation during training. This parameterization, combined with a generic architecture, enables consistent joint modeling of indoor and outdoor scenes, overcoming the scale bias typical of fixed-intrinsic training.
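
Two illustrative helpers, with assumed names and depth ranges (not the DMD implementation): computing the $\tan(\theta/2)$ conditioning scalar from the focal length, and mapping metric depth into a normalized log-scale target.

```python
import math
import torch

def fov_conditioning(fy: float, image_height: int) -> float:
    # Vertical FOV from the focal length, then the tan(theta/2) scalar fed to the denoiser.
    theta = 2.0 * math.atan(image_height / (2.0 * fy))
    return math.tan(theta / 2.0)

def to_log_depth(depth: torch.Tensor, d_min: float = 0.5, d_max: float = 80.0) -> torch.Tensor:
    # Map metric depth into a normalized log-scale target in [0, 1] (range values are illustrative).
    depth = depth.clamp(d_min, d_max)
    return (depth.log() - math.log(d_min)) / (math.log(d_max) - math.log(d_min))
```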

Metric3Dv2 (2404.15506) resolves metric ambiguity across thousands of camera models via a canonical camera space transformation, scaling images or ground-truth depths to a shared reference focal length, and a joint depth–normal recurrent optimization module that distills geometric detail and consistency between tasks.
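
A minimal sketch of the canonical camera space idea (constants and names are assumptions): depths are expressed as if captured at a shared canonical focal length during training, and rescaled back to the real camera at inference, removing per-camera metric ambiguity.

```python
import torch

def to_canonical_depth(depth: torch.Tensor, focal: float, canonical_focal: float = 1000.0) -> torch.Tensor:
    # Express ground-truth depth as if captured by the canonical camera.
    return depth * canonical_focal / focal

def from_canonical_depth(canonical_depth: torch.Tensor, focal: float, canonical_focal: float = 1000.0) -> torch.Tensor:
    # Invert the transform at inference to recover metric depth for the real camera.
    return canonical_depth * focal / canonical_focal
```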

Recent approaches such as GRIN (2409.09896) and Marigold-DC (2412.13389) employ diffusion models directly on pixel-level or latent representations, integrating 3D geometric priors via camera-aware positional encodings. These frameworks enable robust generalization to sparse or irregular depth annotations and maintain scale-awareness across a wide range of domains.

2.3 Self-Supervised Adaptation, Temporal Consistency, and Few-Shot Rescaling

Several recent models exploit test-time or few-shot adaptation, where per-sample or per-scene scaling parameters are estimated from small sets of sparse depth measurements (e.g., from low-resolution LiDAR, SfM, or auxiliary sensors) (2412.14103). Linear regression with robust estimators (e.g., RANSAC) fits the scale and offset that map an affine-invariant disparity map (produced by models like Depth Anything) to metric depth, as sketched below. This avoids costly and potentially domain-overfitting fine-tuning while preserving cross-domain generalization.
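
A minimal sketch of this rescaling step, assuming an inverse-depth (disparity) convention and illustrative thresholds; variable names are hypothetical and not tied to a specific implementation.

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

def fit_scale_shift(pred_disparity: np.ndarray, sparse_metric_depth: np.ndarray, mask: np.ndarray):
    # pred_disparity, sparse_metric_depth, mask: (H, W); mask marks valid sparse measurements.
    x = pred_disparity[mask].reshape(-1, 1)
    y = 1.0 / sparse_metric_depth[mask]               # fit in inverse-depth (disparity) space
    ransac = RANSACRegressor(residual_threshold=0.05)  # threshold is illustrative
    ransac.fit(x, y)
    scale = float(ransac.estimator_.coef_[0])
    shift = float(ransac.estimator_.intercept_)
    return scale, shift

def apply_rescaling(pred_disparity: np.ndarray, scale: float, shift: float) -> np.ndarray:
    metric_disparity = np.clip(scale * pred_disparity + shift, 1e-6, None)
    return 1.0 / metric_disparity                      # dense metric depth map
```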

In video domains, test-time adaptation strategies enforce depth consistency across augmented frames, leveraging depth-aware modulation layers and self-supervised objectives to refine representations without additional human labels (2403.04258). Temporal attention architectures, as in Buffer Anytime (2411.17249), can inject smoothness or regularization across frames, enhancing temporal consistency in video depth prediction using only single-image priors and optical flow-based stabilization losses.
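
As a generic stand-in for such self-supervised objectives (not a specific method from the papers above), the sketch below enforces agreement between depth predicted on a frame and on an augmented view of it at test time; the flip augmentation and L1 penalty are assumptions.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, frame: torch.Tensor) -> torch.Tensor:
    # frame: (B, 3, H, W). Augmentation here is a simple horizontal flip (illustrative).
    with torch.no_grad():
        ref = model(frame)                       # (B, 1, H, W) reference depth
    aug_pred = model(torch.flip(frame, dims=[-1]))
    aug_pred = torch.flip(aug_pred, dims=[-1])   # undo the flip to align with the reference
    return F.l1_loss(aug_pred, ref)
```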

2.4 Probabilistic Multi-Cue Fusion and Novel Representations

Some domains, like ground vehicle monocular depth, exploit scene geometry to disentangle camera intrinsics from depth cues. GVDepth (2412.06080) introduces a vertical canonical representation, predicting ground-contact vertical positions and combining them with object-size cues. These dual estimates are fused adaptively, weighted by learned local uncertainty, yielding robust predictions across varying camera setups from single-dataset training.
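
One simple way to realize uncertainty-weighted fusion of two depth hypotheses is inverse-variance weighting, sketched below; this is a simplification of the adaptive fusion idea, not the GVDepth implementation.

```python
import torch

def fuse_depth(d_ground: torch.Tensor, var_ground: torch.Tensor,
               d_size: torch.Tensor, var_size: torch.Tensor) -> torch.Tensor:
    # Per-pixel inverse-variance fusion of a ground-contact estimate and an object-size estimate.
    w_ground = 1.0 / var_ground.clamp(min=1e-6)
    w_size = 1.0 / var_size.clamp(min=1e-6)
    return (w_ground * d_ground + w_size * d_size) / (w_ground + w_size)
```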

In high-resolution or patchwise inference, approaches such as Patch Refine Once (PRO) (2503.22351) enforce grouped patch consistency and bias-free masking during training, overcoming artifacts of patchwise prediction and mitigating dataset-specific annotation bias.

3. Mathematical Formulations and Supervision Strategies

Zero-shot depth models are marked by architectural and loss function innovations that accommodate diverse input domains and supervision regimes. Foundational models integrate:

  • Meta-manifold regression losses over model parameters, guided by explicit correlation matrices (1903.01092).
  • Joint loss functions combining ordinal regression, bin adjustment, and deep attractor modules (2302.12288).
  • Variational and probabilistic losses over latent variables, with explicit regularization via KL divergence and uncertainty quantification (2306.17253).
  • Robust L1 and log-scale parameterization for depth prediction in generative diffusion models, often involving field-of-view conditioning and FOV-based positional encoding (2312.13252, 2409.09896).
  • Weighted loss schemes informed by per-pixel label confidence (2409.05442), and optimization loops that enforce sparse measurement constraints as hard or soft guidance during denoising (2412.13389, 2502.06338).
  • Consistency losses over grouped overlapping patches, coupled with masking to avoid overfitting to synthetic dataset biases (2503.22351).

Loss composition is often tailored to harness the strengths of pre-trained vision-language or generative priors while aligning them with new domain constraints during inference.
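
As one representative example of affine-invariant supervision used during mixed-dataset relative-depth pretraining, the sketch below aligns the prediction to the label with a closed-form least-squares scale and shift before measuring an L1 residual. This is a generic formulation, not the exact loss of any paper cited above.

```python
import torch

def ssi_l1_loss(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # pred, target, mask: (H, W); mask marks pixels with valid labels.
    p, t = pred[mask], target[mask]
    A = torch.stack([p, torch.ones_like(p)], dim=-1)       # design matrix [pred, 1]
    sol = torch.linalg.lstsq(A, t.unsqueeze(-1)).solution  # least-squares scale and shift
    aligned = A @ sol                                       # prediction aligned to the label
    return (aligned.squeeze(-1) - t).abs().mean()
```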

4. Evaluation, Generalization, and Performance

State-of-the-art zero-shot depth models are evaluated across diverse benchmarks, domains, and camera types, using metrics including root mean squared error (RMSE), absolute relative error (AbsRel), scale-invariant errors, and threshold-based accuracy indices (such as $\delta_1$).
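
For reference, the standard metrics can be written as below; valid-pixel masking and any median or least-squares alignment protocol depend on the benchmark and are omitted here.

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)   # threshold accuracy delta_1
    return {"RMSE": rmse, "AbsRel": abs_rel, "delta1": delta1}
```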

Key empirical findings include:

  • TTNet achieved lower RMSE and ARD than prior supervised models on the Taskonomy dataset despite having no ground-truth for zero-shot target tasks (1903.01092).
  • ZoeDepth, with 12-dataset relative pretraining and metric fine-tuning, yields a 21% decrease in REL on NYUv2 compared to previous best methods, and delivers robust performance on both indoor and outdoor unseen domains (2302.12288).
  • DMD (field-of-view-conditioned diffusion) achieves 25–33% REL reductions over previous state-of-the-art on zero-shot indoor and outdoor datasets (2312.13252).
  • Metric3Dv2 establishes first-place accuracy on multiple benchmarks after training on over 16 million images from thousands of camera models, generalizing to wild images for both depth and normal estimation (2404.15506).

Recent models broaden the generalization envelope, enabling:

  • Cross-domain transfer to unique camera geometries (fisheye, 360°, ERP) (2501.02464),
  • Effective scaling with minimal adaptation via test-time sensor rescaling (2412.14103),
  • Robust performance on medical imaging (endoscopy) through robust self-learning from noisy labels (2409.05442),
  • Self-supervised adaptation to unknown metric scales via photometric novel view synthesis (2503.07125),
  • Zero-shot multi-view fusion with transformer-based cost volume processing (2503.22430),
  • and guided diffusion methods for domain-adaptive depth completion from sparse cues (2412.13389, 2502.06338).

5. Practical Applications and Implications

Zero-shot depth models support a vast array of real-world applications:

  • Autonomous Driving and Robotics: Accurate, metric, cross-domain depth estimation enables obstacle avoidance, mapping, planning, and safe operation in unseen environments, even with changes in camera geometry (2412.06080, 2306.17253).
  • Augmented and Virtual Reality: Reliable, dense depth enables seamless scene integration, real-time AR overlays, and environment understanding across devices with different sensors (2501.02464).
  • Medical Imaging (Endoscopy): Domain-general models such as EndoOmni (2409.05442) allow localization, navigation, and AR visualization in minimally invasive procedures without requiring extensive labeled data for each device or anatomical site.
  • 3D Scene Reconstruction and Digital Twins: Zero-shot multi-view inference and robust patchwise high-resolution depth estimation facilitate fast and memory-efficient 3D reconstruction in both synthetic and real-world domains (2503.22351, 2503.22430).
  • 3D Shape Completion and Novel View Synthesis: Transformer-based models such as RaySt3R (2506.05285) deliver geometrically consistent 3D object reconstructions from one or few views, benefitting robotics and XR.

By lowering the reliance on expensive, domain-specific ground-truth depth data, these methods accelerate the deployment of dense 3D perception in both legacy and novel camera setups.

6. Challenges, Limitations, and Future Directions

Despite substantial progress, zero-shot depth estimation faces enduring challenges:

  • Metric Ambiguity and Camera Calibration: The reliance on accurate camera metadata (focal length, orientation, FOV) remains a limiting factor. Canonicalization via explicit parameter transformation (2404.15506) and geometric cue fusion (2412.06080) have improved cross-setup generalization, but inaccuracies in metadata or extreme camera geometries (e.g., unknown fisheye distortion) require further research.
  • Handling Label Noise and Sparse Supervision: Self-learning from noisy pseudo-labels (2409.05442) and guided completion with sparse depth cues (2412.13389, 2502.06338) offer robust alternatives, yet integrating richer forms of uncertainty and selectively filtering unreliable supervision represent ongoing research areas.
  • Resolution and Scalability: Efficient high-resolution prediction (2503.22351), as well as scalable multi-view fusion (2503.22430), require innovations in memory and computational efficiency to meet real-time or embedded deployment requirements.
  • Temporal Consistency and Dynamic Scenes: The extension from still images to temporally consistent video depth (with or without supervision) is an active area of exploration (2411.17249), particularly for applications in SLAM, robotics, and real-time AR.
  • Generalization to Novel Physics: Innovative approaches have repurposed defocus blur cues for metric scale recovery (2505.17358), offering new avenues for leveraging physical phenomena (focus, motion, multi-aperture) in zero-shot settings.

A plausible implication is that future zero-shot depth models may unify generative, self-supervised, and geometric approaches under a foundation architecture, with improved interpretability, fast adaptation, and strong theoretical generalization guarantees. The scalability of training on large, mixed-domain datasets, combined with flexible domain-adaptive components, is likely to define new standards for 3D perception in broad real-world scenarios.
