Zero-Shot Depth Estimation
- Zero-shot depth estimation is the task of predicting dense, per-pixel depth maps in unseen environments without retraining or new ground truth supervision.
- It leverages large-scale foundation models, geometric embeddings, and domain adaptation to address challenges like scale ambiguity and varying camera intrinsics.
- Applications include robotics, autonomous vehicles, AR/VR, medical imaging, and video analysis, enabling robust depth predictions across diverse sensor setups.
Zero-shot depth estimation refers to the task of predicting depth (typically dense, per-pixel maps) for input images or videos in domains, environments, or camera configurations entirely unseen during training—specifically, without requiring retraining, fine-tuning, or new ground truth supervision. The problem encompasses both monocular and multi-view approaches, and extends to fusion with additional modalities (e.g., radar), structured environments (robotics, AR/VR), and varied sensor types. This article surveys the conceptual foundations, technical pathways, and representative methodologies shaping the field, focusing on approaches that are validated by rigorous cross-dataset or cross-domain experiments and establish competitive metric or relative depth estimation performance across diverse scenarios.
1. Foundations and Problem Definition
Zero-shot depth estimation is motivated by the need to generate reliable depth predictions for previously unseen deployment conditions, such as new sensor setups, novel scene types, or domains with little or no annotated data. This is particularly relevant in practical settings like robotics, autonomous vehicles, AR/VR, and medical imaging, where real-world visual variability and the prohibitive cost of collecting dense ground truth depth necessitate models that generalize beyond the training distribution.
Central to the problem is the dichotomy between relative (affine-invariant) depth estimation, where depth is accurate only up to an unknown global scale and shift, and metric (absolute) depth estimation, where predicted depth values correspond to physically meaningful distances. Zero-shot models must overcome not only appearance and content shifts but also camera-intrinsic and scale ambiguities, variation in geometric priors, and sensor distortions encountered in diverse real-world deployments.
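To make this dichotomy concrete, the following sketch (illustrative, not taken from any cited paper) shows the least-squares scale-and-shift alignment that converts an affine-invariant prediction into metric depth when ground truth is available; the same alignment underlies standard affine-invariant evaluation protocols.

```python
import numpy as np

def align_scale_shift(pred, gt, mask):
    """Least-squares scale s and shift t such that s * pred + t ~ gt.

    pred, gt: (H, W) depth maps; mask: (H, W) bool of valid GT pixels.
    Solves min_{s,t} ||s * p + t - g||^2 in closed form via lstsq.
    """
    p, g = pred[mask], gt[mask]
    A = np.stack([p, np.ones_like(p)], axis=1)  # (N, 2) design matrix
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return s, t

# Toy check: a relative prediction off by scale 2 and shift 0.5.
gt = np.random.uniform(1.0, 10.0, size=(32, 32))
pred = (gt - 0.5) / 2.0
s, t = align_scale_shift(pred, gt, np.ones_like(gt, dtype=bool))
print(s, t)  # ~2.0, ~0.5; s * pred + t recovers metric depth
```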
2. Approaches Based on Generalization of Monocular Models
Monocular zero-shot depth estimators generally follow one of two paradigms: exploiting large-scale, heterogeneous multi-dataset training to build foundation models with strong generalization, or deploying domain adaptation and scale recovery modules for test-time calibration.
Large-scale Foundation Models:
Models such as ZoeDepth (Bhat et al., 2023) and Metric3Dv2 (Hu et al., 22 Mar 2024) build on diverse, multi-domain datasets and employ architectures with modular, domain-adaptive metric heads, or explicit canonical space transformations that decouple the metric ambiguity induced by camera intrinsics. For example, Metric3Dv2 normalizes the metric scale of all training images via a Canonical Camera Space Transformation, which allows a single model to recover metric depth for arbitrary unseen camera parameters by re-scaling predictions as a function of focal length and principal point. ZoeDepth combines relative-depth pre-training (across 12+ datasets) with lightweight, domain-specialized metric regressors (with depth-bin centers refined via attractor layers) that are automatically routed during inference.
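A minimal sketch of the canonical-space idea, assuming the Metric3Dv2-style recipe of rescaling depth by the ratio of focal lengths (the canonical focal length below is an arbitrary placeholder):

```python
F_CANONICAL = 1000.0  # placeholder canonical focal length, in pixels

def to_canonical(depth_metric, focal_px):
    # Training: rescale labels as if every image were captured by the
    # canonical camera, removing focal-length-induced scale ambiguity.
    return depth_metric * (F_CANONICAL / focal_px)

def from_canonical(depth_canonical, focal_px):
    # Inference: map the canonical-space prediction back to the metric
    # scale of an arbitrary, possibly never-seen camera.
    return depth_canonical * (focal_px / F_CANONICAL)
```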
Geometric Embeddings and Variational Modelling:
ZeroDepth (Guizilini et al., 2023) introduces input-level geometric embeddings encoding pixel-wise viewing direction via camera intrinsics; by concatenating these with traditional image features and decoupling encoder/decoder via a variational latent representation, the model supports robust, domain-agnostic metric scale priors.
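A sketch of such a camera-aware input embedding, assuming the common formulation in which each pixel is back-projected through the inverse intrinsics to obtain its unit viewing direction:

```python
import numpy as np

def ray_direction_embedding(H, W, K):
    """Per-pixel unit viewing directions from intrinsics K (3x3).

    Back-projects each homogeneous pixel (u, v, 1) through K^{-1} and
    normalizes, yielding an (H, W, 3) field that can be concatenated
    with image features as a camera-aware input embedding.
    """
    u, v = np.meshgrid(np.arange(W), np.arange(H))        # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)      # (H, W, 3)
    rays = pix @ np.linalg.inv(K).T                       # camera-frame rays
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
emb = ray_direction_embedding(480, 640, K)  # (480, 640, 3)
```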
Domain and Camera Adaptation:
DAC (Guo et al., 5 Jan 2025) unifies various camera modalities (perspective, fisheye, 360°) by converting every input into an equirectangular projection (ERP) and augmenting training with synthetic ERP patches, pitch-aware normalization, FoV alignment, and multi-resolution augmentation. This enables models trained strictly on perspective data to transfer robustly to extreme wide-FoV sensors without any training data from those domains.
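The following sketch (nearest-neighbor sampling, simplified conventions, not DAC's actual pipeline) illustrates the core ERP resampling step: each equirectangular pixel defines a latitude/longitude ray, which is projected through the perspective intrinsics to fetch a source pixel; anything outside the source FoV stays empty.

```python
import numpy as np

def perspective_to_erp(img, K, erp_h, erp_w):
    """Resample a perspective image onto an equirectangular (ERP) grid."""
    H, W = img.shape[:2]
    lon = (np.arange(erp_w) / erp_w - 0.5) * 2.0 * np.pi  # [-pi, pi)
    lat = (0.5 - np.arange(erp_h) / erp_h) * np.pi        # (pi/2, -pi/2]
    lon, lat = np.meshgrid(lon, lat)
    # Unit ray per ERP pixel; z is the perspective camera's optical axis.
    x = np.cos(lat) * np.sin(lon)
    y = -np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    zc = np.where(z > 1e-6, z, 1.0)   # guard divide-by-zero; masked below
    u = K[0, 0] * x / zc + K[0, 2]
    v = K[1, 1] * y / zc + K[1, 2]
    valid = (z > 1e-6) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    erp = np.zeros((erp_h, erp_w) + img.shape[2:], dtype=img.dtype)
    erp[valid] = img[v[valid].astype(int), u[valid].astype(int)]
    return erp

K = np.array([[300.0, 0.0, 320.0], [0.0, 300.0, 240.0], [0.0, 0.0, 1.0]])
erp = perspective_to_erp(np.random.rand(480, 640, 3), K, 256, 512)
```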
3. Supervision Transfer, Self-supervision, and External Cues
Scale/Shift Calibration via Source Domains:
Transfer-based methods (Dana et al., 2023; Yang et al., 9 Sep 2025) exploit the observed linear relationship (up to an undetermined scalar) between self-supervised monocular depth predictions and ground truth depth in any given camera setting. By estimating a global scaling factor (via robust regression, e.g., the Theil–Sen estimator) from images/lenses with available metric labels and transferring this scalar to target domains without labels, the model recovers metric depth in zero-shot fashion. These methods are lightweight and require only a small amount of GT or sparse 3D feature mapping for calibration (as in the visual-inertial rescaling approach for aerial navigation).
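A sketch of the calibration-transfer step, assuming metric labels on a source camera and none on the target; scipy's Theil–Sen implementation stands in for whichever robust regressor a given method uses, and the data here is synthetic:

```python
import numpy as np
from scipy.stats import theilslopes

# Source setting: paired (prediction, GT) samples with unknown scale 3.1.
rng = np.random.default_rng(0)
gt_src = rng.uniform(1.0, 20.0, 500)
pred_src = gt_src / 3.1 + rng.normal(0.0, 0.05, 500)

# Median-of-pairwise-slopes fit is robust to outlier pixels; for a pure
# scale relationship the intercept is expected to be ~0.
scale, intercept, _, _ = theilslopes(gt_src, pred_src)
print(scale)  # ~3.1

# Zero-shot transfer: apply the single scalar to the unlabeled target.
pred_target = rng.uniform(0.3, 6.0, (240, 320))  # stand-in prediction
metric_target = scale * pred_target
```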
Test-time Physics-based Adaptation:
HybridDepth (Ganj et al., 26 Jul 2024) and recent extensions based on Marigold (Talegaonkar et al., 23 May 2025) integrate focus and defocus cues at inference, leveraging the depth-dependent blur induced by a finite aperture to estimate the metric scale. In the Marigold-based approach, both an all-in-focus and a defocus-blurred image are captured, and the affine depth scaling and noise latents of the pretrained relative-depth diffusion model are optimized so that the observed blurred image matches the one predicted by a differentiable thin-lens forward model, resolving scale ambiguity without retraining.
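A compact sketch of the underlying thin-lens forward model, with a brute-force grid search standing in for the gradient-based scale optimization of the cited methods; lens parameters and the observed blur map are synthetic placeholders:

```python
import numpy as np

def coc_diameter(depth_m, focus_dist_m, focal_len_m, aperture_m):
    """Thin-lens circle-of-confusion diameter on the sensor (meters).

    c = A * f * |d - d_f| / (d * (d_f - f)): blur grows away from the
    focus plane at a rate set by metric depth, which is what makes
    defocus a usable absolute-scale cue.
    """
    d, df, f, A = depth_m, focus_dist_m, focal_len_m, aperture_m
    return A * f * np.abs(d - df) / (d * (df - f))

def recover_scale(rel_depth, observed_blur, df, f, A, scales):
    """Pick the global scale whose predicted blur best matches an
    observed per-pixel blur map (a simplified stand-in for image-space
    matching through a differentiable renderer)."""
    errs = [np.mean((coc_diameter(s * rel_depth, df, f, A)
                     - observed_blur) ** 2) for s in scales]
    return scales[int(np.argmin(errs))]

# Toy check: blur synthesized at true scale 2.5 is recovered.
rel = np.random.uniform(0.5, 4.0, (64, 64))
blur = coc_diameter(2.5 * rel, 2.0, 0.05, 0.01)
print(recover_scale(rel, blur, 2.0, 0.05, 0.01, np.linspace(1.0, 4.0, 61)))
```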
Fusion with Complementary Modalities:
HybridDepth (Ganj et al., 26 Jul 2024) also demonstrates global and local alignment of single-image priors (relative depth) with metric depth from depth-from-focus on focal stacks, yielding robust metric outputs across device heterogeneity. In radar-camera scenarios, SA-RCD (Zhang et al., 5 Jun 2025) leverages detailed zero-shot monocular depth maps to guide structure-aware enhancement of sparser radar depth, utilizing residual fusion and multi-scale attention for final metric estimation.
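As a rough sketch of what global-plus-local alignment can look like (a simplified stand-in, not HybridDepth's actual module), a globally aligned relative map can be corrected patch by patch wherever sparse metric measurements, e.g. from depth-from-focus, are available:

```python
import numpy as np

def local_scale_map(rel, metric, mask, patch=32):
    """Per-patch scale correction after a global fit (illustrative).

    rel: globally aligned relative depth (assumed positive);
    metric: sparse metric depth; mask: where metric values are valid.
    Each patch gets the median metric/rel ratio over its valid pixels,
    falling back to 1.0 (keep the global alignment) when none exist.
    """
    H, W = rel.shape
    scale = np.ones_like(rel)
    for i in range(0, H, patch):
        for j in range(0, W, patch):
            m = mask[i:i+patch, j:j+patch]
            if m.any():
                r = rel[i:i+patch, j:j+patch][m]
                g = metric[i:i+patch, j:j+patch][m]
                scale[i:i+patch, j:j+patch] = np.median(g / r)
    return rel * scale
```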
4. Diffusion Models and Novel Architectures
Recent proposals leverage score-based generative models for zero-shot depth prediction. DMD (Saxena et al., 2023) and GRIN (Guizilini et al., 15 Sep 2024) use field-of-view (FOV) conditioning and log-scale parameterization to handle a broad range of scene scales and camera intrinsics. GRIN, using a pixel-level diffusion process within a Recurrent Interface Network, combines 3D geometric positional encodings and local/global image features, efficiently addressing sparse and unstructured GT regimes often encountered during real-world deployment. Diffusion-based completion models such as Marigold-DC (Viola et al., 18 Dec 2024) allow dense monocular depth priors to be anchored by sparse depth observations through iterative, test-time optimization, yielding strong generalization with minimal guidance.
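A minimal sketch of log-scale depth parameterization, assuming a placeholder working range: compressing depth logarithmically lets a single noise schedule cover indoor scenes at a few meters and outdoor scenes at hundreds of meters.

```python
import numpy as np

D_MIN, D_MAX = 0.1, 300.0  # placeholder metric working range (meters)

def depth_to_diffusion_target(depth_m):
    """Map metric depth to [-1, 1] in log space so that indoor and
    outdoor scenes occupy comparable dynamic range for the denoiser."""
    x = np.log(np.clip(depth_m, D_MIN, D_MAX))
    lo, hi = np.log(D_MIN), np.log(D_MAX)
    return 2.0 * (x - lo) / (hi - lo) - 1.0

def diffusion_target_to_depth(t):
    """Inverse map applied to the denoised sample at inference."""
    lo, hi = np.log(D_MIN), np.log(D_MAX)
    return np.exp((t + 1.0) / 2.0 * (hi - lo) + lo)
```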
Temporal and Video Depth Extensions:
Buffer Anytime (Kuang et al., 26 Nov 2024) introduces a zero-shot video depth estimation framework by regularizing single-image depth model predictions with temporal consistency via optical flow and augmenting backbones with temporal attention blocks. This design significantly improves temporal coherence and geometric buffer stability in video applications—without access to paired video–geometry training data.
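A simplified sketch of flow-based temporal regularization (nearest-neighbor warping for brevity; the validity mask is assumed to come from a forward-backward flow consistency check):

```python
import numpy as np

def flow_warp(depth_next, flow):
    """Warp frame t+1's depth into frame t via backward optical flow.

    flow[y, x] = (dx, dy) points from a pixel in frame t to its match
    in frame t+1; sampling is nearest-neighbor for brevity.
    """
    H, W = depth_next.shape
    x, y = np.meshgrid(np.arange(W), np.arange(H))
    xs = np.clip(np.round(x + flow[..., 0]).astype(int), 0, W - 1)
    ys = np.clip(np.round(y + flow[..., 1]).astype(int), 0, H - 1)
    return depth_next[ys, xs]

def temporal_consistency_loss(depth_t, depth_next, flow, valid):
    """L1 penalty between frame t's depth and the flow-warped depth of
    frame t+1, restricted to pixels with reliable flow. Serves as a
    regularizer that discourages frame-to-frame flicker."""
    warped = flow_warp(depth_next, flow)
    return np.abs(depth_t - warped)[valid].mean()
```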
5. Evaluation, Performance, and Use-Case-Specific Models
Zero-shot depth estimators are commonly evaluated on cross-dataset transfer with no additional adaptation to the target domain. Key metrics include AbsRel, RMSE, log₁₀, and threshold accuracies (δ₁, δ₂, δ₃).
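These metric definitions are standard across the cited benchmarks; a compact reference implementation:

```python
import numpy as np

def depth_metrics(pred, gt, mask):
    """Standard zero-shot depth metrics over valid pixels.

    AbsRel = mean(|pred - gt| / gt); RMSE = sqrt(mean((pred - gt)^2));
    log10 = mean(|log10(pred) - log10(gt)|);
    delta_k = fraction of pixels with max(pred/gt, gt/pred) < 1.25^k.
    """
    p, g = pred[mask], gt[mask]
    thresh = np.maximum(p / g, g / p)
    return {
        "AbsRel": np.mean(np.abs(p - g) / g),
        "RMSE": np.sqrt(np.mean((p - g) ** 2)),
        "log10": np.mean(np.abs(np.log10(p) - np.log10(g))),
        "delta1": np.mean(thresh < 1.25),
        "delta2": np.mean(thresh < 1.25 ** 2),
        "delta3": np.mean(thresh < 1.25 ** 3),
    }
```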
Representative results:
- GRIN (Guizilini et al., 15 Sep 2024) achieves consistently lower error rates across eight public datasets, outperforming previous SOTA (Metric3Dv2, ZoeDepth) in both indoor and outdoor benchmarks.
- DMD (Saxena et al., 2023) reports 25% and 33% REL reduction against prior SOTA for indoor and outdoor benchmark transfers, respectively, with very few denoising steps.
- EndoOmni (Tian et al., 9 Sep 2024) achieves 33–34% lower AbsRel than previous models in zero-shot relative depth estimation for endoscopic and other medical domains.
- DAC (Guo et al., 5 Jan 2025) reports δ₁ improvements of up to 50% on fisheye and 360° datasets when generalizing from a perspective-trained backbone.
Zero-shot models are increasingly being extended to:
- Robotics and aerial navigation, with real-time, compute-constrained quadrotor deployment demonstrated using visual-inertial scale recovery (Yang et al., 9 Sep 2025).
- High-resolution inference via the PRO framework (Kwon et al., 28 Mar 2025), which uses grouped patch consistency and bias-free masking to mitigate discontinuity artifacts across patch seams and filter out unreliable synthetic supervision.
- Multi-view stereo generalization via MVSA (Izquierdo et al., 28 Mar 2025), which applies transformer-based fusion of monocular and geometric cues, adaptive cost volumes, and cascaded depth range estimation for arbitrary source-view configurations and scenes.
6. Limitations, Open Challenges, and Future Directions
Despite significant progress, several persistent challenges are noted:
| Challenge | Manifestation | Strategies / Open Problems |
| --- | --- | --- |
| Scale ambiguity / camera-specific coupling | Poor generalization when focal length or elevation changes | Canonical transforms, ERP, defocus cues |
| Limited metric calibration | Scarcity of GT/calibration data in new domains | On-the-fly scaling with sparse 3D features |
| Discretization / statistical bias | Coarse binning, bias in synthetic datasets | Attractor bins, bias-free masking, robust losses |
| Computational burden | High memory for diffusion/test-time tuning | Efficient RIN, reduced denoising steps, patch refinement |
| Sparse/unstructured data regimes | Limited labeled supervision | Pixel-level diffusion, robust self-guided training |
Key open directions include fully unsupervised or self-adaptive metric depth recovery in the wild, more unified frameworks for multi-modal and multi-view settings, advanced uncertainty modeling in fusion schemes, and efficient architectures for real-time, high-resolution, or embedded applications.
7. Applications and Impact
Zero-shot depth estimators, by removing the need for exhaustive per-domain retraining and calibration, unlock and accelerate wide deployment for:
- Autonomous vehicles and mobile robotics (robust navigation under changing hardware, sensor, and environmental conditions)
- Augmented/virtual reality (realistic scene understanding with arbitrary cameras)
- Medical imaging (cross-patient and cross-instrument transfer for tasks like endoscopic navigation)
- 3D vision and digital twins (single-image metrology, scene reconstruction, object completion)
- Video analysis and post-production (temporal geometric buffer extraction from unlabeled videos)
- Scientific imaging (microscopy, planetary exploration) where ground-truth collection is prohibitive
The field continues to evolve toward more robust, efficient, and domain-agnostic solutions that integrate both data-driven and physics-based priors, ultimately blurring the boundaries between relative, metric, multi-modal, and multi-view zero-shot depth estimation paradigms.