Zero-shot Monocular Depth Estimation
- Zero-shot MDE is the process of inferring dense depth maps from a single RGB image in new domains without task-specific fine-tuning.
- It integrates canonical camera representations, probabilistic multi-cue fusion, and hybrid physics-guided inference to overcome metric ambiguity and domain shifts.
- Empirical benchmarks demonstrate robust transfer across indoor, outdoor, and specialized datasets, while advanced loss functions and training regimens ensure scale invariance and detail preservation.
Zero-shot monocular depth estimation (MDE) refers to inferring a dense depth map, with metric or relative scale, from a single RGB image in a novel domain—without task-specific fine-tuning, additional supervision, or paired depth labels from the new environment. The goal is robust cross-domain generalization, enabling models trained on large-scale datasets, synthetic data, or self-supervision to immediately operate on previously unseen scenes, camera intrinsics, or environmental conditions. Recent research advances address the inherent ill-posedness, domain shift, and metric ambiguity in zero-shot settings with models integrating geometric invariants, canonical camera normalization, probabilistic cue fusion, foundation models, and hybrid physics-guided inference.
1. Canonical Camera Representations and Geometric Disentanglement
Metric ambiguity—stemming from entanglement of camera intrinsics (e.g., focal length) and extrinsics (pose, height, pitch) with pixel image coordinates—obstructs generalization in MDE. Canonicalization transforms decouple these variables, enabling robust transfer across devices and datasets.
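- Focal Canonical Transform (FCT): Removes the entanglement between focal length and predicted depth. Networks regress a canonical-focal depth map $D_c$, defined for a fixed canonical focal length $f_c$, which maps to metric depth via $D = (f / f_c)\, D_c$ for the true focal length $f$ (Koledic et al., 2024).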
- Vertical Canonical Transform (VCT): Introduced in GVDepth, removes extrinsic ambiguities tied to camera height and pitch. Networks predict ground-contact vertical coordinates, which are inverted to metric depth through ray-ground-plane intersection.
The transformation enables per-pixel stable estimation regardless of the specific survey rig or sensor (Koledic et al., 2024).
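The ray-ground-plane inversion behind the VCT can be illustrated with a minimal sketch. It assumes a pinhole camera with the image y-axis pointing down, a perfectly flat ground plane, a known camera height, and a pitch angle that is positive when the camera tilts downward. The function name `ground_plane_depth` and its parameters are hypothetical and are not taken from GVDepth.

```python
import numpy as np

def ground_plane_depth(v, fy, cy, cam_height, pitch=0.0):
    """Depth (z in the camera frame) of the ground point seen at image row v.

    Assumes a flat ground plane, y-down image coordinates, and a pitch that is
    positive when the camera looks downward. Rows at or above the horizon
    never intersect the ground, so they are returned as +inf.
    """
    v = np.asarray(v, dtype=float)
    # Ray through row v is (., (v - cy) / fy, 1); the world "down" direction in
    # camera coordinates is (0, cos(pitch), sin(pitch)).
    denom = (v - cy) * np.cos(pitch) + fy * np.sin(pitch)
    with np.errstate(divide="ignore", invalid="ignore"):
        depth = cam_height * fy / denom
    return np.where(denom > 1e-9, depth, np.inf)

# Camera 1.5 m above the road, level pitch, principal point at row 180.
print(ground_plane_depth([252, 540], fy=720.0, cy=180.0, cam_height=1.5))
```

Because the result depends only on the row index and the camera geometry, the same image position maps to different depths on rigs with different heights or pitches, which is the ambiguity the canonical transform is meant to remove.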
Metric3Dv2 similarly virtualizes all training data into a single canonical pinhole camera model with a fixed focal length $f_c$, either by rescaling the depth labels by $f_c / f$ (label scaling) or by warping the image and its intrinsics before prediction, with the inverse transform applied at inference (Hu et al., 2024).
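The label-scaling variant of camera canonicalization can be sketched in a few lines. The reasoning is that an object of physical height $H$ at depth $Z$ spans $fH/Z$ pixels, so the same pixel extent under a canonical focal length corresponds to a depth scaled by the focal ratio. The constant `F_CANONICAL` and both function names below are illustrative assumptions, not values or APIs from Metric3Dv2.

```python
import numpy as np

F_CANONICAL = 1000.0  # hypothetical canonical focal length in pixels

def to_canonical_depth(depth_metric, focal, f_c=F_CANONICAL):
    """Training side: rescale metric labels as if taken with focal length f_c."""
    return np.asarray(depth_metric, dtype=float) * (f_c / focal)

def to_metric_depth(depth_canonical, focal, f_c=F_CANONICAL):
    """Inference side: undo the canonical scaling using the true focal length."""
    return np.asarray(depth_canonical, dtype=float) * (focal / f_c)

labels = np.array([5.0, 10.0, 40.0])          # metric depths in metres
canon = to_canonical_depth(labels, focal=500.0)
print(canon)                                   # depths seen by the network
print(to_metric_depth(canon, focal=500.0))     # recovers the original labels
```

The round trip only works when the true focal length is known at inference, which is why canonicalization methods depend on camera intrinsics being available.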
2. Probabilistic Multi-Cue Fusion and Model Architectures
Zero-shot transfer benefits from integrating multiple geometric cues and adaptively fusing their predictions according to learned uncertainties.
- GVDepth Probabilistic Fusion: Fuses object-size and vertical image position cues via a two-branch architecture. Each branch regresses both a canonical depth map and a per-pixel log-uncertainty, producing one depth-uncertainty pair for the size cue and one for the vertical cue. Final metric depth is computed as a per-pixel uncertainty-weighted average of the two branch outputs.
This lets the model emphasize whichever cue is more reliable in each region (e.g., vertical cue near camera, object-size at medium range) (Koledic et al., 2024).
- Diffusion-based Refiners: Plug-and-play frameworks (e.g., BetterDepth, PrimeDepth) operate in the latent space of a pretrained diffusion backbone (Stable Diffusion), encoding both image features and coarse depth for iterative refinement. BetterDepth applies global scale/shift pre-alignment and local patch masking, so that detail enhancement is trustworthy and scene-consistent (Zhang et al., 2024). PrimeDepth leverages intermediate latent representations (“preimage”) after one denoising step, producing highly efficient and generalizable predictions (Zavadski et al., 2024).
- Joint Depth-Normal Feedback: Metric3Dv2 performs recurrent refinement of depth and surface normal predictions via ConvGRU blocks. This bidirectional geometric coupling supplies additional supervision signal even in the absence of explicit labels, improving overall structure fidelity (Hu et al., 2024).
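Uncertainty-weighted fusion of two depth cues can be sketched as follows. The sketch assumes each branch outputs a depth map and a log standard deviation, and it uses normalized inverse-variance weights, which is one common choice and not necessarily the exact weighting used in GVDepth. The function name `fuse_depth` is hypothetical.

```python
import numpy as np

def fuse_depth(d_size, log_sigma_size, d_vert, log_sigma_vert):
    """Per-pixel fusion of two depth maps with inverse-variance weights.

    Assumption: each log-uncertainty is log(sigma), so precision is
    exp(-2 * log_sigma). Weights are normalized to sum to one per pixel.
    """
    depths = np.stack([d_size, d_vert]).astype(float)
    log_prec = -2.0 * np.stack([log_sigma_size, log_sigma_vert]).astype(float)
    # Subtract the per-pixel maximum before exponentiating for stability.
    log_prec -= log_prec.max(axis=0, keepdims=True)
    weights = np.exp(log_prec)
    weights /= weights.sum(axis=0, keepdims=True)
    return (weights * depths).sum(axis=0), weights

d_size = np.array([10.0, 10.0])
d_vert = np.array([20.0, 20.0])
# Pixel 0: both cues equally confident. Pixel 1: vertical cue very uncertain.
fused, w = fuse_depth(d_size, np.array([0.0, 0.0]), d_vert, np.array([0.0, 10.0]))
print(fused)
```

With equal uncertainties the result is the plain average, and as one branch's uncertainty grows its contribution vanishes, which matches the qualitative behaviour described above.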
3. Training Regimens and Loss Functions for Zero-Shot Generalization
Zero-shot MDE models rely on mixed-source training, specialized loss functions, and augmentation strategies engineered to transfer geometric priors, scale invariance, and robustness.
- Mixed Data and Robust Objectives: Scale- and shift-invariant per-image regression losses (e.g., image-level normalized regression, SiLog, $\mathcal{L}_{ssitrim}$) permit effective mixing of metric, up-to-scale, and ordinal/relative data. Pareto-optimal multi-objective optimization further balances gradient contributions from disparate sources (Yin et al., 2022; Ranftl et al., 2019).
- Geometric and Photometric Augmentations: Random resizing, cropping, and ray jitter simulate varying focal lengths and sensor geometries, preventing overfitting to a single camera configuration and forcing models to internalize invariants (Koledic et al., 2024, Guizilini et al., 2023).
- Uncertainty-Weighted and Consistency Losses: GVDepth employs uncertainty-weighted Laplace likelihood losses per cue, as well as geometric-consistency penalties enforcing prediction invariance over random augmentations (Koledic et al., 2024). Local patch masking (BetterDepth) ensures refinement is focused on informative and trustworthy regions (Zhang et al., 2024).
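To make the notion of a scale-invariant objective concrete, the following is a minimal NumPy sketch of the scale-invariant log (SiLog) loss in its widely used form, $\frac{1}{n}\sum_i g_i^2 - \lambda \left(\frac{1}{n}\sum_i g_i\right)^2$ with $g_i = \log \hat{d}_i - \log d_i$. The function name, the `eps` clamp, and the validity mask are assumptions made for illustration; individual papers differ in the value of $\lambda$ and in extra scaling factors.

```python
import numpy as np

def silog_loss(pred, gt, lam=1.0, eps=1e-6):
    """Scale-invariant log loss over pixels with valid (positive) ground truth.

    lam = 1 makes the loss fully invariant to a global scale on pred;
    lam = 0 reduces it to a plain mean squared error in log space.
    """
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = gt > 0
    g = np.log(np.clip(pred[valid], eps, None)) - np.log(gt[valid])
    return float(np.mean(g ** 2) - lam * np.mean(g) ** 2)

gt = np.array([1.0, 2.0, 4.0, 8.0])
print(silog_loss(3.0 * gt, gt))           # global scale error is not penalized
print(silog_loss(3.0 * gt, gt, lam=0.0))  # log-space MSE penalizes it
```

Because a global scale factor adds the same constant to every $g_i$, the two terms cancel when $\lambda = 1$, which is what allows up-to-scale and metric datasets to share one objective.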
For self-supervised frameworks like F²Depth, supervision stems from photometric patchwise losses and multi-scale feature synthesis losses, maximizing flow–depth geometric consistency in the absence of ground-truth depths (Guo et al., 2024).
4. Modalities, Hybrid Models, and Physics-Based Inference
To resolve metric ambiguity beyond learned priors, hybrid models exploit additional physical cues or sparse geometric priors at inference.
- Defocus-deblurring for Metricization: Marigold can be “repurposed” at test time using defocus blur cues: two images (one all-in-focus, one defocused) provide the basis for optimizing a global scale and per-pixel depth against a differentiable, physics-based blur forward model, delivering dense metric depth without retraining (Talegaonkar et al., 23 May 2025).
- Sparsity Adaptive Fusion (SPADE): Combines pre-trained relative depth networks (e.g., DepthAnythingV2) with sparse, metric priors from SLAM/SfM/local stereo. A two-stage system globally aligns predictions to sparsely sampled depths, followed by local per-pixel scale correction via cascade Conv-Deformable Transformer (CCDT) blocks. This yields real-time, dense metric depth estimation in challenging underwater domains, robust to out-of-distribution signals and prior sparsity (Zhang et al., 29 Oct 2025).
- Visual-Inertial Rescaling for Aerial Platforms: Visual-inertial rescaling fits a monotonic spline between learned per-pixel disparities and those from sparse 3D feature points extracted by VINS. This produces a globally consistent metric depth map, enabling on-board, real-time collision avoidance for UAVs in unseen environments—all without any direct depth calibration (Yang et al., 9 Sep 2025).
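For the defocus cue in the first item above, the reason blur carries metric information can be seen from the textbook thin-lens model, in which the blur-circle diameter depends on absolute object distance. The sketch below implements that standard formula only; it is an assumption-laden illustration and not the blur forward model of Talegaonkar et al. The function name and parameter names are hypothetical, and all lengths are in metres.

```python
import numpy as np

def circle_of_confusion(depth, focus_dist, focal_len, aperture_diam):
    """Thin-lens blur-circle diameter on the sensor for objects at `depth`.

    c = A * |s - s_f| / s * f / (s_f - f), with aperture diameter A,
    object distance s, focus distance s_f and focal length f.
    """
    depth = np.asarray(depth, dtype=float)
    return (aperture_diam * np.abs(depth - focus_dist) / depth
            * focal_len / (focus_dist - focal_len))

# 50 mm lens at f/2 (25 mm aperture), focused at 2 m.
depths = np.array([1.0, 2.0, 4.0])
print(circle_of_confusion(depths, focus_dist=2.0, focal_len=0.05, aperture_diam=0.025))
```

Since the blur size is zero at the focus distance and grows in a known way on either side, matching observed blur against this kind of model constrains absolute scale, something a relative depth prediction alone cannot provide.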
A summary of key approaches in hybrid and physics-instrumented zero-shot MDE is provided below.
| Method | Inference-time metricization | Specialized for |
|---|---|---|
| Marigold+Defocus (Talegaonkar et al., 23 May 2025) | Optimizes depth for blur consistency | Shallow depth of field, stationary camera |
| SPADE (Zhang et al., 29 Oct 2025) | Global alignment to sparse depths + local per-pixel scale refinement (CCDT) | Underwater, SLAM/SfM/stereo priors |
| VINS-Rescale (Yang et al., 9 Sep 2025) | Monotonic spline fitting on sparse points | Aerial robotics |
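The global alignment step shared by the sparse-prior methods can be sketched as a least-squares fit of one scale and one shift. The sketch assumes the relative prediction and the sparse metric samples live in the same space (both depth, or both inverse depth) and that missing samples are marked as NaN. The function `align_to_sparse` and its `min_points` guard are hypothetical, and the sketch omits the learned local refinement and spline stages described above.

```python
import numpy as np

def align_to_sparse(rel_depth, sparse_metric, min_points=3):
    """Fit scale and shift so that scale * rel_depth + shift matches sparse samples.

    sparse_metric has the same shape as rel_depth, with NaN where no metric
    sample exists. Returns the aligned dense map plus the fitted parameters.
    """
    rel_depth = np.asarray(rel_depth, dtype=float)
    sparse_metric = np.asarray(sparse_metric, dtype=float)
    mask = np.isfinite(sparse_metric) & (sparse_metric > 0)
    if mask.sum() < min_points:
        raise ValueError("too few sparse points for a stable fit")
    # Linear system [rel, 1] @ [scale, shift] = metric, solved in least squares.
    A = np.stack([rel_depth[mask], np.ones(int(mask.sum()))], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, sparse_metric[mask], rcond=None)
    return scale * rel_depth + shift, scale, shift

rel = np.linspace(0.1, 1.0, 12).reshape(3, 4)
sparse = np.full_like(rel, np.nan)
for idx in [(0, 0), (0, 3), (1, 2), (2, 3)]:
    sparse[idx] = 2.5 * rel[idx] + 0.7   # synthetic metric samples
dense, scale, shift = align_to_sparse(rel, sparse)
print(scale, shift)
```

A two-parameter fit is only as good as the spread of its reference points, so samples clustered at a single depth leave the shift poorly determined even when the point count is sufficient.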
5. High-Resolution Refinement and Patchwise Processing
Inference on high-resolution imagery poses additional challenges: memory, depth discontinuities at patch boundaries, and generalization to fine details.
- Patch Refine Once (PRO): Efficient refinement mitigates discontinuities by training on grouped sets of four overlapping patches, enforcing a joint consistency loss across patch overlaps. Bias Free Masking discards unreliable synthetic GT regions, which is critical for avoiding overfitting when real GT is sparse or unreliable, and helps preserve sharpness and continuity even on gigapixel images (Kwon et al., 28 Mar 2025).
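The idea of penalizing disagreement where patches overlap can be illustrated with a small sketch. It uses a mean absolute difference over the shared region of two patches, which is a stand-in chosen for simplicity and not the exact consistency loss of PRO. The function name and the `(row, col)` origin convention are assumptions.

```python
import numpy as np

def overlap_consistency(patch_a, origin_a, patch_b, origin_b):
    """Mean absolute depth difference on the region shared by two patches.

    Origins are (row, col) positions of each patch's top-left corner in the
    full image. Returns 0.0 when the patches do not overlap.
    """
    (ya, xa), (yb, xb) = origin_a, origin_b
    ha, wa = patch_a.shape
    hb, wb = patch_b.shape
    y0, y1 = max(ya, yb), min(ya + ha, yb + hb)
    x0, x1 = max(xa, xb), min(xa + wa, xb + wb)
    if y0 >= y1 or x0 >= x1:
        return 0.0
    a = patch_a[y0 - ya:y1 - ya, x0 - xa:x1 - xa]
    b = patch_b[y0 - yb:y1 - yb, x0 - xb:x1 - xb]
    return float(np.mean(np.abs(a - b)))

full = np.arange(64, dtype=float).reshape(8, 8)
patch_a, patch_b = full[0:5, 0:5], full[3:8, 3:8]
print(overlap_consistency(patch_a, (0, 0), patch_b, (3, 3)))        # consistent
print(overlap_consistency(patch_a, (0, 0), patch_b + 2.0, (3, 3)))  # offset patch
```

A per-patch depth offset, the typical source of visible seams, shows up directly as a non-zero value on the overlap, so minimizing it during training discourages boundary discontinuities.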
6. Empirical Benchmarks, Transfer Performance, and Limitations
Zero-shot MDE models are benchmarked on diverse indoor (NYUv2, ScanNet, DIODE), outdoor (KITTI, DDAD, Waymo), challenging (ETH3D, nuScenes, iBims-1), and special-domain datasets (FLSea VI for underwater, custom UAV tunnels, Middlebury 2014).
- Performance: Metric3Dv2 achieves AbsRel = 0.052, δ₁ = 0.974 (KITTI), and AbsRel = 0.063, δ₁ = 0.975 (NYUv2) zero-shot, outperforming prior art (Hu et al., 2024). GVDepth matches or outperforms other methods on ground-vehicle imagery with orders of magnitude less training data via canonical transforms and cue fusion (AbsRel = 11.8%, KITTI→DDAD) (Koledic et al., 2024). SPADE shows state-of-the-art underwater transfer with real-time operation (AbsRel = 0.042 on FLSea) (Zhang et al., 29 Oct 2025).
- Ablation Insights: Removal of canonical transforms or geometric embeddings consistently causes dramatic zero-shot degradation (e.g., AbsRel doubling in GVDepth). Hybrid cue fusion, recurrent refinement, and localization of confidence improve robustness and detail recovery across domains (Koledic et al., 2024, Hu et al., 2024, Zavadski et al., 2024).
- Limitations:
  - Canonicalization approaches require access to intrinsic parameters at test time.
  - Methods relying on sparse priors may fail with insufficient (e.g., <3) or poorly distributed reference points.
  - Physics-guided metricization (e.g., defocus) assumes availability of calibrated, repeatable dual-aperture imaging and can be computationally expensive.
  - Patchwise refiners may struggle on scenes with extreme local variability if patch extent is not adapted.
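The AbsRel and δ₁ figures quoted in this section follow the standard definitions: AbsRel is the mean of $|\hat{d} - d| / d$ over valid pixels, and δ₁ is the fraction of pixels with $\max(\hat{d}/d,\, d/\hat{d}) < 1.25$. A minimal sketch of both is given below; the function name and the validity mask are assumptions, and benchmark protocols additionally apply dataset-specific depth caps and crops that are not reproduced here.

```python
import numpy as np

def depth_metrics(pred, gt):
    """Return (AbsRel, delta_1) over pixels where both depths are positive."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    valid = (gt > 0) & (pred > 0)
    p, g = pred[valid], gt[valid]
    abs_rel = float(np.mean(np.abs(p - g) / g))
    ratio = np.maximum(p / g, g / p)
    delta1 = float(np.mean(ratio < 1.25))
    return abs_rel, delta1

gt = np.array([1.0, 2.0, 4.0, 10.0])
pred = np.array([1.1, 2.0, 6.0, 10.0])
print(depth_metrics(pred, gt))
```

Lower AbsRel and higher δ₁ are better, and the two can disagree: a few large outliers raise AbsRel sharply while leaving δ₁ almost unchanged.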
7. Outlook and Future Directions
Current research trajectories point toward:
- Domain-agnostic, physics-guided priors: Incorporation of additional imaging cues (e.g., polarization, active illumination) and inference-time constraints (e.g., blur, SLAM, multi-modality) promises further robustness in domains like deep underwater or planetary caves (Zhang et al., 29 Oct 2025).
- Transformers and Foundational Representations: Adoption of foundation models and learned preimages (e.g., PrimeDepth, BetterDepth) is driving large increases in detail and transferability without prohibitive data collection (Zavadski et al., 2024, Zhang et al., 2024).
- Self-supervised calibration and adaptive patching: Future methods will likely automate camera parameter discovery and adapt inference granularity to local scene statistics.
- Unified multi-task learners: Advances in joint depth, normal, and semantic prediction—augmented with iterative geometric feedback and physics signals—are expected to realize reliable zero-shot 3D perception “in the wild” (Hu et al., 2024).
Zero-shot monocular depth estimation thus synthesizes geometric invariance, scalable learning, probabilistic cue integration, and physically-motivated domain adaptation to deliver dense, robust 3D scene understanding across application domains and environments previously inaccessible to single-view methods.