Zero-Shot Monocular Depth Estimation
- Zero-shot monocular depth estimation is defined as recovering metric or relative depth maps from a single image without test-time fine-tuning; scale-invariant losses and canonical camera transformations are common enablers.
- State-of-the-art approaches leverage deep network architectures, diffusion-based refinement, and tailored loss functions to ensure geometric consistency and robust cross-dataset generalization.
- By combining geometric cues, physical priors, and multi-dataset mixing strategies, these methods effectively address inherent scale ambiguity and variable camera parameters.
Zero-shot monocular depth estimation refers to recovering the metric (absolute) or relative depth map of a scene from a single RGB image, without model fine-tuning or dataset-specific adaptation at test time. Recent research demonstrates that both diffusion-based priors and large-scale supervised models can generalize surprisingly well to out-of-distribution scenes, but the problem remains fundamentally ill-posed due to scale ambiguity, camera parameter variation, and limited training data diversity. Solutions exploit geometric cues, learned priors, cross-domain dataset mixing, canonical transformations, test-time rescaling, and novel optimization strategies to mitigate these challenges.
1. Scale Ambiguity and Canonical Representation
The central challenge in monocular depth estimation is inherent scale ambiguity: pixel-wise depth is only determined up to a global scale and offset when predicting from a single image. This ambiguity is exacerbated in zero-shot settings, where camera intrinsics and environmental statistics at test time may differ from those in the training domain. Foundational contributions such as MiDaS (Ranftl et al., 2019) leverage scale- and shift-invariant losses, optimizing for affine-invariant depth or disparity fields to enable robust cross-dataset transfer. More recent works, such as Metric3Dv2 (Hu et al., 2024) and ZeroDepth (Guizilini et al., 2023), incorporate explicit canonical camera-space transformation modules to normalize scene geometry and disentangle depth prediction from focal length and pixel pitch. The Canonical Camera-Space Transformation (CSTM) applies either label scaling or input-image resizing so that, at both train and inference time, depths are referenced to a fixed canonical focal length $f_c$; metric scale is then recovered by rescaling the canonical prediction with the focal ratio, $d = d_c \cdot f / f_c$.
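As a concrete illustration, the de-canonicalization step reduces to a focal-ratio rescaling; the following is a minimal sketch of that step, with an arbitrary canonical focal length rather than the value used by any particular paper.

```python
import numpy as np

def decanonicalize_depth(depth_canonical: np.ndarray,
                         focal_px: float,
                         canonical_focal_px: float = 1000.0) -> np.ndarray:
    """Rescale a depth map predicted in a canonical camera space to metric depth.

    The network is trained as if every image were captured with a fixed canonical
    focal length f_c; at inference the canonical prediction is multiplied by the
    ratio of the true focal length f to f_c.  The default canonical focal length
    here is an arbitrary placeholder, not the value from any specific method.
    """
    return depth_canonical * (focal_px / canonical_focal_px)
```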
GVDepth (Koledic et al., 2024) further exploits the ground vehicle context, leveraging vertical image position and size-based cues, represented canonically to avoid dataset-specific overfitting. The probabilistic fusion framework adaptively weights object-size and vertical-geometric cues in a minimum-variance fashion, achieving zero-shot generalization across autonomous driving datasets with arbitrary camera setups.
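In its simplest form, minimum-variance fusion of two depth cues is inverse-variance averaging; the sketch below shows that textbook estimator and should not be read as GVDepth's exact probabilistic formulation.

```python
import numpy as np

def fuse_min_variance(d_obj: np.ndarray, var_obj: np.ndarray,
                      d_vert: np.ndarray, var_vert: np.ndarray) -> np.ndarray:
    """Inverse-variance (minimum-variance) fusion of two per-pixel depth estimates.

    d_obj / var_obj:   depth and variance from an object-size cue
    d_vert / var_vert: depth and variance from a vertical-position cue
    """
    w_obj, w_vert = 1.0 / var_obj, 1.0 / var_vert
    return (w_obj * d_obj + w_vert * d_vert) / (w_obj + w_vert)
```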
2. Deep Network Architectures and Loss Functions
Zero-shot monocular depth estimation frameworks have, since MiDaS (Ranftl et al., 2019), evolved to employ high-capacity vision transformers, residual encoders, and multi-scale convolutional decoders. Data-centric models such as AnyDepth (Ren et al., 6 Jan 2026) utilize DINOv3 ViT backbones, combined with an efficient Simple Depth Transformer (SDT) decoder (single-path weighted fusion, spatial detail enhancer, progressive upsampling). To reduce computation and improve training data quality, AnyDepth incorporates sample filtering based on depth distribution and smoothness scores.
Loss functions invariant to global scale and shift are common: for example, MiDaS and its successors minimize an affine-aligned MAE or MSE, computed after a per-image optimal scale-and-shift alignment, while Metric3Dv2 introduces the Random Proposal Normalization Loss (RPNL) for local patch-level invariance. Gradient-matching, multi-scale gradient, and surface-normal regularization losses further enhance edge fidelity and geometric consistency. BetterDepth (Zhang et al., 2024) adds global pre-alignment and local patch masking during diffusion-based refinement, ensuring that detail enhancement does not disrupt the underlying geometric priors.
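A minimal PyTorch sketch of a scale- and shift-invariant MAE is shown below, with the per-image affine alignment solved in closed form via least squares; robust trimming and the multi-scale gradient terms of the published losses are omitted.

```python
import torch

def ssi_mae(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Scale- and shift-invariant MAE in the spirit of MiDaS-style losses.

    pred, target: (H, W) disparity-like maps; mask: (H, W) boolean map of valid pixels.
    A per-image least-squares fit of scale s and shift b aligns the prediction to the
    target before the error is computed.
    """
    p, t = pred[mask], target[mask]
    # Solve min_{s,b} ||s * p + b - t||^2 in closed form.
    A = torch.stack([p, torch.ones_like(p)], dim=1)        # (N, 2) design matrix
    sol = torch.linalg.lstsq(A, t.unsqueeze(1)).solution   # (2, 1) -> [s, b]
    s, b = sol[0, 0], sol[1, 0]
    return torch.mean(torch.abs(s * p + b - t))
```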
3. Geometric and Physical Cues for Metric Disambiguation
Many state-of-the-art approaches exploit physical imaging cues or external priors to recover absolute scale at inference. The Marigold-based defocus blur approach (Talegaonkar et al., 23 May 2025) synthesizes lens blur using the thin-lens approximation. By capturing two images at different apertures, it fits affine scale and offset parameters and optimizes noise latents via gradient descent against a physically-motivated loss between observed and synthesized blur, anchored to known lens metadata. SPADE (Zhang et al., 29 Oct 2025), designed for underwater settings, incorporates sparse depth priors (from stereo, SLAM, or sonar), globally aligns predicted relative depth, and applies a Cascade Conv-Deformable Transformer (CCDT) for local refinement. Similar test-time rescaling strategies utilize sparse LiDAR, stereo correspondences, or visual-inertial SLAM-derived 3D points (Marsal et al., 2024, Yang et al., 9 Sep 2025), fitting affine or monotonic spline functions to map network predictions to metric depth.
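Test-time rescaling against sparse metric anchors can be illustrated with a least-squares affine fit; the sketch below assumes pixel-aligned anchor depths and stands in for the robust or monotonic-spline fits used in the cited works.

```python
import numpy as np

def rescale_to_metric(rel_depth: np.ndarray,
                      sparse_uv: np.ndarray,
                      sparse_z: np.ndarray) -> np.ndarray:
    """Map a relative (affine-invariant) depth map to metric depth using sparse anchors.

    rel_depth: (H, W) network prediction, valid up to scale and shift
    sparse_uv: (N, 2) integer pixel coordinates (u, v) of anchors (e.g. LiDAR or SLAM points)
    sparse_z:  (N,)   metric depths at those pixels
    """
    samples = rel_depth[sparse_uv[:, 1], sparse_uv[:, 0]]   # predicted values at the anchors
    A = np.stack([samples, np.ones_like(samples)], axis=1)  # (N, 2) design matrix
    (scale, offset), *_ = np.linalg.lstsq(A, sparse_z, rcond=None)
    return scale * rel_depth + offset
```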
PrimeDepth (Zavadski et al., 2024) and GRIN (Guizilini et al., 2024) extend this paradigm with efficient diffusion models leveraging frozen pre-trained generative priors and explicit 3D geometric conditioning. GRIN, in particular, achieves direct metric predictions from scratch using pixel-level diffusion with camera intrinsics supplied as per-pixel positional encodings and is robust to unstructured, sparse depth supervision.
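Per-pixel conditioning on camera intrinsics can be approximated by feeding the network unit viewing rays derived from the camera matrix; the sketch below shows this generic construction, not GRIN's specific positional encoding.

```python
import numpy as np

def pixel_ray_encoding(K: np.ndarray, height: int, width: int) -> np.ndarray:
    """Per-pixel unit viewing directions derived from a 3x3 intrinsics matrix K.

    Returns an (H, W, 3) array of ray directions; concatenating such an encoding
    to the input is one simple way to expose intrinsics to the network per pixel.
    """
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(width), np.arange(height))
    rays = np.stack([(u - cx) / fx, (v - cy) / fy, np.ones(u.shape)], axis=-1)
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)
```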
4. Dataset Mixing, Transfer Strategies, and Generalization
Robust generalization in zero-shot monocular depth estimation is underpinned by diverse dataset mixing strategies. MiDaS (Ranftl et al., 2019) combines five sources—web stereo, SfM reconstructions, video stereo, metric RGB-D, and cinematic frames—training with scale- and shift-invariant losses and principled multi-objective (MGDA) gradient descent to avoid dataset bias. Metric3Dv2 (Hu et al., 2024) amasses 16 million images from 18 sources with known intrinsics, stratified batching, and per-image canonical transformation, achieving state-of-the-art zero-shot metric depth and normals. Ablations repeatedly highlight the necessity of scale disentanglement, knowledge distillation (e.g. joint depth-normal optimization), and geometric augmentation (crops, focal jitter, ray noise) in securing cross-domain adaptation.
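Stratified batching across heterogeneous sources can be as simple as drawing a fixed quota of samples from every dataset; the sketch below illustrates only that idea, whereas the cited works combine it with loss weighting and multi-objective (MGDA) balancing.

```python
import random

def stratified_batch(datasets: dict, per_source: int) -> list:
    """Draw one batch containing a fixed number of samples from each source dataset.

    datasets: mapping from dataset name to a list of sample identifiers
    per_source: samples drawn per dataset (assumed <= the size of each dataset)
    """
    batch = []
    for name, samples in datasets.items():
        batch.extend((name, s) for s in random.sample(samples, per_source))
    random.shuffle(batch)  # avoid ordering the batch by dataset
    return batch
```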
Patch-wise refinement methods such as PRO (Kwon et al., 28 Mar 2025) address the resolution mismatch between training and test images via grouped patch consistency training and bias-free masking. PRO efficiently processes high-resolution images in a tiling scheme while enforcing cross-patch consistency, mitigating seam artifacts typical of naive patch-based refinement and running orders of magnitude faster than multi-crop ensembling.
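Seam-free merging of per-tile predictions is commonly done with feathered blending of overlapping tiles; the sketch below shows such a generic tiling scheme, not PRO's grouped patch consistency training itself.

```python
import numpy as np

def blend_tiles(tiles: list, coords: list, full_shape: tuple) -> np.ndarray:
    """Merge per-tile depth patches into a full map with Hann-window feathering.

    tiles:  list of (h, w) depth patches
    coords: list of (top, left) offsets of each patch in the full image
    Assumes the tiles jointly cover the full image; the window weights suppress
    seams where neighbouring tiles overlap.
    """
    acc = np.zeros(full_shape, dtype=np.float64)
    wsum = np.zeros(full_shape, dtype=np.float64)
    for patch, (top, left) in zip(tiles, coords):
        h, w = patch.shape
        weight = np.outer(np.hanning(h), np.hanning(w)) + 1e-6  # keep border weights nonzero
        acc[top:top + h, left:left + w] += weight * patch
        wsum[top:top + h, left:left + w] += weight
    return acc / wsum
```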
5. Evaluation Protocols and Quantitative Results
Zero-shot frameworks are evaluated out-of-the-box on unseen domains, with no scale correction or fine-tuning. Widely reported metrics include Absolute Relative Error (AbsRel), RMSE, scale-invariant log error, and threshold accuracy ($\delta_1$, $\delta_2$, $\delta_3$: the percentage of pixels whose predicted depth is within a factor of $1.25$, $1.25^2$, or $1.25^3$ of ground truth); a minimal computation of these metrics is sketched after the results below. Noteworthy results include:
- Metric3Dv2 (Hu et al., 2024): AbsRel = 0.063, RMSE = 0.251 on NYUv2; AbsRel = 0.052, RMSE = 2.511 on KITTI.
- GRIN (Guizilini et al., 2024): AbsRel = 0.046 on KITTI, 0.093 on DDAD, and 0.058 on NYUv2.
- SPADE (Zhang et al., 29 Oct 2025): AbsRel = 0.042 (FLSea), 0.025 (Lizard Island); >15 FPS inference.
- AnyDepth (Ren et al., 6 Jan 2026): AbsRel improvements of 7–23% over DPT on NYUv2, KITTI, ETH3D, ScanNet, DIODE.
- PrimeDepth (Zavadski et al., 2024), ensembled with Depth Anything: sets a new zero-shot SOTA (Avg. Rank = 1.3) across five datasets; ≈110× faster than Marigold.
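As noted above, the reported numbers follow the standard metric definitions; a minimal NumPy computation, omitting per-benchmark cropping and depth-cap conventions, is sketched here.

```python
import numpy as np

def depth_metrics(pred: np.ndarray, gt: np.ndarray, mask: np.ndarray) -> dict:
    """Standard monocular depth metrics over the valid pixels given by mask."""
    p, g = pred[mask], gt[mask]
    ratio = np.maximum(p / g, g / p)
    return {
        "AbsRel": float(np.mean(np.abs(p - g) / g)),
        "RMSE": float(np.sqrt(np.mean((p - g) ** 2))),
        "delta1": float(np.mean(ratio < 1.25)),
        "delta2": float(np.mean(ratio < 1.25 ** 2)),
        "delta3": float(np.mean(ratio < 1.25 ** 3)),
    }
```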
A consistent finding is that scale- and shift-invariant objectives, per-image canonical normalization, and diffusion-based refinement yield competitive or superior results compared to fine-tuned or single-dataset baselines, especially when augmented with physical or geometric priors.
6. Limitations and Future Directions
Zero-shot monocular metric depth estimation, despite substantial progress, remains limited by several factors:
- Inference speed: diffusion-based refinement and latent optimization are computationally demanding (e.g., Marigold+defocus ≈4 min/scene; GRIN ≈0.8 s/frame).
- Sparse geometry anchoring: external anchors such as sparse depth, IMU, or SLAM points are necessary for metric grounding; failure in anchor generation degrades results.
- Domain-specific artifacts: highly reflective, transparent, or textureless surfaces, extreme camera calibrations, and novel artistic styles can impair generalization.
- Physical modeling limitations: lens blur, occlusions, and non-idealities in the PSF model may require coded-aperture designs or improved forward models (Talegaonkar et al., 23 May 2025).
- Model complexity and deployability: large transformer/diffusion backbones and multi-stage refinement pipelines challenge resource-constrained deployment.
- Patch-wise strategies: content-adaptive tiling and global attention mechanisms may further enhance efficiency and consistency (Kwon et al., 28 Mar 2025).
- Self-supervision and joint tasks: extensions to multi-modal cue fusion (inertial, stereo), surface normals, and uncertainty modeling are advocated across works.
A plausible implication is that future zero-shot monocular depth estimators will incorporate meta-learned adaptation mechanisms, uncertainty-aware rescaling, neural radiance cues, and multi-task geometric pipelines, with broader applications in robotics, AR, underwater inspection, and single-image metrology.
7. Comparative Table: Notable Zero-Shot Monocular Depth Estimation Methods
| Method / Paper | Key Innovation | Absolute Depth Recovery | Generalization Mechanism |
|---|---|---|---|
| MiDaS (Ranftl et al., 2019) | Scale- & shift-invariant loss, multi-dataset mixing | via affine alignment | MGDA multi-objective training |
| ZeroDepth (Guizilini et al., 2023) | Geometric embedding, decoupled latent | direct (metric) | Encoder-geometry fusion |
| Metric3Dv2 (Hu et al., 2024) | Canonical-space transformation, joint depth-normals | direct (metric) | 16M images, 18 datasets |
| Marigold+Defocus (Talegaonkar et al., 23 May 2025) | Defocus blur cues, latent optimization | direct (metric, training-free) | Test-time gradient descent |
| PRO (Kwon et al., 28 Mar 2025) | Grouped patch consistency, bias-free masking | affine-invariant | Patch-wise refinement |
| SPADE (Zhang et al., 29 Oct 2025) | Sparse depth priors, CCDT refinement | metric (real-time) | Two-stage pipeline, deformable attention |
| BetterDepth (Zhang et al., 2024) | Plug-and-play diffusion refiner | affine + details | Conditioning from pretrained backbone |
| PrimeDepth (Zavadski et al., 2024) | Frozen SD preimage, single-step refiner | affine-invariant, ensemble | Multi-scale SD prior |
| GRIN (Guizilini et al., 2024) | Pixel-level diffusion, geometric conditioning | metric (direct) | Sparse training, per-pixel encoding |
| AnyDepth (Ren et al., 6 Jan 2026) | Simple transformer decoder, quality filtering | affine-invariant | Data-centric, lightweight |
This selection illustrates the diverse landscape of zero-shot monocular depth estimation, the role of architectural and algorithmic innovations, and the convergence of physical modeling, computational efficiency, and geometric priors in addressing scale ambiguity and cross-domain transfer.