Depth Foundation Models

Updated 3 July 2026

Depth Foundation Models (DFMs) are large-scale neural networks with billions of parameters trained on tens of millions of images to enable zero-shot 3D scene understanding.
They integrate diverse architectural paradigms across monocular, stereo, and multi-view inputs to achieve robust depth estimation in varied applications.
Empirical evaluations show DFMs excel in tasks like 3D reconstruction and robotics, although challenges remain in data diversity, consistency, and explainability.

Depth Foundation Models (DFMs) are a paradigm of large-scale deep neural networks for depth estimation, engineered to learn universal, transferable geometric priors from massive datasets. Characterized by billion-scale parameters and exposure to tens of millions of diverse scenes, DFMs shift the field from narrowly trained regressors toward universal, zero-shot 3D scene understanding. This article synthesizes current research on DFM definitions, motivating factors, architectural variants, learning paradigms, empirical validation, and open challenges, drawing on recent surveys and technical reports in computer vision and robotics.

1. Definition and Motivation

A Depth Foundation Model (DFM) is defined by three criteria: (1) model scale on the order of a billion or more parameters; (2) training on tens of millions (or more) images encompassing heterogeneous domains; (3) robust zero-shot capability, i.e., strong generalization to unseen scenes and sensor configurations without retraining or fine-tuning (Xu et al., 15 Jul 2025, Li et al., 21 Jul 2025). In contrast, classical depth models have up to tens of millions of parameters, are dataset-specific (e.g., KITTI, NYU), and require explicit adaptation for each new domain.

The DFM movement emerged from several empirical and technological realities:

Hardware depth sensors (LiDAR, ToF) entail cost, low spatial fidelity, and environmental fragility; mobile consumer devices cannot deploy dense LiDAR.
Domain-specific models generalize poorly in edge cases (e.g., underwater, aerial, or nocturnal imagery).
Scaling laws demonstrated that increasing both parameter count and data diversity produces emergent generalization, consistent with the rise of BERT/GPT and generative vision models (e.g., Stable Diffusion).

DFMs thus pursue a software-based “visual LiDAR”: robust, universal depth estimation across robotics, AR/VR, and autonomous systems (Xu et al., 15 Jul 2025).

2. Architectural Taxonomy

DFMs are organized by canonical input regimes and representative architectures. Each regime embodies a specific data association challenge and has evolved along distinct architectural lines:

Category	Backbone Progression	Key Innovations
Monocular	CNNs → ResNet–U-Net → ViT/DPT → diffusion-based (Marigold)	Scale-invariant/ordinal losses, binning, metric prediction via intrinsics
Stereo	CNN cost volume → Transformer-based stereo → recurrent RAFT-style, DFM hybrids	Monocular priors as pseudo-cost, diffusion stereo inpainting
Multi-View	2D CNNs → 3D CNN cascades → ViT-based (MVSFormer++) → diffusion + SfM points	Cost-volume reprojection, global context, photometric multi-view losses
Monocular Video	CNN+LSTM → Test-time optimization (CVD) → Video ViT → video diffusion depth	Temporal consistency, end-to-end sequence denoising

Depth Representation

Early models regressed absolute depth, but ambiguity led to scale/affine-invariant losses and, more recently, depth discretization (AdaBins).
Metric DFMs compute absolute depth from relative predictions and camera intrinsics: $\hat{d} = \alpha(K) \cdot \hat{r}$, with $\alpha$ dependent on focal length and bin centers.
Scale-invariant relative losses penalize errors only up to a global scale factor. With log-depth residuals $g_i = \log d_i - \log \hat{d}_i$, the loss is $L_{SI} = \frac{1}{n}\sum_i g_i^2 - \frac{1}{n^2}\left(\sum_i g_i\right)^2$ (Xu et al., 15 Jul 2025).
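The scale-invariance property can be checked numerically. The following is a minimal sketch, assuming the log-depth form of the loss (residuals taken between log depths, as in the formulation of Eigen et al.); the function name and the toy depth values are illustrative and are not taken from any cited model.

```python
import math

def scale_invariant_log_loss(pred, target):
    """Scale-invariant error between two equal-length sequences of positive depths."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and the same length")
    # Per-pixel residuals in log-depth space.
    g = [math.log(t) - math.log(p) for p, t in zip(pred, target)]
    n = len(g)
    mean_sq = sum(x * x for x in g) / n   # (1/n) * sum g_i^2
    sq_mean = (sum(g) / n) ** 2           # (1/n^2) * (sum g_i)^2
    return mean_sq - sq_mean

# Toy values (hypothetical): one prediction is correct up to scale, one is not.
target = [1.0, 2.0, 4.0, 8.0]
scaled = [3.0 * d for d in target]   # right structure, wrong global scale
warped = [1.0, 2.5, 3.0, 9.0]        # wrong relative structure

print(scale_invariant_log_loss(scaled, target))  # numerically zero
print(scale_invariant_log_loss(warped, target))  # strictly positive
```

Multiplying every predicted depth by a constant adds the same offset to each log residual, and the second term cancels that offset exactly, which is why the loss ignores global scale.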

Hybrid and Fusion Strategies

DFMs in stereo and multi-view tasks often incorporate monocular priors as “pseudo-costs,” initialize disparity with DFM outputs, and refine via CNNs or recurrent updates (Jiang et al., 16 Jan 2025, Zhu et al., 16 Apr 2025). In recent multi-modal settings (e.g., surgical vision), explicit geometric tokenization aligns depth/shape information with RGB cues, accelerating cross-task learning (Han et al., 26 Jan 2026).
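As a rough illustration of disparity initialization, the sketch below converts a monocular depth prediction into an initial disparity map for a rectified stereo pair using the standard relation disparity = focal length × baseline / depth. It assumes the monocular prediction is already metric; relative (affine-invariant) predictions would first need a scale and shift estimate. The function name and all numeric values are hypothetical, and this is not the initialization procedure of any cited method.

```python
def depth_to_disparity(depth_m, focal_px, baseline_m, min_depth_m=1e-3):
    """Convert a metric depth map (list of rows, metres) to disparity in pixels.

    Assumes a rectified stereo pair with focal length in pixels and
    baseline in metres. Depths are clamped to avoid division by zero.
    """
    return [
        [focal_px * baseline_m / max(d, min_depth_m) for d in row]
        for row in depth_m
    ]

# Hypothetical 2x2 monocular depth prediction and camera parameters.
mono_depth = [[2.0, 4.0], [8.0, 16.0]]
init_disp = depth_to_disparity(mono_depth, focal_px=800.0, baseline_m=0.1)
print(init_disp)
```

A recurrent or CNN refiner would then update `init_disp` using the actual left-right matching evidence; starting near the right answer is what lets the monocular prior help in textureless or occluded regions.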

3. Data, Training Paradigms, and Scaling Laws

Training Objectives

Supervised: Dense or sparse supervision with real-world ground truth (e.g., NYU, KITTI, A2D2).
Self-supervised: Photometric consistency (Godard et al.), geometric proxy losses, multi-task constraints (GeoNet).
Semi-supervised & Pseudo-supervised: Using robust monocular DFMs to generate dense pseudo-labels for sparsely labeled or label-free settings (Liang et al., 21 Mar 2025, Zhu et al., 16 Apr 2025).
Contrastive & Domain Adaptation: Matching feature distributions across real/synthetic domains to bridge the sim-to-real gap.
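The pseudo-supervised setting can be sketched as follows: a relative depth prediction is aligned to a handful of sparse measurements by a closed-form least-squares scale and shift, and the aligned map is then used as a dense pseudo-label. This is a minimal sketch under stated assumptions: the alignment is done directly in depth space (some methods align in inverse-depth space instead), `None` marks pixels without a measurement, and the function names and values are illustrative.

```python
def fit_scale_shift(rel, sparse_gt):
    """Least-squares (s, t) minimising sum_i (s * rel_i + t - gt_i)^2 over valid pixels."""
    pairs = [(r, g) for r, g in zip(rel, sparse_gt) if g is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two valid measurements")
    sum_r = sum(r for r, _ in pairs)
    sum_g = sum(g for _, g in pairs)
    sum_rr = sum(r * r for r, _ in pairs)
    sum_rg = sum(r * g for r, g in pairs)
    denom = n * sum_rr - sum_r ** 2
    if abs(denom) < 1e-12:
        raise ValueError("relative depths are constant on the valid pixels")
    s = (n * sum_rg - sum_r * sum_g) / denom
    t = (sum_g - s * sum_r) / n
    return s, t

def densify(rel, sparse_gt):
    """Apply the fitted scale and shift to every pixel to obtain a dense pseudo-label."""
    s, t = fit_scale_shift(rel, sparse_gt)
    return [s * r + t for r in rel]

# Hypothetical flattened image: five pixels, two with sparse LiDAR returns.
rel = [0.1, 0.2, 0.3, 0.4, 0.5]
sparse = [None, 2.5, None, 4.5, None]
print(densify(rel, sparse))
```

With only two valid measurements the fit is exact; with more, it averages out sensor noise, which is one reason sparse LiDAR plus a strong relative prior can stand in for dense ground truth.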

Data Regimes

DFM pretraining exploits both synthetic and real large-scale datasets:

Dataset	Scale / Type	Notable Properties
ARKitScenes	5K scenes, 450M frames	Dense LiDAR, consumer device
ScanNet++	1.8K scenes, 3.7M frames	Dense RGB-D, indoor
A2D2, Argoverse2	$\sim 10^5$ frames	Sparse LiDAR, driving
OBJVerse/MVImgNet	200–300M images	Synthetic, multi-view object
Bedlam, PointOdyssey	$\sim 10^6$ scenes	Synthetic dynamic scenes

Scaling trends are pronounced: 2015-era models used $\sim$10k images and $<1$M parameters; by 2024, DFMs routinely exceed $10^9$ parameters and train on tens of millions of images. Larger models show reduced error and improved zero-shot transfer, mirroring trends in natural language processing (Xu et al., 15 Jul 2025, Li et al., 21 Jul 2025).

4. Empirical Performance and Evaluation

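DFMs have historically been benchmarked with metrics computed after scale/shift alignment to ground truth. Proxy-task evaluation, which measures utility in downstream applications, has been proposed as a more robust alternative that better reflects real-world deployment (Li et al., 21 Jul 2025).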

Proxy-Task Evaluation

BenchDepth proposes direct proxy-task evaluation in five settings, each measuring DFM efficacy for an actual application:

Task	Relevance/Method	Highlighted Findings
Depth Completion	Fill dense map from sparse LiDAR + RGB	DAV2-Rel leads, surpassing “no-depth” and metric DFMs
Stereo Matching	Rectified pair → disparity map	Affine-invariant DFMs ↑ zero-shot generalization
3D Reconstruction	3DGS from single RGB + DFM depth	Metric–fine-tuned DFMs excel in view synthesis
SLAM	Trajectory + dense mapping from video	Relative DFMs (DAV2-Rel) enable improved mapping
Vision–Language	Spatial QA with image/depth	All DFMs yield similar gains, exposing VLM limits

Representative Empirical Results

Ranftl et al. (DPT) achieved sub-5% relative error zero-shot across 10 benchmarks (Xu et al., 15 Jul 2025).
Metric3D v2, UniDepth: sub-0.5m absolute error outdoors, no fine-tuning.
DEFOM-Stereo: first place on KITTI, Middlebury, and ETH3D stereo leaderboards (Jiang et al., 16 Jan 2025).
DFMs outperform prior art in real-world zero-shot canopy height estimation with roughly 1/6 of the parameters and minimal compute (Cambrin et al., 2024).

Limitations of Alignment-Based Evaluation

Standard metric alignment exaggerates differences between representation types (depth, disparity, point map), is sensitive to outliers, and favors over-smoothed predictions. Proxy-task evaluation circumvents these issues by focusing on end-task utility (Li et al., 21 Jul 2025).
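For concreteness, the sketch below computes two commonly used alignment-based metrics, absolute relative error (AbsRel) and the δ < 1.25 inlier ratio, after a simple median-scaling alignment. Median scaling is used here as the simplest alignment variant and is an assumption of this sketch; benchmarks often use a least-squares scale and shift instead. All depth values are toy data.

```python
import statistics

def median_align(pred, gt):
    """Rescale predictions so that their median matches the ground-truth median."""
    scale = statistics.median(gt) / statistics.median(pred)
    return [scale * p for p in pred]

def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)

def delta1(pred, gt, thresh=1.25):
    """Fraction of pixels whose ratio max(pred/gt, gt/pred) is below the threshold."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < thresh)
    return hits / len(gt)

gt = [1.0, 2.0, 3.0, 4.0, 5.0]
clean = median_align([0.5, 1.0, 1.5, 2.0, 2.5], gt)     # correct up to scale
outlier = median_align([0.5, 1.0, 1.5, 2.0, 25.0], gt)  # one grossly wrong pixel

print(abs_rel(clean, gt), delta1(clean, gt))
print(abs_rel(outlier, gt), delta1(outlier, gt))
```

In this toy example a single bad pixel raises AbsRel from 0 to 1.8 while the inlier ratio only falls from 1.0 to 0.8, a small illustration of how strongly the choice of metric shapes the apparent ranking of models.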

5. Specialized Domains

DFMs have been tailored for or leveraged in specialized domains:

Surgical and Medical Vision: Fine-tuned or adapted ViT-based DFMs (e.g., Surgical-DINO using LoRA, Surgical Depth Anything) achieve SOTA depth estimation in endoscopic scenes, correcting for domain shift and specular artifacts (Cui et al., 2024, Lou et al., 2024, Han et al., 26 Jan 2026).
Robotics: Self-supervised depth-only DFMs (DeFM) extract universal geometric features from 60M depth images, facilitating sim-to-real transfer across navigation, manipulation, and segmentation, and can be distilled into compact variants for onboard deployment (Patel et al., 26 Jan 2026).
Depth Completion: DFM-derived priors provide dense geometric pseudo-supervision for LiDAR completion, removing the scale ambiguity and outperforming fully supervised baselines even out-of-distribution (Liang et al., 21 Mar 2025, Chen et al., 7 Aug 2025).
Multi-View and Stereo: Integration of DFM priors as pseudo-labels, disparity initializers, or supervision significantly boosts unlabelled MVS and zero-shot stereo (Zhu et al., 16 Apr 2025, Jiang et al., 16 Jan 2025).
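The low-rank adaptation (LoRA) mentioned in the surgical setting can be illustrated generically. In LoRA, a frozen weight matrix W is augmented with a trainable low-rank update, giving W + (alpha / r) · B · A, where B starts at zero so that the adapted layer initially reproduces the pretrained one. The sketch below is a toy, dependency-free illustration of that idea; the class name, dimensions, and initialization scale are assumptions, and it is not the Surgical-DINO implementation.

```python
import random

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

class LoRALinear:
    """Frozen linear layer with a trainable rank-r update: W + (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight                 # pretrained weights, kept frozen
        self.scale = alpha / rank
        # A gets a small random init; B starts at zero so the update is initially zero.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.weight, x)
        update = matvec(self.B, matvec(self.A, x))
        return [b + self.scale * u for b, u in zip(base, update)]

    def trainable_params(self):
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])

# Hypothetical 4x4 identity "pretrained" layer adapted with rank 1.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = LoRALinear(identity, rank=1, alpha=1.0)
print(layer.forward([1.0, 2.0, 3.0, 4.0]))
print(layer.trainable_params(), "trainable vs", 4 * 4, "frozen")
```

Only A and B are updated during fine-tuning, so the number of trainable parameters grows with the rank rather than with the full layer size, which is what makes adapting a billion-parameter backbone to a small surgical dataset tractable.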

6. Applications and Broader Impact

DFMs underpin a broad array of vision tasks:

3D Reconstruction: High-fidelity point clouds and volumetric mappings for AR/VR, SLAM, and scene reconstruction (SimpleRecon, NeuRIS).
Novel View Synthesis: DFM priors guide radiance field construction (NeRF) and Gaussian Splatting, improving rendering convergence and fidelity.
Robotics/Autonomous Driving: Camera-only 3D perception for real-time mapping, navigation, and collision avoidance at commodity cost.
World Modeling in Video: Video diffusion DFMs encode 3D inductive biases for future-frames synthesis and policy planning.
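Several of these applications start by lifting a predicted depth map into 3D. The sketch below back-projects a depth map into a camera-frame point cloud with the pinhole model, X = (u − c_x) · Z / f_x and Y = (v − c_y) · Z / f_y. It assumes metric depth measured along the optical axis and no lens distortion; the function name and the toy intrinsics are illustrative.

```python
def backproject(depth, fx, fy, cx, cy):
    """Lift a depth map (list of rows, depth along the optical axis) to 3D points.

    Pixels with missing (None) or non-positive depth are skipped.
    Returns a list of (x, y, z) tuples in the camera frame.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z is None or z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points

# Hypothetical 2x2 depth map with one invalid pixel and toy intrinsics.
depth = [[2.0, 2.0],
         [0.0, 4.0]]
cloud = backproject(depth, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
print(cloud)
```

The resulting points can seed a SLAM map or initialize Gaussian Splatting; with a relative rather than metric DFM, the cloud is only defined up to the unknown scale and shift.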

Additionally, DFMs dramatically increase data efficiency (e.g., MultiMAE in surgical tasks achieves higher accuracy with only 25% of labels) and enable plug-and-play downstream transfer via frozen backbones (Han et al., 26 Jan 2026).

7. Outstanding Challenges and Future Directions

Despite strong performance, DFMs face several technical and epistemological barriers (Xu et al., 15 Jul 2025, Tan et al., 21 Apr 2025):

Data scale/diversity: Gathering or synthesizing large-scale, high-fidelity ground-truth remains costly; advances in self-supervision and synthetic realism are needed.
Spatial and temporal consistency: Open questions remain regarding unified, volumetric representation across monocular/multi-view/video inputs.
Multi-task Learning: Extending DFMs to predict normals, semantics, and optical flow alongside depth in a single model, analogous to GPT-style LLMs, is an active frontier.
Geometric Inductive Biases: Reconciling brute-force scaling with architecturally-imposed geometric priors (e.g., epipolar attention, neural implicit representations) is underexplored.
Explainability: Intrinsic barriers due to scale, nonlinearity, and high-dimensional data dependence limit mechanistic interpretability; explainability may need to shift toward empirical testing and behavioral assurances (Tan et al., 21 Apr 2025).

As DFMs increase in scale and diversity, the prospect emerges of universal, “visual LiDAR”–level scene understanding from a single camera, transforming applications in world modeling, AR/VR, medical vision and robotic autonomy (Xu et al., 15 Jul 2025, Li et al., 21 Jul 2025, Patel et al., 26 Jan 2026).