Depth Foundation Models
- Depth Foundation Models (DFMs) are large-scale neural networks with billions of parameters trained on tens of millions of images to enable zero-shot 3D scene understanding.
- They integrate diverse architectural paradigms across monocular, stereo, and multi-view inputs to achieve robust depth estimation in varied applications.
- Empirical evaluations show DFMs excel in tasks like 3D reconstruction and robotics, although challenges remain in data diversity, consistency, and explainability.
Depth Foundation Models (DFMs) are a paradigm of large-scale deep neural networks for depth estimation, engineered to learn universal, transferable geometric priors from massive datasets. Characterized by billion-scale parameters and exposure to tens of millions of diverse scenes, DFMs shift the field from narrowly trained regressors toward universal, zero-shot 3D scene understanding. This article synthesizes current research on DFM definitions, motivating factors, architectural variants, learning paradigms, empirical validation, and open challenges, drawing on recent surveys and technical reports in computer vision and robotics.
1. Definition and Motivation
A Depth Foundation Model (DFM) is defined by three criteria: (1) model scale on the order of a billion or more parameters; (2) training on tens of millions (or more) images encompassing heterogeneous domains; (3) robust zero-shot capability, i.e., strong generalization to unseen scenes and sensor configurations without retraining or fine-tuning (Xu et al., 15 Jul 2025, Li et al., 21 Jul 2025). In contrast, classical depth models have up to tens of millions of parameters, are dataset-specific (e.g., KITTI, NYU), and require explicit adaptation for each new domain.
The DFM movement emerged from several empirical and technological realities:
- Hardware depth sensors (LiDAR, ToF) entail cost, low spatial fidelity, and environmental fragility; mobile consumer devices cannot deploy dense LiDAR.
- Domain-specific models generalize poorly in edge cases (e.g., underwater, aerial, or nocturnal imagery).
- Scaling laws demonstrated that increasing both parameter count and data diversity produces emergent generalization, consistent with the rise of BERT/GPT and generative vision models (e.g., Stable Diffusion).
DFMs thus pursue software-based “visual LiDAR”—robust, universal depth estimation across robotics, AR/VR, and autonomous systems (Xu et al., 15 Jul 2025).
2. Architectural Taxonomy
DFMs are organized by canonical input regimes and representative architectures. Each regime embodies a specific data association challenge and has evolved along distinct architectural lines:
| Category | Backbone Progression | Key Innovations |
|---|---|---|
| Monocular | CNNs → ResNet–U-Net → ViT/DPT → diffusion-based (Marigold) | Scale-invariant/ordinal losses, binning, metric prediction via intrinsics |
| Stereo | CNN cost volume → Transformer-based stereo → recurrent RAFT-style, DFM hybrids | Monocular priors as pseudo-cost, diffusion stereo inpainting |
| Multi-View | 2D CNNs → 3D CNN cascades → ViT-based (MVSFormer++) → diffusion + SfM points | Cost-volume reprojection, global context, photometric multi-view losses |
| Monocular Video | CNN+LSTM → Test-time opt (CVD) → Video ViT → video diffusion depth | Temporal consistency, end-to-end sequence denoising |
Depth Representation
- Early models regressed absolute depth, but ambiguity led to scale/affine-invariant losses and, more recently, depth discretization (AdaBins).
- Metric DFMs recover absolute depth from relative predictions and camera intrinsics, with the conversion dependent on focal length and bin centers.
- Scale-invariant relative losses measure error only up to a global scale factor, so a prediction is not penalized for being uniformly too large or too small (Xu et al., 15 Jul 2025).
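To make the idea of a scale-invariant loss concrete, the following is a minimal sketch of one widely used form, the scale-invariant log loss: L = mean(g_i^2) - lam * mean(g_i)^2, with g_i = log(pred_i) - log(target_i). It is an illustration only and is not necessarily the formulation used by any particular DFM; the function name, the `lam` parameter, and the toy values are assumptions made for this example.

```python
import math


def scale_invariant_log_loss(pred, target, lam=0.5):
    """Scale-invariant log loss over paired, strictly positive depths.

    With lam=1.0 the loss ignores any global scale factor applied to pred;
    with lam=0.0 it reduces to a plain mean squared error in log space.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    # Per-pixel log-depth residuals.
    g = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(g)
    mean_of_squares = sum(x * x for x in g) / n
    square_of_mean = (sum(g) / n) ** 2
    return mean_of_squares - lam * square_of_mean


# Hypothetical toy data: the prediction is exactly twice the ground truth.
target = [1.0, 2.0, 4.0]
pred = [2.0 * t for t in target]

loss_invariant = scale_invariant_log_loss(pred, target, lam=1.0)
loss_plain = scale_invariant_log_loss(pred, target, lam=0.0)
```

In this toy case the fully invariant variant (`lam=1.0`) is zero up to rounding, while the plain log-space error equals `log(2)**2`, which shows how the second term cancels a uniform scale error.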
Hybrid and Fusion Strategies
DFMs in stereo and multi-view tasks often incorporate monocular priors as “pseudo-costs,” initialize disparity with DFM outputs, and refine via CNNs or recurrent updates (Jiang et al., 16 Jan 2025, Zhu et al., 16 Apr 2025). In recent multi-modal settings (e.g., surgical vision), explicit geometric tokenization aligns depth/shape information with RGB cues, accelerating cross-task learning (Han et al., 26 Jan 2026).
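One simple way to picture disparity initialization from a monocular prior is a least-squares scale and shift fit against a few trusted disparity anchors. The sketch below is a generic illustration under that assumption, not the procedure of DEFOM-Stereo or any other cited method; the function names, anchor indices, and values are hypothetical.

```python
def fit_scale_shift(prior, anchors):
    """Closed-form least squares for s, t minimizing sum((s*prior + t - anchors)^2)."""
    n = len(prior)
    if n != len(anchors) or n < 2:
        raise ValueError("need at least two paired samples")
    mean_p = sum(prior) / n
    mean_a = sum(anchors) / n
    var_p = sum((p - mean_p) ** 2 for p in prior)
    if var_p == 0.0:
        raise ValueError("prior values at the anchors must not all be equal")
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(prior, anchors))
    s = cov / var_p
    t = mean_a - s * mean_p
    return s, t


def init_disparity(prior_map, anchor_idx, anchor_disp):
    """Map an affine-invariant prior to an initial disparity map (clamped at 0)."""
    s, t = fit_scale_shift([prior_map[i] for i in anchor_idx], anchor_disp)
    return [max(0.0, s * p + t) for p in prior_map]


# Hypothetical flattened prior (relative inverse depth) and two sparse anchors.
prior_map = [0.1, 0.2, 0.4, 0.8]
anchor_idx = [0, 3]
anchor_disp = [7.0, 35.0]

initial_disparity = init_disparity(prior_map, anchor_idx, anchor_disp)
```

A downstream stereo network would then refine `initial_disparity` with its own updates; the fit only supplies a plausible starting point in disparity units.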
3. Data, Training Paradigms, and Scaling Laws
Training Objectives
- Supervised: Dense or sparse supervision with real-world ground truth (e.g., NYU, KITTI, A2D2).
- Self-supervised: Photometric consistency (Godard et al.), geometric proxy losses, multi-task constraints (GeoNet).
- Semi-supervised & Pseudo-supervised: Using robust monocular DFMs to generate dense pseudo-labels for sparsely labeled or label-free settings (Liang et al., 21 Mar 2025, Zhu et al., 16 Apr 2025).
- Contrastive & Domain Adaptation: Matching feature distributions across real/synthetic domains to bridge the sim-to-real gap.
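The self-supervised objective listed above can be illustrated on a single rectified stereo scanline: the right image is warped into the left view using the predicted disparity, and the reconstruction error serves as the training signal. This is a deliberately reduced sketch that keeps only an L1 term; published methods typically add further terms such as structural similarity and smoothness. All names and values here are assumptions for illustration.

```python
def sample_linear(row, x):
    """Linearly interpolate row at fractional position x, clamping to the borders."""
    x = min(max(x, 0.0), len(row) - 1.0)
    x0 = int(x)
    x1 = min(x0 + 1, len(row) - 1)
    w = x - x0
    return (1.0 - w) * row[x0] + w * row[x1]


def photometric_l1(left, right, disparity):
    """Mean L1 error between the left scanline and the right scanline warped to it.

    For rectified stereo, left pixel x corresponds to right pixel x - disparity[x].
    """
    if not (len(left) == len(right) == len(disparity)) or not left:
        raise ValueError("inputs must be non-empty and equally long")
    total = 0.0
    for x, d in enumerate(disparity):
        reconstructed = sample_linear(right, x - d)
        total += abs(left[x] - reconstructed)
    return total / len(left)


# Hypothetical scanlines: the left view is the right view shifted by 2 pixels.
right = [0.0, 10.0, 20.0, 30.0, 40.0, 50.0]
left = [0.0, 0.0, 0.0, 10.0, 20.0, 30.0]

loss_correct = photometric_l1(left, right, [2.0] * 6)
loss_wrong = photometric_l1(left, right, [0.0] * 6)
```

The correct disparity reconstructs the left scanline exactly, while a zero disparity leaves a large residual, which is the gradient signal a network would learn from.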
Data Regimes
DFM pretraining exploits both synthetic and real large-scale datasets:
| Dataset | Scale / Type | Notable Properties |
|---|---|---|
| ARKitScenes | 5K scenes, 450M frames | Dense LiDAR, consumer device |
| ScanNet++ | 1.8K scenes, 3.7M frames | Dense RGB-D, indoor |
| A2D2, Argoverse2 | 10 frames | Sparse LiDAR, driving |
| Objaverse/MVImgNet | 200–300M images | Synthetic, multi-view object |
| Bedlam, PointOdyssey | 10 scenes | Synthetic dynamic scenes |
Scaling laws are pronounced: 2015-era models used roughly 10k images and tens of millions of parameters, whereas by 2024 DFMs routinely exceed a billion parameters and tens of millions of images. Larger models show reduced error and improved zero-shot transfer, mirroring trends in natural language processing (Xu et al., 15 Jul 2025, Li et al., 21 Jul 2025).
4. Empirical Performance and Evaluation
DFMs have historically been benchmarked with metrics computed after scale/shift alignment and, more recently, with proxy-task evaluation that better reflects real-world deployment (Li et al., 21 Jul 2025).
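As a rough illustration of alignment-based evaluation, the sketch below aligns a relative prediction to ground truth with median scaling (a scale-only relative of full scale/shift alignment) and then computes two common depth metrics, absolute relative error and the fraction of pixels with ratio below 1.25. The function names and sample values are assumptions for this example and do not come from any cited benchmark.

```python
from statistics import median


def median_scale(pred, gt):
    """Scale factor that matches the median of pred to the median of gt."""
    return median(gt) / median(pred)


def abs_rel_error(pred, gt):
    """Mean of |pred - gt| / gt over all valid pixels."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels where max(pred/gt, gt/pred) is below the threshold."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)


# Hypothetical ground truth (metres) and a relative prediction with one bad pixel.
gt = [1.0, 2.0, 4.0, 8.0]
pred_relative = [0.5, 1.0, 2.0, 6.0]

scale = median_scale(pred_relative, gt)
pred_aligned = [scale * p for p in pred_relative]

error = abs_rel_error(pred_aligned, gt)
accuracy = delta_accuracy(pred_aligned, gt)
```

Because the alignment itself is estimated from the prediction, a few outliers can change the fitted scale and therefore every per-pixel error, which is one reason alignment-based scores can be fragile.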
Proxy-Task Evaluation
BenchDepth proposes direct proxy-task evaluation in five settings, each measuring DFM efficacy for an actual application:
| Task | Relevance/Method | Highlighted Findings |
|---|---|---|
| Depth Completion | Fill dense map from sparse LiDAR + RGB | DAV2-Rel leads, surpassing “no-depth” and metric DFMs |
| Stereo Matching | Rectified pair → disparity map | Affine-invariant DFMs ↑ zero-shot generalization |
| 3D Reconstruction | 3DGS from single RGB + DFM depth | Metric–fine-tuned DFMs excel in view synthesis |
| SLAM | Trajectory + dense mapping from video | Relative DFMs (DAV2-Rel) enable improved mapping |
| Vision–Language | Spatial QA with image/depth | All DFMs yield similar gains, exposing VLM limits |
Representative Empirical Results
- Ranftl et al. (DPT) achieved sub-5% relative error zero-shot across 10 benchmarks (Xu et al., 15 Jul 2025).
- Metric3D v2, UniDepth: sub-0.5m absolute error outdoors, no fine-tuning.
- DEFOM-Stereo: first place on KITTI, Middlebury, and ETH3D stereo leaderboards (Jiang et al., 16 Jan 2025).
- DFMs outperform prior art in real-world zero-shot canopy height estimation while using only a fraction of the parameters and minimal compute (Cambrin et al., 2024).
Limitations of Alignment-Based Evaluation
Standard metric alignment exaggerates differences between representation types (depth/disparity/point map), is sensitive to outliers, and favors over-smoothed predictions. Proxy-task evaluation circumvents these issues by focusing on end-task utility (Li et al., 21 Jul 2025).
5. Specialized Domains and Cross-Modal Transfer
DFMs have been tailored for or leveraged in specialized domains:
- Surgical and Medical Vision: Fine-tuned or adapted ViT-based DFMs (e.g., Surgical-DINO using LoRA, Surgical Depth Anything) achieve SOTA depth estimation in endoscopic scenes, correcting for domain shift and specular artifacts (Cui et al., 2024, Lou et al., 2024, Han et al., 26 Jan 2026).
- Robotics: Self-supervised depth-only DFMs (DeFM) extract universal geometric features from 60M depth images, facilitating sim-to-real transfer across navigation, manipulation, and segmentation, and can be distilled into compact variants for onboard deployment (Patel et al., 26 Jan 2026).
- Depth Completion: DFM-derived priors provide dense geometric pseudo-supervision for LiDAR completion, removing the scale ambiguity and outperforming fully supervised baselines, even out of distribution (Liang et al., 21 Mar 2025, Chen et al., 7 Aug 2025).
- Multi-View and Stereo: Integration of DFM priors as pseudo-labels, disparity initializers, or supervision significantly boosts unlabelled MVS and zero-shot stereo (Zhu et al., 16 Apr 2025, Jiang et al., 16 Jan 2025).
6. Applications and Broader Impact
DFMs underpin a broad array of vision tasks:
- 3D Reconstruction: High-fidelity point clouds and volumetric mappings for AR/VR, SLAM, and scene reconstruction (SimpleRecon, NeuRIS).
- Novel View Synthesis: DFM priors guide radiance field construction (NeRF) and Gaussian Splatting, improving rendering convergence and fidelity.
- Robotics/Autonomous Driving: Camera-only 3D perception for real-time mapping, navigation, and collision avoidance at commodity cost.
- World Modeling in Video: Video diffusion DFMs encode 3D inductive biases for future-frame synthesis and policy planning.
Additionally, DFMs dramatically increase data efficiency (e.g., MultiMAE in surgical tasks achieves higher accuracy with only 25% of labels) and enable plug-and-play downstream transfer via frozen backbones (Han et al., 26 Jan 2026).
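For the 3D reconstruction use case listed above, the basic step that turns a predicted depth map into a point cloud is pinhole back-projection: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy. The sketch below assumes metric depth measured along the optical axis and known intrinsics; the function name, the tiny depth map, and the intrinsic values are hypothetical.

```python
def unproject_depth(depth, fx, fy, cx, cy):
    """Back-project a depth map (list of rows) to camera-frame 3D points.

    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# Hypothetical 2x2 depth map in metres with one invalid pixel.
depth_map = [
    [2.0, 2.0],
    [0.0, 4.0],
]

point_cloud = unproject_depth(depth_map, fx=2.0, fy=2.0, cx=0.5, cy=0.5)
```

A relative (non-metric) prediction would first need a scale, for example from sparse measurements, before the resulting points are geometrically meaningful.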
7. Outstanding Challenges and Future Directions
Despite strong performance, DFMs face several technical and epistemological barriers (Xu et al., 15 Jul 2025, Tan et al., 21 Apr 2025):
- Data scale/diversity: Gathering or synthesizing large-scale, high-fidelity ground-truth remains costly; advances in self-supervision and synthetic realism are needed.
- Spatial and temporal consistency: Open questions remain regarding unified, volumetric representation across monocular/multi-view/video inputs.
- Multi-task Learning: Extending DFMs to predict normals, semantics, optical flow alongside depth—in a single model analogous to GPT-style LLMs—is an active frontier.
- Geometric Inductive Biases: Reconciling brute-force scaling with architecturally-imposed geometric priors (e.g., epipolar attention, neural implicit representations) is underexplored.
- Explainability: Intrinsic barriers due to scale, nonlinearity, and high-dimensional data dependence limit mechanistic interpretability; explainability may need to shift toward empirical testing and behavioral assurances (Tan et al., 21 Apr 2025).
As DFMs increase in scale and diversity, the prospect emerges of universal, “visual LiDAR”–level scene understanding from a single camera, transforming applications in world modeling, AR/VR, medical vision and robotic autonomy (Xu et al., 15 Jul 2025, Li et al., 21 Jul 2025, Patel et al., 26 Jan 2026).