Depth Anything v2: Monocular Depth Estimation
- Depth Anything v2 is a foundation monocular depth estimation system that uses a discriminative ViT architecture and synthetic data with teacher–student pseudo-labeling to produce fine-grained, metrically accurate depth maps.
- The methodology incorporates scale- and shift-invariant loss functions along with gradient matching to ensure edge fidelity and enable superior zero-shot generalization and metric recovery.
- Evaluation benchmarks demonstrate that Depth Anything v2 outperforms diffusion-based and earlier discriminative models, achieving real-time inference, high accuracy, and broad domain adaptability.
Depth Anything v2 is a foundation monocular depth estimation (MDE) system designed to produce robust, fine-grained, and metrically accurate depth maps from single RGB images across diverse real-world scenarios. Unlike prior approaches that rely on noisy real-world labels or heavy generative modeling, Depth Anything v2 uses a discriminative vision transformer (ViT) architecture trained on synthetic ground-truth sources and scaled up through a teacher–student pseudo-labeling framework. The result is a suite of models that outperform diffusion-based and earlier discriminative baselines in both efficiency and universal applicability, with demonstrated advances in zero-shot generalization, metric depth recovery, and modular adaptability via priors and prompt-based fusion (Yang et al., 13 Jun 2024).
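For orientation, a minimal inference sketch using the Hugging Face Transformers depth-estimation pipeline is shown below; the checkpoint identifier is an assumption and should be replaced with whichever official Depth Anything V2 release is available in your environment.

```python
# Minimal usage sketch (not from the paper): running a Depth Anything V2 checkpoint
# through the generic Transformers depth-estimation pipeline. The model id below
# is an assumption; substitute the official checkpoint available in your setup.
from transformers import pipeline
from PIL import Image

depth_estimator = pipeline(
    "depth-estimation",
    model="depth-anything/Depth-Anything-V2-Small-hf",  # assumed checkpoint id
)
result = depth_estimator(Image.open("example.jpg"))
result["depth"].save("example_depth.png")  # PIL image of the predicted relative depth
```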
1. Core Training Strategy: Synthetic Data and Teacher–Student Framework
Depth Anything v2 replaces all labeled real images with photorealistic synthetic data for initial teacher model training. Five synthetic datasets—BlendedMVS, Hypersim, IRS, TartanAir, VKITTI2—provide precise ground-truth depths encompassing thin structures and challenging materials.
A ViT-Giant encoder (DINOv2-G, 1.28B params) paired with a DPT decoder forms the teacher; this model trains on synthetic-only data with the scale- and shift-invariant (MiDaS-style) loss

$$\mathcal{L}_{\mathrm{ssi}} = \frac{1}{HW} \sum_{i=1}^{HW} \left| \hat{d}_i - \hat{d}_i^{*} \right|,$$

where $\hat{d}_i$ and $\hat{d}_i^{*}$ denote the predicted and ground-truth disparities after per-image alignment to zero shift and unit scale, augmented by a multi-scale gradient-matching term for edge fidelity,

$$\mathcal{L}_{\mathrm{gm}} = \frac{1}{HW} \sum_{k=1}^{K} \sum_{i} \left( \left| \nabla_x R_i^{k} \right| + \left| \nabla_y R_i^{k} \right| \right), \qquad R_i = \hat{d}_i - \hat{d}_i^{*}.$$
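As a concrete reference, the following PyTorch-style sketch implements MiDaS-style scale- and shift-invariant and gradient-matching losses of the kind described above; the normalization choice (median/mean-absolute-deviation alignment) and the number of scales are illustrative assumptions, not the paper's exact hyperparameters.

```python
import torch

def normalize_disparity(d, mask, eps=1e-6):
    """Per-image alignment to zero median and unit mean absolute deviation.

    d:    (B, H, W) float disparities
    mask: (B, H, W) bool valid-pixel mask
    """
    t = torch.stack([di[mi].median() for di, mi in zip(d, mask)])                    # (B,)
    s = torch.stack([(di[mi] - ti).abs().mean() for di, mi, ti in zip(d, mask, t)])  # (B,)
    return (d - t.view(-1, 1, 1)) / s.view(-1, 1, 1).clamp(min=eps)

def ssi_loss(pred, gt, mask):
    """Scale- and shift-invariant L1 loss on normalized disparities."""
    p, g = normalize_disparity(pred, mask), normalize_disparity(gt, mask)
    return ((p - g).abs() * mask).sum() / mask.sum().clamp(min=1)

def gradient_matching_loss(pred, gt, mask, scales=4):
    """Multi-scale gradient-matching term that encourages sharp depth edges."""
    p, g = normalize_disparity(pred, mask), normalize_disparity(gt, mask)
    loss = 0.0
    for k in range(scales):
        step = 2 ** k
        r, m = (p - g)[:, ::step, ::step], mask[:, ::step, ::step]
        gx = (r[:, :, 1:] - r[:, :, :-1]).abs() * (m[:, :, 1:] & m[:, :, :-1])
        gy = (r[:, 1:, :] - r[:, :-1, :]).abs() * (m[:, 1:, :] & m[:, :-1, :])
        loss = loss + (gx.sum() + gy.sum()) / m.sum().clamp(min=1)
    return loss
```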
Student models of varying scale (ViT-Small, Base, Large, Giant) are then trained solely on pseudo-labeled real images: the teacher annotates 62M frames spanning BDD100K, ImageNet-21K, LSUN, Objects365, OpenImages, Places365, Google Landmarks, and SA-1B. This pseudo-labeling "bridge" closes the synthetic-to-real domain gap and distills robust depth prediction into the compressed student variants (Yang et al., 13 Jun 2024).
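A hedged sketch of the pseudo-labeling stage is given below, reusing the loss functions sketched above: a frozen teacher annotates unlabeled real frames and a smaller student regresses those pseudo-labels. The loss weighting and the all-valid mask are assumptions for illustration.

```python
import torch

@torch.no_grad()
def pseudo_label(teacher, images):
    """Frozen teacher produces dense relative-depth pseudo-labels for real images."""
    teacher.eval()
    return teacher(images)

def distill_step(student, teacher, images, optimizer, gm_weight=0.5):
    """One student update on teacher pseudo-labels (gm_weight is an assumed value)."""
    targets = pseudo_label(teacher, images)
    preds = student(images)
    mask = torch.ones_like(targets, dtype=torch.bool)   # pseudo-labels cover every pixel
    # ssi_loss / gradient_matching_loss: see the loss sketch earlier in this section
    loss = ssi_loss(preds, targets, mask) + gm_weight * gradient_matching_loss(preds, targets, mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```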
2. Model Architecture and Loss Formulations
Both teacher and student employ a ViT backbone with patch embedding, multi-head self-attention, and residual feed-forward MLP blocks; depth decoding uses hierarchical up-sampling and multi-scale fusion (DPT). For metric depth estimation, models are fine-tuned on datasets with absolute scale (NYU, KITTI, Hypersim) using a direct L1 regression loss,

$$\mathcal{L}_{\mathrm{metric}} = \frac{1}{N} \sum_{i=1}^{N} \left| d_i - d_i^{*} \right|,$$

with an optional photometric-consistency term when paired views are available. Metric scale is recovered by applying a fixed, learned scaling parameter to the predicted relative depth, without requiring reference objects at test time (Niccoli et al., 6 Oct 2025).
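The sketch below illustrates one way such a fixed, learned scale can be attached to a frozen relative-depth predictor and fit with the L1 metric loss above; the log-parameterization and module name are assumptions rather than the cited method's exact form.

```python
import torch
import torch.nn as nn

class MetricScaleHead(nn.Module):
    """Single learned scale mapping relative depth to metric depth (illustrative only)."""

    def __init__(self, init_scale=1.0):
        super().__init__()
        # log-parameterize so the scale stays positive during optimization
        self.log_scale = nn.Parameter(torch.tensor(float(init_scale)).log())

    def forward(self, rel_depth):
        # rel_depth: (B, H, W) relative depth from the frozen MDE backbone
        return self.log_scale.exp() * rel_depth       # metric depth (e.g. metres)

def l1_metric_loss(pred_metric, gt_metric, valid):
    """Direct L1 regression on absolute depth over valid ground-truth pixels."""
    return ((pred_metric - gt_metric).abs() * valid).sum() / valid.sum().clamp(min=1)
```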
3. Fine-Grained Generalization and Benchmark Performance
Depth Anything v2 achieves state-of-the-art performance on relative and metric benchmarks with efficient inference:
| Model | AbsRel ↓ | δ₁ ↑ | Speed (ms/img) | #Params (M) |
|---|---|---|---|---|
| DA V2 ViT-L (rel) | 0.074 | 0.946 | 10 | 304 |
| DA V2 ViT-L (metric) | 0.045–0.056 | 0.983–0.984 | 10–220 | 304 |
| Generative SD models | >0.08 | <0.90 | 100–300 | >8,000 |
Median-based depth extraction is preferred for robust camera-trap wildlife monitoring (MAE = 0.454 m), outperforming ZoeDepth and geometric baselines in natural outdoor scenes. Real-time throughput is attainable on consumer GPUs, with smaller student models offering a speed–accuracy tradeoff and backward compatibility as new monocular predictors emerge (Niccoli et al., 6 Oct 2025).
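A minimal sketch of median-based depth read-out for a detected animal is shown below, assuming a detector supplies a binary mask over the animal and `depth` is the model's metric depth map; taking the median inside the mask suppresses background pixels and boundary outliers.

```python
import numpy as np

def animal_distance(depth: np.ndarray, mask: np.ndarray) -> float:
    """Median metric depth (metres) over a detection mask (robust to outliers)."""
    values = depth[mask.astype(bool)]
    if values.size == 0:
        raise ValueError("empty detection mask")
    return float(np.median(values))
```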
4. Robustness, Domain Adaptation, and Auxiliary Priors
Depth Anything v2 and its derivatives address hard open-world settings, such as adverse weather, corrupted sensors, and medical/surgical domains:
- DepthAnything-AC employs unsupervised consistency regularization and a spatial distance constraint. The combined loss penalizes prediction disparities under heavy augmentation, maintaining zero-shot robustness in night/fog/rain and preserving semantic boundaries under perturbation (Sun et al., 2 Jul 2025); a minimal consistency-loss sketch follows this list.
- Prior Depth Anything provides a universal coarse-to-fine prior fusion pipeline by pixel-level metric alignment and distance-weighted filling, generalizing to mixed, unseen, or arbitrary metric priors (sparse points, downsampled grids, holes) (Wang et al., 15 May 2025).
- Prompt Depth Anything injects multi-scale LiDAR prompts for accurate metric depth up to 4K resolution and achieves state-of-the-art results on ARKitScenes and ScanNet++. Prompt fusion is realized via scale-specific convolutional embedding and additive feature blending within the decoder (Lin et al., 18 Dec 2024).
- Event-based distillation adapts Depth Anything v2 to monocular event cameras via cross-modal proxy label generation and ConvLSTM recurrence, achieving competitive performance without ground-truth depth (Bartolomei et al., 18 Sep 2025).
- SRFT-GaLore adaptation enables efficient transformer fine-tuning for high-dimensional medical segmentation (liver landmarks) via subsampled randomized Fourier projection and cross-attention RGB–depth feature fusion, yielding empirical gains in Dice similarity and surface distance (Lin et al., 5 Nov 2025).
- Low-rank adaptation strategies (RVLoRA, Vector-LoRA, EndoDAC) minimize catastrophic forgetting and tune only a small fraction of parameters for surgical/endoscopic domains, yielding superior SCARED and Hamlyn metrics at a tiny trainable-parameter count (Li et al., 12 Sep 2024, Zeinoddin et al., 30 Aug 2024); a generic LoRA sketch follows this list.
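Below is a minimal sketch of unsupervised consistency regularization in the spirit of DepthAnything-AC: predictions on a heavily corrupted view are pulled toward predictions on the clean view. The corruption function, stop-gradient choice, and weighting are assumptions, not the paper's exact formulation.

```python
import torch

def consistency_loss(model, images, corrupt, weight=1.0):
    """Penalize disagreement between clean-view and corrupted-view predictions."""
    with torch.no_grad():
        clean_depth = model(images)              # treated as a fixed target (stop-gradient)
    corrupted_depth = model(corrupt(images))     # e.g. simulated fog, low light, noise
    return weight * (corrupted_depth - clean_depth).abs().mean()
```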
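And a generic low-rank adaptation (LoRA) layer of the kind these surgical adapters build on; RVLoRA, Vector-LoRA, and EndoDAC differ in their specific parameterizations, so this is an illustrative baseline rather than any of their exact methods.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen pretrained linear layer with a trainable low-rank residual."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():           # pretrained weights stay frozen
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero-init: identity at start
        self.scaling = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```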
5. Evaluation Benchmarks and Zero-Shot Transfer
A new DA-2K evaluation set with 2,000 annotated pixel pairs in 1,000 high-resolution images tests relative-depth ordering across indoor, outdoor, transparent, adverse, aerial, underwater, object-centric, and synthetic ("non-real") scenes. DA V2 surpasses legacy and SD-based models (a sketch of the pair-ordering protocol follows the table):
| Scenario | DA V2 Accuracy (%) |
|---|---|
| Indoor | 96.4 |
| Outdoor | 93.9 |
| Transparent | 96.3 |
| Adverse | 97.3 |
| Non-real | 99.0 |
| Aerial | 99.5 |
| Underwater | 99.2 |
| Object-centric | 98.0 |
| Mean | 97.1 |
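The sparse pixel-pair protocol behind these accuracy numbers can be evaluated in a few lines; the sketch below assumes each annotation lists the closer point first and that the model outputs depth (smaller = closer), conventions chosen here for illustration.

```python
import numpy as np

def pair_accuracy(depth: np.ndarray, pairs) -> float:
    """Fraction of annotated pairs whose predicted depth ordering matches the label.

    pairs: list of ((y1, x1), (y2, x2)) where the first point is annotated as closer.
    """
    pairs = list(pairs)
    if not pairs:
        return 0.0
    correct = sum(bool(depth[y1, x1] < depth[y2, x2]) for (y1, x1), (y2, x2) in pairs)
    return correct / len(pairs)
```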
V2’s robustness extends to AR overlays, robotic 3D reconstruction, grasping of transparent/specular objects, and edge-preserving tissue recovery in medical imaging (Yang et al., 13 Jun 2024, Lin et al., 18 Dec 2024, Lin et al., 5 Nov 2025).
6. Efficiency, Limitations, and Future Perspectives
Depth Anything v2 models require substantial pre-training compute (∼100 GPU-days for the full pipeline), but inference is an order of magnitude faster than diffusion-based systems. The training strategy mitigates the synthetic-to-real domain gap, but coverage of underrepresented scenes (humans, underwater, AIGC imagery) remains bounded by existing synthetic datasets. Current limitations include residual errors on thin and reflective structures and the need for heavy augmentation to achieve robustness in adverse domains.
Plausible future directions suggested by the authors include expanded synthetic pre-training, integration of multi-modal cues (surface normals, semantics), a formal study of test-time resolution scaling, and curriculum/active learning over vast unlabeled sets (Yang et al., 13 Jun 2024).
7. Conclusion
Depth Anything v2 establishes a discriminative MDE foundation capable of universal deployment, modular adaptation, and high-resolution metric depth with minimal overhead. Its architecture, training pipeline, and derived systems set efficiency and accuracy baselines for monocular depth estimation, fusion with priors/prompts, and downstream transfer to environmental, robotics, and medical segmentation tasks (Yang et al., 13 Jun 2024, Niccoli et al., 6 Oct 2025, Lin et al., 18 Dec 2024, Sun et al., 2 Jul 2025, Wang et al., 15 May 2025, Lin et al., 5 Nov 2025, Li et al., 12 Sep 2024, Zeinoddin et al., 30 Aug 2024, Bartolomei et al., 18 Sep 2025).