DA-V2: Advanced Monocular Depth Estimation
- Depth Anything V2 (DA-V2) is a monocular depth estimation model that leverages synthetic data and pseudo-labels to generate highly accurate and robust depth maps.
- It employs a three-stage training pipeline, including a synthetic-trained teacher, large-scale pseudo-labeling on 62M real images, and student model distillation.
- DA-V2 achieves state-of-the-art zero-shot and fine-tuned performance, demonstrating scalability, efficiency, and practical applicability across various benchmarks.
Depth Anything V2 (DA-V2) is a monocular depth estimation foundation model that achieves state-of-the-art performance by leveraging pure synthetic supervision, large-scale pseudo-labeled real images, and a high-capacity transformer backbone. DA-V2 marks a substantial advancement over prior monocular estimators by producing accurate, robust, and visually coherent depth maps entirely without reliance on real-world labeled data or diffusion-based inference. The model is architected for scalability, efficiency, and generalization, with broad applicability in both zero-shot and fine-tuned settings across vision, robotics, and generative tasks.
1. Pipeline Overview and Motivation
DA-V2 uses a three-stage architecture and training regime specifically designed to eliminate the need for costly real-world depth annotations:
- Teacher Model Training on Synthetic Data: All labeled supervision is replaced with synthetic RGB-depth datasets—BlendedMVS, Hypersim, IRS, TartanAir, and Virtual KITTI 2—providing 595K perfectly aligned pairs. The teacher uses a very large vision transformer (ViT-G, 1.3B parameters) with a DPT-style decoder, trained with scale- and shift-invariant (MiDaS) and gradient-matching losses.
- Large-Scale Pseudo-Labeling of Real Images: The synthetic-trained teacher is run on 62 million real, unlabeled images sampled from eight diverse datasets (e.g., ImageNet-21K, LSUN, SA-1B). The teacher produces inverse depth pseudo-labels, filtering out the 10% most unreliable pixels per sample, thus generating a massive, automatically annotated dataset with broad scene diversity.
- Student Model Distillation on Pseudo-Labeled Data: Specialized student models (ViT-Small, ViT-Base, ViT-Large, and ViT-Giant; 25M–1.3B parameters) are distilled from the teacher on the pseudo-labeled real images. Core to this stage is the use of MiDaS losses and a feature alignment term that preserves rich semantic information from the DINOv2 backbone.
This synthetic → pseudo-labeled real → distilled student training path obviates the need for labeled real depth while delivering high-fidelity predictions across diverse in-the-wild inputs (Yang et al., 13 Jun 2024).
2. Model Architecture and Losses
Encoder–Decoder Backbone
- Encoder: DINOv2-ViT (S/B/L/G), pre-trained on unlabeled data for extensive semantic priors.
- Decoder: DPT-style four-stage multi-scale fusion, upsampling, and convolution, permitting high-resolution, sharp depth maps at full image size.
- Depth Head: Single 1×1 convolution outputs a dense per-pixel depth field.
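To make this layout concrete, below is a minimal PyTorch sketch of the encoder–decoder assembly: a DINOv2 ViT-S/14 encoder loaded from torch.hub, a tiny multi-scale fusion module standing in for the full DPT decoder, and a single 1×1 convolution depth head. The class names and the simplified decoder are illustrative assumptions, not the released DA-V2 implementation.

```python
# Illustrative sketch of the encoder-decoder layout described above; a simplified
# stand-in for the official Depth Anything V2 code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyDPTHead(nn.Module):
    """Minimal multi-scale fusion decoder (a stand-in for the DPT decoder)."""

    def __init__(self, in_dim: int, hidden: int = 128):
        super().__init__()
        self.proj = nn.ModuleList([nn.Conv2d(in_dim, hidden, 1) for _ in range(4)])
        self.fuse = nn.Conv2d(hidden, hidden, 3, padding=1)
        self.depth_head = nn.Conv2d(hidden, 1, 1)  # single 1x1 conv -> per-pixel depth

    def forward(self, feats, out_hw):
        # Project each transformer stage, upsample to full resolution, and sum.
        fused = 0
        for f, proj in zip(feats, self.proj):
            fused = fused + F.interpolate(proj(f), size=out_hw, mode="bilinear", align_corners=False)
        return self.depth_head(F.relu(self.fuse(fused)))


class MonoDepthModel(nn.Module):
    def __init__(self):
        super().__init__()
        # DINOv2 ViT-S/14 via torch.hub (downloads weights on first call).
        self.encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
        self.decoder = TinyDPTHead(in_dim=384)  # ViT-S/14 embedding dimension

    def forward(self, x):
        # Four intermediate transformer stages, reshaped to (B, C, H/14, W/14).
        feats = self.encoder.get_intermediate_layers(x, n=4, reshape=True)
        return self.decoder(feats, out_hw=x.shape[-2:])


# Usage (input sides should be multiples of the 14-pixel patch size):
# depth = MonoDepthModel()(torch.randn(1, 3, 518, 518))  # -> (1, 1, 518, 518)
```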
Training Losses
- Scale- and Shift-Invariant Loss (MiDaS):

  $$\mathcal{L}_{\mathrm{ssi}} = \frac{1}{HW}\sum_{i=1}^{HW}\left|\hat{d}_i - \hat{d}^{*}_i\right|$$

  where $\hat{d}_i$ and $\hat{d}^{*}_i$ denote the predicted and pseudo-labeled inverse depths after per-image alignment (the shift is subtracted and the scale normalized) to enforce scale- and shift-invariance.
- Gradient-Matching Loss:

  $$\mathcal{L}_{\mathrm{gm}} = \frac{1}{HW}\sum_{i=1}^{HW}\Big(\big|\nabla_x(\hat{d}_i - \hat{d}^{*}_i)\big| + \big|\nabla_y(\hat{d}_i - \hat{d}^{*}_i)\big|\Big)$$

  promoting sharpness and boundary integrity.
- Feature Alignment Loss:

  $$\mathcal{L}_{\mathrm{feat}} = 1 - \frac{1}{HW}\sum_{i=1}^{HW}\cos\!\big(f_i,\, f'_i\big)$$

  where $f_i$ are student encoder features and $f'_i$ the corresponding frozen DINOv2 features, leveraging mid-level feature similarity to transfer teacher semantics.

Student training minimizes the weighted sum $\mathcal{L} = \mathcal{L}_{\mathrm{ssi}} + \lambda_{\mathrm{gm}}\mathcal{L}_{\mathrm{gm}} + \lambda_{\mathrm{feat}}\mathcal{L}_{\mathrm{feat}}$, with weighting coefficients $\lambda_{\mathrm{gm}}$ and $\lambda_{\mathrm{feat}}$ (Yang et al., 13 Jun 2024).
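A minimal PyTorch sketch of these three losses follows; the median/MAD alignment, masking interface, and function names are our own illustrative choices and may differ in detail from the released training code.

```python
# Illustrative PyTorch sketch of the three training losses described above
# (scale/shift-invariant, gradient-matching, feature-alignment). Names and exact
# normalization are assumptions, not the official implementation.
import torch
import torch.nn.functional as F


def align(d: torch.Tensor) -> torch.Tensor:
    """Per-image shift/scale alignment of inverse depth, shape (B, H, W)."""
    b = d.shape[0]
    flat = d.reshape(b, -1)
    shift = flat.median(dim=1, keepdim=True).values
    scale = (flat - shift).abs().mean(dim=1, keepdim=True).clamp(min=1e-6)
    return ((flat - shift) / scale).reshape_as(d)


def ssi_loss(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Scale- and shift-invariant (MiDaS-style) loss on valid pixels."""
    diff = (align(pred) - align(target)).abs()
    return (diff * mask).sum() / mask.sum().clamp(min=1)


def gradient_matching_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Penalize spatial gradients of the residual to keep depth edges sharp."""
    res = align(pred) - align(target)
    gx = (res[:, :, 1:] - res[:, :, :-1]).abs()
    gy = (res[:, 1:, :] - res[:, :-1, :]).abs()
    return gx.mean() + gy.mean()


def feature_alignment_loss(student_feat: torch.Tensor, dino_feat: torch.Tensor) -> torch.Tensor:
    """1 - mean cosine similarity between student and frozen DINOv2 features, shape (B, N, C)."""
    return 1.0 - F.cosine_similarity(student_feat, dino_feat, dim=-1).mean()
```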
3. Data Strategy: Synthetic Pretraining and Pseudo-Label Scaling
DA-V2’s primary innovation is achieving robust generalization using only synthetic and pseudo-labeled real data:
- Synthetic Data (595K images): Each source provides photorealistic RGB/depth pairs rendered from 3D scenes, with exact alignment and complete coverage of fine structures.
- Real-World Pseudo-Labels (62M images): True depth in real images is approximated by teacher predictions, after scale rectification and outlier filtering. The student models never see real labeled depth, yet learn directly on real-world data statistics and structure.
- Noise Mitigation: Masking high-loss pixels in the pseudo labels, together with feature alignment, counteracts label outliers and distributional drift.
The result is a student capable of state-of-the-art zero-shot and fine-tuned depth inference in real, unconstrained scenarios.
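The pixel-masking idea is simple to sketch. Below is an illustrative PyTorch helper that, given a per-pixel unreliability score (here, a per-pixel loss against the pseudo label), drops the 10% highest-scoring pixels of each sample; the quantile-based thresholding is our assumption of how the rule might be implemented, not the official code.

```python
# Minimal sketch of the top-10% pseudo-label masking described above.
import torch


def reliability_mask(pixel_loss: torch.Tensor, drop_ratio: float = 0.10) -> torch.Tensor:
    """Return a {0,1} mask dropping the `drop_ratio` highest-loss pixels per image.

    pixel_loss: (B, H, W) per-pixel loss or unreliability score for a pseudo-labeled sample.
    """
    b = pixel_loss.shape[0]
    flat = pixel_loss.reshape(b, -1)
    # Per-image threshold at the (1 - drop_ratio) quantile.
    thresh = torch.quantile(flat, 1.0 - drop_ratio, dim=1, keepdim=True)
    return (flat <= thresh).float().reshape_as(pixel_loss)


# Usage: masked mean of the per-pixel loss.
# mask = reliability_mask(pixel_loss)
# loss = (pixel_loss * mask).sum() / mask.sum().clamp(min=1)
```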
4. Empirical Performance and Benchmarks
DA-V2 sets new standards on both zero-shot and task-specific depth estimation:
- Zero-Shot Relative Depth (DA-2K benchmark): DA-V2-L achieves 97.1% accuracy; prior Stable Diffusion-based methods such as Marigold and GeoWizard reach ≤88.1%.
- Absolute Metric Depth (fine-tuned): On NYUv2 and KITTI,
  - NYUv2: AbsRel 0.056, RMSE 0.206, δ₁ 0.984
  - KITTI: AbsRel 0.046, RMSE 1.896, δ₁ 0.982
- These results surpass in-domain baselines and diffusion-based competitors (Yang et al., 13 Jun 2024).
- Speed: DA-V2-Small delivers real-time throughput (≤30 ms per image on an RTX 3090), 10–15× faster than Marigold and other diffusion-based methods.
- Qualitative: Outputs are sharper, less noisy, and more robust to challenging content (thin structures, unusual styles, adverse weather) than previous MiDaS-style networks.
A summary table for DA-2K evaluation:
| Model | DA-2K Accuracy (%) |
|---|---|
| DA-V2-Small | 95.3 |
| DA-V2-Base | 97.0 |
| DA-V2-Large | 97.1 |
| DA-V2-Giant | 97.4 |
| Marigold | 86.8 |
| GeoWizard | 88.1 |
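For reference, the metrics reported above follow the standard monocular-depth definitions; a small NumPy sketch is given below, assuming strictly positive metric depth arrays with invalid pixels already removed (the function name is ours).

```python
# Standard metric-depth evaluation measures (AbsRel, RMSE, delta_1).
import numpy as np


def depth_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    pred, gt = pred.ravel(), gt.ravel()
    absrel = np.mean(np.abs(pred - gt) / gt)      # mean relative error |d - d*| / d*
    rmse = np.sqrt(np.mean((pred - gt) ** 2))     # root mean squared error
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)                # fraction of pixels within 1.25x
    return {"AbsRel": absrel, "RMSE": rmse, "delta_1": delta1}
```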
5. Design Choices, Ablations, and Limitations
Ablation studies confirm:
- Synthetic-only teacher without pseudo-labeled real data transfers poorly (KITTI AbsRel 0.104 vs. 0.078 with pseudo-real).
- Stronger gradient-matching loss directly improves thin-structure fidelity.
- Students trained with only pseudo labels often outperform those mixing synthetic and pseudo-real images.
- Speed/accuracy trade-offs are directly controlled by model scale.
Known limitations:
- Despite scale invariance, local scale ambiguity remains for challenging scenes (e.g., strong reflectors).
- The pipeline does not render its own synthetic data; it depends on existing synthetic sources that provide ground-truth depth.
- Pseudo-label quality is regulated primarily by masking, so residual teacher errors can still propagate for unusual content.
6. Applications and Extensions
DA-V2 is foundational for numerous compositional and generative pipelines:
- Prompted and Prior-Guided Depth: Used as a backbone for LiDAR-prompted 4K metric estimation (“Prompt Depth Anything”) (Lin et al., 18 Dec 2024), hybrid metric-aligned super-resolution (Wang et al., 15 May 2025), and uncertainty-enhanced sensor fusion (Jun et al., 5 Jun 2025).
- Panoramic and Any-View Depth: Adapted for panoramic, any-direction depth estimation (DA²) and any-view depth estimation (DA3) (Li et al., 30 Sep 2025, Lin et al., 13 Nov 2025).
- Downstream Tasks: Used for 3D-consistent ControlNet guidance, vision-language symbolic reasoning (DAM+SAM+GPT-4V), and as an implicit prior for segmentation and canopy estimation in remote sensing (Huo et al., 7 Jun 2024, Cambrin et al., 8 Aug 2024, Zheng et al., 3 Feb 2024).
- Efficient Scaling: Available as off-the-shelf models from 25M to 1.3B parameters, with reproducible training code and a diverse benchmark (DA-2K) for evaluation.
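As one illustration of off-the-shelf use, the released models can be run through the Hugging Face transformers depth-estimation pipeline; the checkpoint identifier below refers to the community-hosted DA-V2-Small port and is an assumption here rather than part of the original release.

```python
# Off-the-shelf inference sketch via the transformers depth-estimation pipeline.
# The model identifier is an assumed community checkpoint name.
from transformers import pipeline
from PIL import Image

pipe = pipeline(task="depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf")
image = Image.open("example.jpg")
result = pipe(image)
result["depth"].save("depth.png")  # PIL image of the predicted (relative) depth map
```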
7. Impact, Benchmarks, and Future Prospects
DA-V2 represents a paradigm shift in annotation-free monocular depth estimation: with high-quality synthetic supervision, large-scale pseudo labeling, and careful distillation, it matches or exceeds diffusion-based and real-supervised baselines at a fraction of the computational cost. The public release of models, code, and the DA-2K benchmark facilitates transparent, reproducible research and direct comparison for future work (Yang et al., 13 Jun 2024).
Potential directions include domain-adaptive fine-tuning for medical/endoscopic imagery (Li et al., 12 Sep 2024, Han et al., 29 Jan 2024), even larger unified models for vision–language–geometry tasks, and integration with multi-modal prompt-based systems for 3D-aware reasoning and generation.
References
- Yang et al., 13 Jun 2024: Depth Anything V2 (DA-V2).
- For broader context and downstream applications: (Lin et al., 18 Dec 2024, Wang et al., 15 May 2025, Jun et al., 5 Jun 2025, Lin et al., 13 Nov 2025, Huo et al., 7 Jun 2024, Zheng et al., 3 Feb 2024, Cambrin et al., 8 Aug 2024, Li et al., 30 Sep 2025, Li et al., 12 Sep 2024).