DA-V2: Advanced Monocular Depth Estimation

Updated 26 December 2025
  • Depth Anything V2 (DA-V2) is a monocular depth estimation model that leverages synthetic data and pseudo-labels to generate highly accurate and robust depth maps.
  • It employs a three-stage training pipeline consisting of a synthetic-trained teacher, large-scale pseudo-labeling of 62M real images, and student model distillation.
  • DA-V2 achieves state-of-the-art zero-shot and fine-tuned performance, demonstrating scalability, efficiency, and practical applicability across various benchmarks.

Depth Anything V2 (DA-V2) is a monocular depth estimation foundation model that achieves state-of-the-art performance by leveraging pure synthetic supervision, large-scale pseudo-labeled real images, and a high-capacity transformer backbone. DA-V2 marks a substantial advancement over prior monocular estimators by producing accurate, robust, and visually coherent depth maps entirely without reliance on real-world labeled data or diffusion-based inference. The model is architected for scalability, efficiency, and generalization, with broad applicability in both zero-shot and fine-tuned settings across vision, robotics, and generative tasks.

1. Pipeline Overview and Motivation

DA-V2 uses a three-stage architecture and training regime specifically designed to eliminate the need for costly real-world depth annotations:

  1. Teacher Model Training on Synthetic Data: All labeled supervision is replaced with synthetic RGB-depth datasets—BlendedMVS, Hypersim, IRS, TartanAir, and Virtual KITTI 2—providing 595K perfectly aligned pairs. The teacher uses a very large vision transformer (ViT-G, 1.3B parameters) with a DPT-style decoder, trained with scale- and shift-invariant (MiDaS) and gradient-matching losses.
  2. Large-Scale Pseudo-Labeling of Real Images: The synthetic-trained teacher is run on 62 million real, unlabeled images sampled from eight diverse datasets (e.g., ImageNet-21K, LSUN, SA-1B). The teacher produces inverse depth pseudo-labels, filtering out the 10% most unreliable pixels per sample, thus generating a massive, automatically annotated dataset with broad scene diversity.
  3. Student Model Distillation on Pseudo-Labeled Data: Specialized student models (ViT-Small, ViT-Base, ViT-Large, and ViT-Giant; 25M–1.3B parameters) are distilled from the teacher on the pseudo-labeled real images. Core to this stage is the use of MiDaS losses and a feature alignment term that preserves rich semantic information from the DINOv2 backbone.

This synthetic → pseudo-real → distilled real training path obviates the need for real labeled depth and demonstrates high fidelity across diverse in-the-wild inputs (Yang et al., 13 Jun 2024).
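The recipe above can be summarized in a short, runnable sketch. Everything here is illustrative: `TinyDepthNet`, the mean-subtracted L1 proxy loss, and the random tensors are stand-ins for the real ViT+DPT models, MiDaS-style losses, and datasets, not the released training code.

```python
# Schematic sketch of the three-stage DA-V2 recipe (illustrative only).
import torch
import torch.nn as nn

class TinyDepthNet(nn.Module):
    """Placeholder depth network: maps RGB (B, 3, H, W) to inverse depth (B, H, W)."""
    def __init__(self, width=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, 1, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x).squeeze(1)

def ssi_l1(pred, target):
    """Mean-subtracted L1 as a simple scale/shift-invariant proxy."""
    pred = pred - pred.mean(dim=(1, 2), keepdim=True)
    target = target - target.mean(dim=(1, 2), keepdim=True)
    return (pred - target).abs().mean()

teacher, student = TinyDepthNet(), TinyDepthNet()
opt_t = torch.optim.AdamW(teacher.parameters(), lr=1e-4)
opt_s = torch.optim.AdamW(student.parameters(), lr=1e-4)

# Stage 1: train the teacher on synthetic RGB-depth pairs (random tensors as stand-ins).
syn_rgb, syn_depth = torch.rand(4, 3, 64, 64), torch.rand(4, 64, 64)
loss = ssi_l1(teacher(syn_rgb), syn_depth)
opt_t.zero_grad(); loss.backward(); opt_t.step()

# Stage 2: pseudo-label unlabeled real images with the frozen teacher.
real_rgb = torch.rand(4, 3, 64, 64)
with torch.no_grad():
    pseudo_depth = teacher(real_rgb)

# Stage 3: distill the student on the pseudo-labeled real images.
loss = ssi_l1(student(real_rgb), pseudo_depth)
opt_s.zero_grad(); loss.backward(); opt_s.step()
```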

2. Model Architecture and Losses

Encoder–Decoder Backbone

  • Encoder: DINOv2-ViT (S/B/L/G), pre-trained on unlabeled data for extensive semantic priors.
  • Decoder: DPT-style four-stage multi-scale fusion, upsampling, and convolution, permitting high-resolution, sharp depth maps at full image size.
  • Depth Head: Single 1×1 convolution outputs a dense per-pixel depth field.
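The wiring above can be illustrated with a simplified, runnable stand-in: a four-stage DPT-style fusion decoder that projects multi-scale encoder features to a common width, upsamples coarse features onto finer ones, and finishes with a 1×1 depth head. Channel widths, scales, and layer choices below are assumptions for illustration, not the released architecture.

```python
# Schematic DPT-style decoder over multi-scale encoder features (coarse -> fine).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DPTStyleDecoder(nn.Module):
    def __init__(self, in_dims=(512, 256, 128, 64), dim=64):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(d, dim, 1) for d in in_dims)
        self.fuse = nn.ModuleList(nn.Conv2d(dim, dim, 3, padding=1) for _ in in_dims[1:])
        self.head = nn.Conv2d(dim, 1, 1)  # 1x1 depth head -> dense per-pixel output

    def forward(self, feats):
        # feats: list of encoder feature maps, coarsest resolution first
        x = self.proj[0](feats[0])
        for proj, fuse, f in zip(self.proj[1:], self.fuse, feats[1:]):
            x = F.interpolate(x, size=f.shape[-2:], mode="bilinear", align_corners=False)
            x = fuse(x + proj(f))  # merge upsampled coarse path with finer features
        x = F.interpolate(x, scale_factor=4, mode="bilinear", align_corners=False)
        return self.head(x).squeeze(1)

# Dummy multi-scale features standing in for DINOv2 token maps at four stages.
feats = [torch.rand(1, d, s, s) for d, s in zip((512, 256, 128, 64), (8, 16, 32, 64))]
print(DPTStyleDecoder()(feats).shape)  # torch.Size([1, 256, 256])
```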

Training Losses

  • Scale- and Shift-Invariant Loss (MiDaS):

$$\mathcal{L}_{ssi}(d, \hat{d}) = \frac{1}{N}\sum_i \left| (d_i - \bar{d}) - (\hat{d}_i - \overline{\hat{d}}) \right|$$

where $d$ and $\hat{d}$ denote the predicted and pseudo-labeled inverse depths, respectively; per-image means are subtracted to enforce invariance.

  • Gradient-Matching Loss:

$$\mathcal{L}_{gm}(d, \hat{d}) = \frac{1}{N} \sum_i \left( \left|\partial_x d_i - \partial_x \hat{d}_i\right| + \left|\partial_y d_i - \partial_y \hat{d}_i\right| \right)$$

promoting sharpness and boundary integrity.

  • Feature Alignment Loss:

$$\mathcal{L}_{fa}(f_s, f_t) = \frac{1}{M}\sum_{j=1}^{M} \left\| f_s^j - f_t^j \right\|_2^2$$

leveraging mid-level feature similarity to transfer teacher semantics.

Student training minimizes the weighted sum $\mathcal{L} = \mathcal{L}_{ssi} + \lambda_{gm}\,\mathcal{L}_{gm} + \lambda_{fa}\,\mathcal{L}_{fa}$, with $\lambda_{gm} = 2.0$ and $\lambda_{fa} \approx 1.0$ (Yang et al., 13 Jun 2024).
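These losses translate directly into a few lines of PyTorch. The sketch below mirrors the formulas above (finite differences stand in for $\partial_x$ and $\partial_y$, and the masking of unreliable pseudo-label pixels is omitted); it is an illustrative implementation, not code from the official repository.

```python
import torch

def ssi_loss(d, d_hat):
    """Scale/shift-invariant L1: subtract per-image means before comparing inverse depths."""
    d = d - d.mean(dim=(-2, -1), keepdim=True)
    d_hat = d_hat - d_hat.mean(dim=(-2, -1), keepdim=True)
    return (d - d_hat).abs().mean()

def gradient_matching_loss(d, d_hat):
    """L1 difference of horizontal/vertical finite differences (sharpness, boundaries)."""
    dx = (d[..., :, 1:] - d[..., :, :-1]) - (d_hat[..., :, 1:] - d_hat[..., :, :-1])
    dy = (d[..., 1:, :] - d[..., :-1, :]) - (d_hat[..., 1:, :] - d_hat[..., :-1, :])
    return dx.abs().mean() + dy.abs().mean()

def feature_alignment_loss(f_s, f_t):
    """Mean squared L2 distance between student and teacher/DINOv2 feature tokens."""
    return ((f_s - f_t) ** 2).sum(dim=-1).mean()

def total_loss(d, d_hat, f_s, f_t, lam_gm=2.0, lam_fa=1.0):
    return (ssi_loss(d, d_hat)
            + lam_gm * gradient_matching_loss(d, d_hat)
            + lam_fa * feature_alignment_loss(f_s, f_t))

# Example with dummy predictions (B, H, W) and feature tokens (B, M, C).
d, d_hat = torch.rand(2, 64, 64), torch.rand(2, 64, 64)
f_s, f_t = torch.rand(2, 256, 768), torch.rand(2, 256, 768)
print(total_loss(d, d_hat, f_s, f_t).item())
```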

3. Data Strategy: Synthetic Pretraining and Pseudo-Label Scaling

DA-V2’s primary innovation is achieving robust generalization using only synthetic and pseudo-labeled real data:

  • Synthetic Data (595K images): Each source comprises photorealistic RGB/depth pairs with full 3D mesh alignment and fine structure.
  • Real-World Pseudo-Labels (62M images): True depth in real images is approximated by teacher predictions, after scale rectification and outlier filtering. The student models never see real labeled depth, yet learn directly on real-world data statistics and structure.
  • Noise Mitigation: Masking high-loss pixels in the pseudo labels, together with the feature alignment term, counteracts outliers and distributional drift (a minimal masking sketch appears after this section).

The result is a student capable of state-of-the-art zero-shot and fine-tuned depth inference in real, unconstrained scenarios.
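A minimal sketch of the per-sample masking referenced above, under the assumption that reliability is scored by a per-pixel error or disagreement map (the paper's exact criterion is not reproduced here): the 10% highest-error pixels in each sample are excluded from the student's loss.

```python
import torch

def reliability_mask(per_pixel_error: torch.Tensor, drop_frac: float = 0.10) -> torch.Tensor:
    """Boolean mask keeping the (1 - drop_frac) most reliable pixels per sample.

    per_pixel_error: (B, H, W) map where larger values mean a less trustworthy pseudo-label.
    """
    b = per_pixel_error.shape[0]
    flat = per_pixel_error.reshape(b, -1)
    # Per-sample threshold at the (1 - drop_frac) quantile of the error distribution.
    thresh = torch.quantile(flat, 1.0 - drop_frac, dim=1, keepdim=True)
    return (flat <= thresh).reshape_as(per_pixel_error)

# Usage: multiply or index the per-pixel loss by this mask so dropped pixels
# do not contribute to the student's gradient.
err = torch.rand(4, 64, 64)
mask = reliability_mask(err)
print(mask.float().mean().item())  # ~0.9
```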

4. Empirical Performance and Benchmarks

DA-V2 sets new standards on both zero-shot and task-specific depth estimation:

  • Zero-Shot Relative Depth (DA-2K benchmark): DA-V2-L achieves 97.1% accuracy; prior Stable Diffusion-based methods such as Marigold and GeoWizard reach ≤88.1%.
  • Absolute Metric Depth (fine-tuned): On NYUv2 and KITTI (standard metric definitions are sketched after this list):
    • NYUv2: AbsRel 0.056, RMSE 0.206, $\delta_1$ 0.984
    • KITTI: AbsRel 0.046, RMSE 1.896, $\delta_1$ 0.982
    • These results surpass in-domain baselines and diffusion-based competitors (Yang et al., 13 Jun 2024).
  • Speed: DA-V2-Small delivers real-time (≤30 ms per image on RTX 3090) throughput, 10–15× faster than Marigold and other diffusion-based methods.
  • Qualitative: Outputs are sharper, less noisy, and more robust to challenging content (thin structures, unusual styles, adverse weather) than previous MiDaS-style networks.
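For reference, AbsRel, RMSE, and $\delta_1$ follow the standard monocular-depth definitions; the sketch below shows how they are conventionally computed over valid ground-truth pixels (these are the community-standard formulas, not evaluation code from the paper).

```python
import torch

def depth_metrics(pred: torch.Tensor, gt: torch.Tensor, min_depth: float = 1e-3):
    """Standard monocular depth metrics, computed on valid ground-truth pixels."""
    valid = gt > min_depth
    pred, gt = pred[valid], gt[valid]
    abs_rel = ((pred - gt).abs() / gt).mean()          # mean absolute relative error
    rmse = torch.sqrt(((pred - gt) ** 2).mean())        # root mean squared error
    ratio = torch.maximum(pred / gt, gt / pred)
    delta1 = (ratio < 1.25).float().mean()              # fraction within a 1.25 ratio
    return {"AbsRel": abs_rel.item(), "RMSE": rmse.item(), "delta1": delta1.item()}

# Example on dummy metric-depth maps (values in meters).
pred, gt = torch.rand(1, 480, 640) * 10 + 0.5, torch.rand(1, 480, 640) * 10 + 0.5
print(depth_metrics(pred, gt))
```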

A summary table for DA-2K evaluation:

| Model       | DA-2K Accuracy (%) |
|-------------|--------------------|
| DA-V2-Small | 95.3               |
| DA-V2-Base  | 97.0               |
| DA-V2-Large | 97.1               |
| DA-V2-Giant | 97.4               |
| Marigold    | 86.8               |
| GeoWizard   | 88.1               |

5. Design Choices, Ablations, and Limitations

Ablation studies confirm:

  • A synthetic-only teacher, without pseudo-labeled real data, transfers poorly (KITTI AbsRel 0.104 vs. 0.078 with pseudo-real data).
  • Stronger gradient-matching loss directly improves thin-structure fidelity.
  • Students trained with only pseudo labels often outperform those mixing synthetic and pseudo-real images.
  • Speed/accuracy trade-offs are directly controlled by model scale.

Known limitations:

  • Despite scale invariance, local scale ambiguity remains for challenging scenes (e.g., strong reflectors).
  • The pipeline does not render its own synthetic data: all synthetic sources must already provide ground-truth depth.
  • Pseudo-label validity is primarily regulated by masking, but sampling errors can still propagate for unusual content.

6. Applications and Extensions

DA-V2 is foundational for numerous compositional and generative pipelines, from domain-specific fine-tuning to 3D-aware reasoning and generation (see Section 7).
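As a practical entry point for such pipelines, released DA-V2 checkpoints can be run for zero-shot relative depth with a few lines via the Hugging Face `transformers` depth-estimation pipeline. The checkpoint identifier below is an assumption; verify the exact model ID on the Hub.

```python
# Zero-shot relative-depth inference with a released DA-V2 checkpoint.
# The model ID below is an assumption; check the Hugging Face Hub for the exact name.
from PIL import Image
import requests
from transformers import pipeline

depth_estimator = pipeline(
    task="depth-estimation",
    model="depth-anything/Depth-Anything-V2-Small-hf",  # assumed checkpoint ID
)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

result = depth_estimator(image)
result["depth"].save("depth.png")       # rendered depth map as a PIL image
print(result["predicted_depth"].shape)  # raw per-pixel depth tensor
```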

7. Impact, Benchmarks, and Future Prospects

DA-V2 represents a paradigm shift in self-supervised monocular depth—showing that, with high-quality synthetic supervision, large-scale pseudo labeling, and careful distillation, it is possible to match or exceed the performance of diffusion-based and real-supervised baselines at a fraction of the computational cost. The public release of models, code, and the DA-2K benchmark facilitates transparent, reproducible research and direct comparison for future work (Yang et al., 13 Jun 2024).

Potential directions include domain-adaptive fine-tuning for medical/endoscopic imagery (Li et al., 12 Sep 2024, Han et al., 29 Jan 2024), even larger unified models for vision–language–geometry tasks, and integration with multi-modal prompt-based systems for 3D-aware reasoning and generation.

