Depth Anything V2: Scalable Monocular Depth Estimation

Updated 31 October 2025
  • Depth Anything V2 is a scalable transformer-based model for monocular depth estimation that leverages extensive synthetic data and pseudo-label distillation for robust depth inference.
  • The model employs a teacher-student paradigm, using a high-capacity ViT-G teacher to generate dense pseudo-labels from 62 million real images, which train smaller student models.
  • It achieves superior performance on the DA-2K benchmark with high accuracy, fast inference, and enhanced detail recovery, particularly in challenging scenes with thin structures and complex layouts.

Depth Anything V2 is a scalable, transformer-based foundation model for monocular depth estimation that advances both architectural design and training methodology relative to previous work in open-world depth modeling. It is explicitly designed to robustly infer dense depth from a single image across a wide variety of scenes, including highly challenging settings with thin structures, reflections, and complex spatial layouts. Depth Anything V2 achieves this by leveraging supervised synthetic data, ultra-large-scale pseudo-label distillation, and targeted architectural and loss innovations. Its generalization and accuracy are established by rigorous comparison to diffusion-based and transformer-based alternatives, extensive ablation studies, and the introduction of a high-diversity, high-quality evaluation suite.

1. Model Architecture and Capacity Scaling

Depth Anything V2 employs DINOv2 vision transformers (ViT) as encoders, with four principal model sizes:

Variant    Parameters
ViT-S      ~25M
ViT-B      ~86M
ViT-L      ~304M
ViT-G      ~1.3B

The encoder feeds into a DPT (Dense Prediction Transformer) decoder. The overall structure supports fine spatial detail reconstruction and global scene understanding, and scales cleanly from lightweight to large-capacity deployments.

The training pipeline follows a teacher–student paradigm. The largest model (ViT-G) is trained as a teacher on synthetic labeled data; its output supervises smaller student models on pseudo-labeled real-world images.
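
As a concrete usage sketch (not taken from the paper itself), the snippet below assumes the released checkpoints are accessible through the Hugging Face transformers depth-estimation API; the model id depth-anything/Depth-Anything-V2-Small-hf is an assumed name and should be swapped for whichever ViT-S/B/L variant is actually published.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForDepthEstimation

model_id = "depth-anything/Depth-Anything-V2-Small-hf"  # assumed checkpoint id
processor = AutoImageProcessor.from_pretrained(model_id)
model = AutoModelForDepthEstimation.from_pretrained(model_id).eval()

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    predicted_depth = model(**inputs).predicted_depth  # (1, H', W') relative depth

# Resize the prediction back to the input resolution for visualization.
depth = torch.nn.functional.interpolate(
    predicted_depth.unsqueeze(1),
    size=image.size[::-1],  # PIL size is (W, H); interpolate expects (H, W)
    mode="bicubic",
    align_corners=False,
).squeeze()
```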

2. Training Methodology: Synthetic Pretraining and Large-Scale Pseudo-label Distillation

Distinguishing features of the Depth Anything V2 methodology are:

  1. Exclusive use of synthetic images for ground-truth supervision. Unlike prior models, no labeled real images are used in the initial teacher training; instead, synthetic datasets (BlendedMVS, Hypersim, IRS, TartanAir, vKITTI2; totaling 595K images) provide perfect depths, ensuring the model learns from precise labels even for fine structures and transparent or reflective regions.
  2. Massive-scale pseudo-labeling of real images. The ViT-G teacher generates dense depth maps for 62 million diverse, unlabeled real images (from ImageNet, LSUN, BDD100k, OpenImages, and other sources). Student models are then trained on these pseudo-labels, which transfer both high-fidelity geometric understanding and real-world distribution knowledge.
  3. Loss functions:
    • Scale- and shift-invariant loss ($\mathcal{L}_{ssi}$), robust to global affine ambiguity.
    • Gradient matching loss ($\mathcal{L}_{gm}$), promoting sharply delineated structures, especially effective given the high-quality synthetic depth labels.
    • Feature alignment loss, encouraging student encoders to preserve semantics inherited from DINOv2 pretraining during pseudo-label distillation.
    • Noisy pixel filtering: Per-sample, the highest-loss 10% of pixels are ignored to mitigate the effect of uncertain teacher pseudo-labels.

Gradient matching receives a carefully tuned weight (2:1 relative to the affine-invariant loss) to optimize fine boundary and thin structure reconstruction. Synthetic data is omitted during student training to further enhance real-scene generalization.
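
A minimal sketch of these losses is shown below, assuming batched predictions and (pseudo-)labels of shape (B, H, W) with a float validity mask; the exact normalization and multi-scale details used in the paper (inherited from MiDaS-style training) may differ.

```python
import torch

def align_scale_shift(pred, target, mask):
    """Per-image least-squares scale/shift aligning pred to target over valid pixels."""
    p, g, m = pred.flatten(1), target.flatten(1), mask.flatten(1)
    a00, a01, a11 = (m * p * p).sum(1), (m * p).sum(1), m.sum(1)
    b0, b1 = (m * p * g).sum(1), (m * g).sum(1)
    det = (a00 * a11 - a01 * a01).clamp(min=1e-6)
    s = (a11 * b0 - a01 * b1) / det
    t = (a00 * b1 - a01 * b0) / det
    return s.view(-1, 1, 1) * pred + t.view(-1, 1, 1)

def ssi_loss(pred, target, mask, trim=0.10):
    """Scale- and shift-invariant L1 loss; per sample, the highest-loss `trim`
    fraction of pixels is discarded (noisy-pseudo-label filtering)."""
    err = (align_scale_shift(pred, target, mask) - target).abs() * mask
    # Approximate top-10% filtering with a per-sample quantile over all pixels
    # (a simplification: masked-out pixels contribute zeros to the quantile).
    thresh = torch.quantile(err.flatten(1), 1.0 - trim, dim=1).view(-1, 1, 1)
    keep = (err <= thresh).float() * mask
    return (err * keep).sum() / keep.sum().clamp(min=1.0)

def gradient_matching_loss(pred, target, mask, num_scales=4):
    """Multi-scale gradient matching on the aligned residual, sharpening boundaries."""
    aligned = align_scale_shift(pred, target, mask)
    total = 0.0
    for k in range(num_scales):
        step = 2 ** k
        r = (aligned - target)[:, ::step, ::step]
        m = mask[:, ::step, ::step]
        gx = (r[:, :, 1:] - r[:, :, :-1]).abs() * m[:, :, 1:] * m[:, :, :-1]
        gy = (r[:, 1:, :] - r[:, :-1, :]).abs() * m[:, 1:, :] * m[:, :-1, :]
        total = total + (gx.sum() + gy.sum()) / m.sum().clamp(min=1.0)
    return total

# Combined student objective with the 2:1 gradient-matching weighting described above:
# loss = ssi_loss(pred, target, mask) + 2.0 * gradient_matching_loss(pred, target, mask)
```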

3. Efficiency and Performance Compared to State of the Art

Depth Anything V2 demonstrates significant advantages over both Stable Diffusion-based and transformer-based baselines:

  • Accuracy: For example, on the DA-2K benchmark (high-res, high-diversity dataset focusing on difficult and ambiguous scene regions), ViT-S achieves 95.3% accuracy (vs. 86.8% for Marigold, a top SD-based model).
  • Speed and Model Size: Inference is >10× faster and models are an order of magnitude smaller (ViT-S: 25M vs. Marigold: 4100M).
  • Detail recovery: The combination of synthetic ground-truth and non-noisy student training yields superior performance on thin structures and boundaries relative to both MiDaS and previous Depth Anything.
  • Robustness: Generalizes well to real-world out-of-distribution scenes; student models trained on pseudo-labeled real images outperform models trained with imperfect real ground-truth.

4. Fine-tuning for Metric Depth: Domain Adaptation Strategy

While student models are trained on pseudo-labels aligned up to scale and shift (relative depth), Depth Anything V2 achieves strong metric performance by directly fine-tuning on application-specific labeled datasets (e.g., NYUv2, KITTI, or cleansed synthetic datasets with metric depth). This procedure involves:

  • Freezing the encoder from the pre-trained model,
  • Supervising only the decoder and final regression layers,
  • Using precise metric depth loss tailored to the specific domain.

Performance benchmarks show Depth Anything V2 surpasses AdaBins, ZoeDepth, SwinV2, and other popular alternatives on both NYUv2 and KITTI.
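
The recipe above can be sketched as follows; the attribute names model.encoder and model.decoder are hypothetical, and the SILog loss stands in for whatever metric objective a given head actually uses.

```python
import torch

def build_metric_finetune_optimizer(model, lr=5e-5, weight_decay=1e-2):
    """Freeze the pretrained DINOv2 encoder and optimize only the DPT decoder /
    regression head. `model.encoder` and `model.decoder` are hypothetical names."""
    for p in model.encoder.parameters():
        p.requires_grad_(False)
    trainable = [p for p in model.decoder.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=lr, weight_decay=weight_decay)

def silog_loss(pred_depth, gt_depth, valid_mask, lam=0.85, eps=1e-6):
    """Scale-invariant log (SILog) loss, a standard metric-depth objective used
    here as an illustrative stand-in for the domain-specific loss."""
    d = torch.log(pred_depth.clamp(min=eps)) - torch.log(gt_depth.clamp(min=eps))
    d = d[valid_mask]
    return torch.sqrt((d ** 2).mean() - lam * d.mean() ** 2 + eps)
```

Because only decoder parameters receive gradients, adaptation to NYUv2- or KITTI-style metric supervision stays cheap relative to retraining the full model.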

5. The DA-2K Benchmark: Rigorous Evaluation Protocol

To address the deficiencies of existing benchmarks (noise, limited diversity, low resolution), the paper introduces DA-2K:

  • 1,000 high-resolution images spanning eight scene types (indoor, outdoor, AI-generated, transparent/reflective, adverse styles, aerial, underwater, object-centric),
  • 2,000 carefully selected, sparse, human-verified relative-depth pairs in total (roughly two per image), focused on region pairs where SOTA models disagree most,
  • Designed as a stress test for fine-grained depth reasoning in challenging and ambiguous scenarios.

Model                      DA-2K Accuracy (%)
Marigold (SD-based)        86.8
GeoWizard (SD-based)       88.1
DepthFM (SD-based)         85.8
Depth Anything V1          88.5
Depth Anything V2 ViT-S    95.3
Depth Anything V2 ViT-B    97.0
Depth Anything V2 ViT-L    97.1
Depth Anything V2 ViT-G    97.4
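
For context on how such scores are computed, the sketch below rates a prediction by whether each annotated point pair is ordered correctly; the pair schema (p1, p2, closer) is an assumption for illustration, not the benchmark's actual file format.

```python
import numpy as np

def pair_accuracy(depth_map, pairs):
    """Fraction of annotated pairs whose predicted ordering matches the label.

    depth_map: (H, W) array of predicted depth (larger = farther). If the model
               outputs disparity / inverse depth instead, flip the comparison.
    pairs:     iterable of dicts like {"p1": (y, x), "p2": (y, x), "closer": "p1"}
               -- an assumed schema for illustration only.
    """
    correct = 0
    for pair in pairs:
        d1 = depth_map[pair["p1"]]
        d2 = depth_map[pair["p2"]]
        predicted_closer = "p1" if d1 < d2 else "p2"
        correct += int(predicted_closer == pair["closer"])
    return correct / max(len(pairs), 1)

# Example usage (hypothetical file names):
# depth = np.load("prediction.npy")
# acc = pair_accuracy(depth, annotated_pairs)
```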

6. Key Innovations and Design Choices

Several findings are critical to DA V2's gains:

  • Synthetic-only teacher training is superior to mixing real-world labeled data, as real ground truth is often noisy or incomplete (especially for fine details).
  • Large-capacity teacher models (ViT-G, 1.3B) supply better pseudo-labels, especially for complex or ambiguous regions.
  • Diversity in pseudo-labeled data sources is crucial; single-source or oversampled datasets reduce generalization.
  • Strong weighting of gradient matching loss tightens boundary detail when used with synthetic (not real) ground-truth.
  • No architectural changes are needed between application domains—fine-tuning or swapping heads enables quick adaptation for metric prediction tasks.

7. Applicability and Benchmarking in Broader Domains

A substantial portion of the paper demonstrates that the model matches or surpasses SOTA in:

  • Zero-shot generalization (across NYUv2, KITTI, Sintel, DDAD, ETH3D, DIODE, etc.),
  • Metric depth after domain adaptation or fine-tuning on application-specific data,
  • Real-world transfer tasks, including segmentation and other mid-level perception,
  • Efficient, scalable deployment, including for resource-constrained environments due to lightweight model variants.

Depth Anything V2, through systematic architectural and training design, establishes a new paradigm and a robust empirical foundation for scalable monocular depth estimation, validated rigorously against prior SOTA and under diverse, stress-test conditions (Yang et al., 13 Jun 2024).

References (1)
1. Yang et al., "Depth Anything V2," arXiv:2406.09414, 13 June 2024.