
Depth Anything V2: Advanced Depth Estimation

Updated 13 October 2025
  • Depth Anything V2 is a cutting-edge monocular depth estimation framework that leverages synthetic data and a teacher-student paradigm to achieve fine-grained, robust results.
  • The framework integrates a DINOv2 Vision Transformer with a Dense Prediction Transformer decoder and employs specialized losses for scale invariance and edge clarity.
  • Evaluations on benchmarks like DA-2K confirm its efficiency and adaptability in diverse settings, supporting both real-time applications and high-fidelity offline processing.

Depth Anything V2 is an advanced monocular depth estimation framework that establishes a foundation for fine-grained, robust depth prediction in open-world scenes. The design philosophy emphasizes high efficiency, strong generalization, and adaptability across diverse visual domains and application scenarios. The approach consolidates multiple key advances in data utilization, model architecture, supervision, and evaluation protocols, enabling state-of-the-art performance for both relative and metric depth estimation in real and synthetic environments.

1. Data-Centric Supervision and Pseudo-Label Paradigm

Depth Anything V2 departs from traditional supervised frameworks by replacing all real labeled images—often limited by noisy or coarse annotations—with photorealistic synthetic datasets possessing pixel-perfect depth ground truth. These synthetic images include challenging structures (thin objects, boundaries, reflective and transparent surfaces), which are not adequately annotated in typical real-world datasets.

To address the domain shift between synthetic and real data, the method introduces a three-stage teacher-student paradigm:

  1. Teacher Model Training on Synthetic Data: A large-capacity teacher (DINOv2-G backbone, up to 1.3B parameters) is trained exclusively on high-fidelity synthetic image–depth pairs, exploiting the data’s precision for detailed supervision.
  2. Pseudo-Label Generation: The teacher is used to annotate depth for a massive corpus of unlabeled real-world images (up to 62 million).
  3. Student Model Training: Parameter-efficient student models (ranging from 25M to 1.3B parameters) are supervised with these high-quality pseudo labels, inheriting the teacher’s fine-grained representation and adapting it for real image generalization.

This supervision strategy decouples model capacity from training data quantity and quality, distributing heavy computational requirements to the teacher stage while deploying efficient student variants for broad applications (Yang et al., 13 Jun 2024).
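
The control flow of this paradigm can be illustrated with a short, self-contained sketch. Everything below is a placeholder chosen for brevity: random tensors stand in for the synthetic and real image corpora, and a small convolutional stack stands in for the DINOv2 encoder and DPT decoder; only the three-stage structure mirrors the published pipeline.

```python
# Minimal sketch of the three-stage teacher-student pipeline (placeholders only).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset


def make_depth_net(width: int) -> nn.Module:
    # Stand-in for a DINOv2 encoder + DPT decoder: any dense regressor fits here.
    return nn.Sequential(
        nn.Conv2d(3, width, 3, padding=1), nn.ReLU(),
        nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
        nn.Conv2d(width, 1, 3, padding=1),
    )


def train(model: nn.Module, loader: DataLoader, epochs: int = 1, lr: float = 1e-4) -> nn.Module:
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, depths in loader:
            opt.zero_grad()
            loss = (model(images) - depths).abs().mean()  # placeholder L1 objective
            loss.backward()
            opt.step()
    return model


# Stage 1: train a high-capacity teacher on synthetic image-depth pairs.
synthetic = TensorDataset(torch.rand(16, 3, 64, 64), torch.rand(16, 1, 64, 64))
teacher = train(make_depth_net(width=128), DataLoader(synthetic, batch_size=4))

# Stage 2: pseudo-label a pool of unlabeled real images with the frozen teacher.
unlabeled = torch.rand(32, 3, 64, 64)
with torch.no_grad():
    pseudo_depth = teacher(unlabeled)

# Stage 3: supervise a smaller student with the teacher's pseudo labels.
student = train(make_depth_net(width=32),
                DataLoader(TensorDataset(unlabeled, pseudo_depth), batch_size=4))
```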

2. Model Architecture and Loss Formulation

The architecture combines a DINOv2 Vision Transformer (ViT) encoder for feature extraction with a Dense Prediction Transformer (DPT) decoder for depth regression. The design scales across model capacities: lightweight variants such as the ViT-Small student serve latency-sensitive real-time robotics, while the ViT-Giant variant targets high-fidelity, large-scale offline processing.
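
For practical orientation, the sketch below runs a pretrained checkpoint through the Hugging Face transformers depth-estimation pipeline. The checkpoint identifier is an assumption and should be verified against the Hub; the pipeline interface itself is standard.

```python
# Inference sketch via the Hugging Face depth-estimation pipeline.
# The checkpoint name is an assumed Hub identifier; swap in the Small/Base/Large
# variant that matches the latency budget (ViT-S for real-time, larger offline).
from transformers import pipeline
from PIL import Image

depth_estimator = pipeline(
    task="depth-estimation",
    model="depth-anything/Depth-Anything-V2-Small-hf",  # assumed identifier
)

image = Image.open("scene.jpg")               # any RGB image
result = depth_estimator(image)
relative_depth = result["predicted_depth"]    # torch.Tensor, affine-invariant
depth_visual = result["depth"]                # PIL image for quick visualization
```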

Training is governed by:

  • Scale- and Shift-Invariant Loss ($\mathcal{L}_{ssi}$): Enforces affine invariance, essential for relative depth estimation across variable camera calibrations or unknown scales.
  • Gradient Matching Loss ($\mathcal{L}_{gm}$): Promotes sharp recovery of depth discontinuities and fine structures. The full objective is:

$$\mathcal{L}_{total} = \mathcal{L}_{ssi} + \lambda \cdot \mathcal{L}_{gm}$$

where $\lambda$ is tuned to balance global accuracy against local detail sharpness.

  • Feature Alignment Loss: During pseudo-label training, this auxiliary term preserves semantic consistency between the deep features of teacher and student, leveraging the pre-trained semantics of DINOv2.
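
The supervision terms above can be sketched as follows. This is an illustrative, MiDaS-style formulation rather than the released training code; the closed-form alignment, the number of scales, and the weights (lam, beta) are assumptions.

```python
# Illustrative MiDaS-style objectives; weights and scale count are assumptions.
import torch
import torch.nn.functional as F


def align_scale_shift(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Closed-form per-sample scale s and shift b minimizing ||s*pred + b - target||^2.
    pred, target: (B, H, W)."""
    B = pred.shape[0]
    p, t = pred.reshape(B, -1), target.reshape(B, -1)
    n = p.shape[1]
    sp, st = p.sum(1), t.sum(1)
    s = (n * (p * t).sum(1) - sp * st) / (n * (p * p).sum(1) - sp ** 2 + 1e-8)
    b = (st - s * sp) / n
    return pred * s.view(B, 1, 1) + b.view(B, 1, 1)


def ssi_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Scale- and shift-invariant term: align the prediction, then penalize residuals.
    return (align_scale_shift(pred, target) - target).abs().mean()


def gradient_matching_loss(pred: torch.Tensor, target: torch.Tensor, scales: int = 4) -> torch.Tensor:
    # Multi-scale gradients of the aligned residual sharpen depth discontinuities.
    r = align_scale_shift(pred, target) - target
    loss = 0.0
    for k in range(scales):
        rk = r[:, ::2 ** k, ::2 ** k]
        loss = loss + (rk[:, :, 1:] - rk[:, :, :-1]).abs().mean() \
                    + (rk[:, 1:, :] - rk[:, :-1, :]).abs().mean()
    return loss / scales


def feature_alignment_loss(student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
    # Cosine alignment of student features to the frozen teacher's features.
    return 1.0 - F.cosine_similarity(student_feat, teacher_feat.detach(), dim=-1).mean()


def total_loss(pred, target, student_feat, teacher_feat, lam=0.5, beta=0.1):
    # L_total = L_ssi + lambda * L_gm, plus an auxiliary feature-alignment term
    # during pseudo-label training; lam and beta here are illustrative values.
    return (ssi_loss(pred, target)
            + lam * gradient_matching_loss(pred, target)
            + beta * feature_alignment_loss(student_feat, teacher_feat))


# Quick check on random tensors.
pred = torch.rand(2, 64, 64, requires_grad=True)
gt = torch.rand(2, 64, 64)
fs, ft = torch.rand(2, 196, 384), torch.rand(2, 196, 384)
total_loss(pred, gt, fs, ft).backward()
```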

Models can be fine-tuned with in-domain metric depth labels (e.g., Hypersim, Virtual KITTI) for applications requiring absolute scale. The resulting variants demonstrate strong metric accuracy on NYU-D and KITTI (Yang et al., 13 Jun 2024).

3. Generalization, Efficiency, and Benchmarking

Depth Anything V2 achieves efficient inference—over 10× faster than Stable Diffusion (SD)-based models—while yielding superior spatial detail recovery and generalization, especially on thin/complex structures and in the presence of domain shift. This is substantiated by evaluations on the DA-2K benchmark, a newly curated, high-resolution test suite:

  • DA-2K Benchmark: Features diverse conditions, including adverse weather, AI-generated images, and underwater/aerial scenes. Annotations are obtained through automated pair selection (using SAM masks and ensemble teacher models), then validated by human experts.
  • The best models demonstrate a >10% improvement in relative depth discrimination on DA-2K over strong SD-based competitors (e.g., Marigold), with significantly lower computational footprint (Yang et al., 13 Jun 2024).
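
Because DA-2K is annotated as sparse point pairs rather than dense maps, relative depth discrimination reduces to an ordinal check per pair. The sketch below shows one plausible scoring routine; the annotation layout and the depth convention (larger value = farther) are assumptions and must match the model's actual output.

```python
# Ordinal scoring sketch for sparse point-pair annotations (DA-2K style).
# Assumes larger depth value = farther; flip the comparison for disparity-style
# (inverse-depth) outputs.
import numpy as np


def pairwise_depth_accuracy(pred_depth: np.ndarray, pairs: list) -> float:
    """pred_depth: (H, W) predicted depth map.
    pairs: list of ((y1, x1), (y2, x2), closer) with closer in {1, 2}."""
    hits = 0
    for (y1, x1), (y2, x2), closer in pairs:
        predicted_closer = 1 if pred_depth[y1, x1] < pred_depth[y2, x2] else 2
        hits += int(predicted_closer == closer)
    return hits / max(len(pairs), 1)


# Toy example: two annotated pairs on a random prediction.
depth = np.random.rand(480, 640)
annotations = [((10, 20), (100, 200), 1), ((5, 5), (300, 400), 2)]
print(pairwise_depth_accuracy(depth, annotations))
```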

Smaller models operate in under 10 ms per frame on an A100, supporting deployment at ≥30 FPS for real-time embedded or robotics platforms (Chen et al., 21 Jan 2025).
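
A latency figure of this kind is typically verified with a warmed-up, synchronized timing loop such as the sketch below; the input resolution and iteration counts are assumptions, and `model` can be any dense depth network.

```python
# Warmed-up, synchronized latency measurement for a depth model (sketch).
import time
import torch


def measure_latency_ms(model: torch.nn.Module,
                       input_size=(1, 3, 518, 518),  # assumed inference resolution
                       iters: int = 50) -> float:
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.eval().to(device)
    x = torch.randn(*input_size, device=device)
    with torch.no_grad():
        for _ in range(10):                      # warm-up passes
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if device == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000.0 / iters


# Any dense depth network can be timed; a trivial conv stands in here.
print(measure_latency_ms(torch.nn.Conv2d(3, 1, 3, padding=1)))
```

Frames per second then follows as 1000 divided by the measured per-frame latency in milliseconds.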

4. Extensions and Downstream Applications

Several works build on Depth Anything V2, confirming its versatility:

  • Metric Depth Estimation: Fine-tuning on pseudo-labeled or synthetic metric datasets enables absolute depth prediction with sharp outputs. Prompt-fusion approaches can efficiently integrate sensors like LiDAR as “prompts,” embedding metric cues at multiple scales within the decoder for up to 4K output resolution. The edge-aware loss (including both L1 and gradient terms) is critical for aligning sharp features from pseudo ground-truth (Lin et al., 18 Dec 2024).
  • Adverse Condition Robustness: Unsupervised consistency regularization with simulated perturbations (lighting, weather, blur) and spatial distance constraints is used to finetune robustness, yielding superior generalization across real and synthetic benchmarks in diverse environmental conditions (Sun et al., 2 Jul 2025).
  • Semantic Segmentation Fusion: Depth features from Depth Anything V2 can be fused into Vision Foundation Models (VFMs) (e.g., DINOv2) via depth-aware tokens and refinement decoders, boosting domain generalization for semantic segmentation, notably in extreme conditions with weak visual cues (Chen et al., 17 Apr 2025).
  • Specialized Domains: Fine-tuned versions deliver state-of-the-art performance for event-based depth estimation (Bartolomei et al., 18 Sep 2025), zero-shot remote sensing canopy height mapping (Cambrin et al., 8 Aug 2024), and robust monocular surgical navigation via LoRA-based adaptation and multi-scale SSIM losses (Zeinoddin et al., 30 Aug 2024, Li et al., 12 Sep 2024).
  • Temporal Consistency in Video: The introduction of a spatial-temporal head, temporal self-attention layers, and a temporal gradient matching loss (TGM) enables temporally consistent depth estimation in arbitrarily long videos, addressing scale drift via an overlapping inference scheme and key-frame referencing (Chen et al., 21 Jan 2025).
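
As an illustration of the temporal gradient matching idea, the sketch below penalizes mismatches between frame-to-frame depth changes in the prediction and in the reference sequence. This is a plausible formulation consistent with the description above, not the exact published loss.

```python
# Illustrative temporal gradient matching (TGM) style loss for video depth.
import torch


def temporal_gradient_matching_loss(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred, target: (B, T, H, W) depth sequences aligned to a common scale.
    Penalizes mismatches in frame-to-frame depth change, discouraging flicker
    and scale drift rather than constraining each frame independently."""
    pred_dt = pred[:, 1:] - pred[:, :-1]        # temporal gradient of prediction
    target_dt = target[:, 1:] - target[:, :-1]  # temporal gradient of reference
    return (pred_dt - target_dt).abs().mean()


# Toy clip of 8 frames.
pred = torch.rand(2, 8, 64, 64, requires_grad=True)
target = torch.rand(2, 8, 64, 64)
temporal_gradient_matching_loss(pred, target).backward()
```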

5. Ablation, Comparative Results, and Performance Analysis

Extensive ablations validate each component:

  • Teacher capacity scaling (ViT-Giant) is essential for transferring fine-grained knowledge to student models.
  • Pseudo-labeling with synthetic-pretrained teachers outperforms direct real-data supervision, compensating for the lack of high-quality ground-truth in many domains.
  • Gradient loss weighting and feature alignment substantially impact contour precision and cross-domain generalization.
  • For downstream applications such as wildlife monitoring, Depth Anything V2 achieves mean absolute errors as low as 0.454m with a correlation of 0.962, outperforming alternatives and maintaining favorable runtime (0.22s per image on RTX 4090). Median aggregation of predictions further improves robustness to outliers in outdoor conditions (Niccoli et al., 6 Oct 2025).
  • Robustness improvements for adverse weather and extreme conditions are empirically supported by accuracy gains (e.g., +2.5% absRel in ACDepth for night/rain in nuScenes) (Jiang et al., 18 May 2025).

6. Limitations and Future Directions

Although the reliance on synthetic data and teacher-student distillation addresses many label-quality and scalability problems, several challenges remain:

  • Domain-specific Fine-tuning: While models generalize well, dedicated adaptation (e.g., for wildlife or medical imagery) may yield further improvements.
  • Cross-modal Expansion: The current pipeline distills primarily from RGB images. Event-based and other multimodal setups can adopt the same general approach, building on the demonstrated cross-modal distillation paradigm (Bartolomei et al., 18 Sep 2025).
  • Temporal and Adverse Robustness: While significant advances have been made in temporal consistency (Video Depth Anything) and environmental resilience, combining these with spatial priors and multi-sensor guidance (e.g., prompted by auxiliary LiDAR) remains an active direction.
  • Resource Optimization: Scaling down student models without significant loss of accuracy, and improving memory and computational efficiency, remain priorities for embedded and real-time applications.

7. Broader Implications and Ecosystem Impact

Depth Anything V2 sets a high standard for foundation models in geometry-centric computer vision, offering robust, adaptable depth representations that serve as a backbone for advanced perception, 3D reconstruction, and semantic parsing tasks. Its modularity—distilling rich geometric priors into efficient models, and accepting a spectrum of supervision from synthetic, real, and cross-modal labels—positions it as a core component of evolving vision foundation model ecosystems. The release of high-diversity benchmarks (DA-2K) and open-source codebases accelerates research on both geometry-aware AI and applied depth estimation across scientific, industrial, and conservation domains (Yang et al., 13 Jun 2024, Wang et al., 15 May 2025, Niccoli et al., 6 Oct 2025, Chen et al., 21 Jan 2025, Sun et al., 2 Jul 2025).
