Depth Anything v2: Monocular Depth Estimation

Updated 11 April 2026

Depth Anything v2 is a state-of-the-art monocular depth estimation model leveraging a Vision Transformer encoder and a DPT decoder to predict relative and metric depth from single RGB images.
It employs a synthetic-to-real teacher-student training pipeline with scale- and shift-invariant log loss and gradient matching to ensure accuracy and robustness across diverse domains.
The model offers versatile adaptations for amodal, panoramic, video, and prompt-guided tasks, achieving superior efficiency and real-time performance with minimal architectural modifications.

Depth Anything v2 (DAV2) is a monocular depth estimation family that achieves state-of-the-art efficiency and generalization in predicting relative and metric depth from a single RGB image. It leverages a Vision Transformer (ViT) encoder and a Dense Prediction Transformer (DPT) decoder, trained in a scalable synthetic-to-real framework, and serves as a foundation architecture for a wide range of depth estimation paradigms—including amodal, panoramic, video-consistent, and prompt-guided metric tasks. Depth Anything v2 is characterized by its scale- and shift-invariant learning, multi-scale feature fusion, and its adaptability to diverse domains with minimal architectural modification.

1. Core Architecture

The primary design of Depth Anything v2 employs a standard DINOv2 Vision Transformer backbone at varying scales (ViT-Small, ViT-Base, ViT-Large, ViT-Giant; 25M–1.3B parameters), paired with a lightweight DPT-style upsampling decoder. The processing pipeline is as follows:

An input RGB image (H×W×3) passes through a 7×7, stride-4 convolution to produce patch embeddings.
The resulting tokens are propagated through L transformer layers, with self-attention and MLP sublayers, delivering multi-scale feature maps.
Feature maps from selected ViT layers are reprojected (via 1×1 convs) to spatial resolutions matching those required for decoding.
The DPT decoder fuses these multi-scale features with a feature pyramid structure, using 3×3 convolutions and bilinear upsampling, ultimately regressing a dense (H×W×1) depth map.

Inputs are resized so that the shorter edge is 384 px and then center-cropped to 384×384; RGB channels are normalized by the ImageNet mean and standard deviation (Li et al., 2024, Yang et al., 2024).
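The preprocessing geometry can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the rounding rule is a guess, and the mean and standard deviation are the commonly used ImageNet statistics rather than values quoted from the cited papers.

```python
# Commonly used ImageNet channel statistics (assumed, not quoted from the papers).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def resize_and_crop_geometry(height, width, target=384):
    """Return (new_h, new_w, top, left) for a shorter-edge resize + center crop."""
    scale = target / min(height, width)
    # Guard against rounding the shorter edge below the crop size.
    new_h = max(target, round(height * scale))
    new_w = max(target, round(width * scale))
    top = (new_h - target) // 2
    left = (new_w - target) // 2
    return new_h, new_w, top, left


def normalize_pixel(rgb):
    """Map an 8-bit RGB triple to ImageNet-normalized floats."""
    return tuple(
        (c / 255.0 - m) / s
        for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


# A 480x640 image is resized to 384x512, then cropped starting at column 64.
print(resize_and_crop_geometry(480, 640))
print(normalize_pixel((128, 128, 128)))
```

The crop offsets are what a caller would pass to any image library's crop routine; the actual resampling filter is left unspecified here because the source does not state it.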

2. Synthetic-to-Real Training Pipeline

Depth Anything v2 is distinguished by a two-stage teacher–student training process that unifies synthetic and real sources:

The teacher model (ViT-Giant) is trained on high-quality synthetic 3D datasets (BlendedMVS, Hypersim, IRS, TartanAir, vKITTI-2; 595k images) using a scale- and shift-invariant log loss:

$\mathcal{L}_{ssi} = \frac{1}{N}\sum_i r_i^2 - \frac{1}{N^2}\left(\sum_i r_i\right)^2,\quad r_i = \log d^p_i - \log d^g_i$

and a gradient matching loss

$\mathcal{L}_{gm} = \sum_i \left( \left| \nabla_x d^p_i-\nabla_x d^g_i \right| + \left| \nabla_y d^p_i-\nabla_y d^g_i \right| \right)$

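where $d^p_i$ and $d^g_i$ are the predicted and ground-truth depths at pixel $i$, $N$ is the number of valid pixels, and the gradient matching term is weighted by $\lambda=2$ relative to $\mathcal{L}_{ssi}$ (Yang et al., 2024).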

The trained teacher generates pseudo-labels for 62M real images from diverse corpora; within each image, the highest-loss regions (top 10%) are treated as noisy and excluded from training.
Smaller student models (ViT-S/B/L) are then trained from scratch on only these pseudo-labeled real images for 480k steps, optimizing the same losses without synthetic samples.
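The noisy-region filtering in the second step can be illustrated as a per-image mask over pixel losses. This is a minimal sketch, assuming the filter operates on individual pixels of a flattened loss map; the function names and the tie-breaking behavior are hypothetical and not taken from the paper.

```python
def keep_mask_drop_top_fraction(pixel_losses, drop_fraction=0.10):
    """Return a 0/1 mask that drops the highest-loss fraction of pixels."""
    n = len(pixel_losses)
    n_drop = int(n * drop_fraction)
    if n_drop == 0:
        return [1] * n
    # Indices sorted by descending loss; the first n_drop are treated as noisy.
    order = sorted(range(n), key=lambda i: pixel_losses[i], reverse=True)
    dropped = set(order[:n_drop])
    return [0 if i in dropped else 1 for i in range(n)]


def masked_mean(pixel_losses, mask):
    """Average the loss over the pixels that the mask keeps."""
    kept = [loss for loss, m in zip(pixel_losses, mask) if m]
    return sum(kept) / len(kept)


# Nine well-behaved pixels and one outlier from a bad pseudo-label.
losses = [0.1] * 9 + [5.0]
mask = keep_mask_drop_top_fraction(losses)
print(mask, masked_mean(losses, mask))
```

Dropping a fixed fraction per image, rather than using a global threshold, keeps the filter insensitive to the overall difficulty of each image.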

This pipeline confers strong generalization, fine detail preservation, and open-world robustness (Yang et al., 2024).

3. Loss Functions and Invariance

Depth Anything v2’s primary loss is the scale- and shift-invariant log loss ($\mathcal{L}_{ssi}$), which scores predictions only up to an unknown global affine transformation, so that global scale bias is not penalized. The additional gradient loss ($\mathcal{L}_{gm}$) regularizes local consistency and sharpens edges. For metric depth tasks, the same architecture and training procedure are maintained, but fine-tuning is performed on datasets with metric ground truth, such as NYU-V2 and KITTI, using $\mathcal{L}_{ssi}$ and $\mathcal{L}_{gm}$, leading to accurate absolute depth estimates (Yang et al., 2024).
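The two losses from Section 2 can be written out directly to make the invariance concrete. The sketch below is illustrative only: it uses plain Python lists, forward differences for the gradients, and assumes the terms are combined as $\mathcal{L}_{ssi} + \lambda\,\mathcal{L}_{gm}$; the function names and the finite-difference scheme are assumptions, not the reference implementation. In the log form given above, rescaling the prediction by a constant adds a constant to every $r_i$, which the variance-style expression cancels.

```python
import math


def ssi_log_loss(pred, gt):
    """mean(r^2) - mean(r)^2 with r = log(pred) - log(gt), over flat lists."""
    r = [math.log(p) - math.log(g) for p, g in zip(pred, gt)]
    n = len(r)
    return sum(x * x for x in r) / n - (sum(r) / n) ** 2


def gradient_matching_loss(pred, gt):
    """Sum of absolute differences between forward-difference gradients."""
    h, w = len(pred), len(pred[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            if j + 1 < w:  # horizontal gradient
                total += abs((pred[i][j + 1] - pred[i][j])
                             - (gt[i][j + 1] - gt[i][j]))
            if i + 1 < h:  # vertical gradient
                total += abs((pred[i + 1][j] - pred[i][j])
                             - (gt[i + 1][j] - gt[i][j]))
    return total


def total_loss(pred, gt, lam=2.0):
    """Assumed combination: L_ssi + lam * L_gm on 2D depth maps."""
    flat_p = [v for row in pred for v in row]
    flat_g = [v for row in gt for v in row]
    return ssi_log_loss(flat_p, flat_g) + lam * gradient_matching_loss(pred, gt)


gt = [[1.0, 2.0], [3.0, 4.0]]
scaled = [[3.0 * v for v in row] for row in gt]
# Rescaling the prediction leaves the log loss (numerically) at zero.
print(ssi_log_loss([v for r in scaled for v in r], [v for r in gt for v in r]))
```

Note that the gradient term, as sketched on raw depth values, is unaffected by a constant offset but not by rescaling, which is one reason the two terms complement each other.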

Zero-shot evaluation employs a robust, scale-invariant error computed after aligning the predicted and ground-truth depth maps for arbitrary scale and shift. This is preserved in panoramic and video depth extensions (Jiang et al., 28 Dec 2025, Chen et al., 21 Jan 2025).
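The alignment step used in zero-shot evaluation is commonly implemented as a closed-form least-squares fit of one scale and one shift per image. The following sketch shows that fit on flat lists; it is an assumption that the fit is done directly on the stored values (evaluation protocols often align in disparity space), and the function name is hypothetical.

```python
def align_scale_shift(pred, gt):
    """Least-squares (s, t) minimizing sum((s * pred + t - gt)^2).

    Assumes `pred` is not constant, otherwise the scale is undefined.
    """
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gt) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    cov_pg = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gt))
    s = cov_pg / var_p
    t = mean_g - s * mean_p
    return s, t


pred = [1.0, 2.0, 3.0, 4.0]
gt = [5.0, 7.0, 9.0, 11.0]  # exactly 2 * pred + 3
s, t = align_scale_shift(pred, gt)
aligned = [s * p + t for p in pred]
print(s, t, aligned)
```

After alignment, any standard error metric can be computed on `aligned` versus `gt`, so the reported number reflects structure rather than global scale.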

4. Model Variants and Adaptations

Depth Anything v2’s design enables modular adaptation to specialized domains and tasks, often requiring only minor architectural or input changes. Notable variants include:

Amodal-DAV2: For amodal depth estimation, pairs the core encoder–decoder with extra “guidance” input channels (observed-depth and amodal mask), realized as a zero-initialized parallel convolution in the input layer and the addition of a LayerNorm before decoding. Fine-tuning with $\mathcal{L}_{ssi}$ on all amodal pixels enables robust reasoning about invisible geometry (Li et al., 2024).
Panoramic (DA360): Adapts DAV2 for $360^\circ$ equirectangular projection by substituting all upsampling CNN padding with “circular” padding, and introducing a lightweight MLP on the ViT class token to regress a global shift correction. This ensures spatial continuity at ERP seams and produces scale-invariant disparity for direct 3D reconstruction (Jiang et al., 28 Dec 2025).
Video Depth (VDA): Replaces the DPT decoder with a head containing temporal self-attention layers applied at low resolutions, enforcing consistency across long sequences. Training employs a temporal gradient matching loss and overlapping key-frame inference for super-long video stability (Chen et al., 21 Jan 2025).
Prompting (PromptDA): Incorporates low-cost LiDAR or other metric cues at multi-scale points in the decoder, using lightweight, elementwise addition-based fusion blocks. The rest of the pretrained network remains unchanged, and weak metric supervision data is simulated or pseudo-labeled for training (Lin et al., 2024).
Test-Time Refinement: “Re-Depth Anything” introduces test-time SDS-based self-supervision with re-lighting augmentation, where only intermediate ViT embeddings and the DPT decoder are updated, substantially increasing performance on OOD images (Bhattarai et al., 19 Dec 2025).
Metric Priors Fusion (Prior Depth Anything): Implements a two-stage pixel-wise alignment and refinement pipeline to merge incomplete but accurate priors with full relative depth predictions, using a conditioned MDE network for detail sharpening and denoising (Wang et al., 15 May 2025).
Domain Specialization: Domain-specific LoRA-style adapters, e.g., Vector-LoRA for endoscopy or RVLoRA with Res-DSC for medical images, facilitate data-efficient, non-forgetful adaptation to environments where direct fine-tuning would degrade generalization (Zeinoddin et al., 2024, Li et al., 2024).
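Of the adaptations listed above, the circular padding used for panoramic input is the simplest to show in isolation. In an equirectangular image the left and right borders are neighbors on the sphere, so padding should wrap around horizontally rather than insert zeros. The sketch below assumes wrapping along the width only and uses nested lists as a stand-in for a feature map; how the vertical borders are handled is not specified in the source and is left out.

```python
def pad_erp_row_circular(row, pad):
    """Wrap `pad` columns from each end of a row around to the opposite side."""
    if pad == 0:
        return list(row)
    return row[-pad:] + row + row[:pad]


def pad_erp_map_circular(feature_map, pad):
    """Apply horizontal circular padding to every row of a 2D map."""
    return [pad_erp_row_circular(row, pad) for row in feature_map]


feature_map = [[1, 2, 3],
               [4, 5, 6]]
# Column 3 appears left of column 1, and column 1 right of column 3.
print(pad_erp_map_circular(feature_map, 1))
```

With this padding, a convolution centered on the first column sees the last column as its left neighbor, which is what removes the visible seam in the predicted depth.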

5. Evaluation and Performance Benchmarks

Depth Anything v2 consistently demonstrates superior accuracy and efficiency relative to previous monocular depth estimators:

Relative Depth: On the DA-2K benchmark (open-world depth-pair matching), DA V2-L achieves 97.1% accuracy, outperforming Marigold, GeoWizard, DepthFM, and MiDaS (Yang et al., 2024).
Metric Depth: Fine-tuned on NYU-V2, ViT-Large achieves $\delta_1=0.984$, AbsRel $=0.056$; the corresponding $\delta_1$ and AbsRel figures for KITTI are reported in (Yang et al., 2024).
Efficiency: DA V2-S runs at ≈30 FPS on a 1080 Ti GPU, providing a >10× speedup over diffusion-based models.
Panoramic: DA360 achieves 0.0793 AbsRel (Matterport3D) and 0.0710 (Stanford2D3D), >50% error reduction over unadapted DAV2 and ≈30–37% over PanDA (Jiang et al., 28 Dec 2025).
Video: VDA-L achieves 0.083 AbsRel and 0.944 $\delta_1$ on KITTI video (32 frames per window) at 15 FPS (Chen et al., 21 Jan 2025).
Domain-Specific: DARES (endoscopy) reduces AbsRel on SCARED to 0.052 (Vector-LoRA) from 0.060 (AF-SfMLearner) (Zeinoddin et al., 2024); PriorDA provides SOTA zero-shot depth completion, inpainting, and super-resolution without retraining (Wang et al., 15 May 2025).
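The two metrics quoted throughout this section have standard definitions: AbsRel is the mean of $|d^p_i - d^g_i| / d^g_i$, and $\delta_1$ is the fraction of pixels for which $\max(d^p_i/d^g_i,\ d^g_i/d^p_i) < 1.25$. A minimal sketch on flat lists of positive depths follows; the function names are illustrative, and masking of invalid pixels is omitted.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error; lower is better."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def delta1(pred, gt, threshold=1.25):
    """Fraction of pixels whose depth ratio is within `threshold`; higher is better."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)


pred = [1.0, 2.0, 4.0, 8.0]
gt = [1.0, 2.0, 4.0, 4.0]
# Three pixels are exact; the last one is off by a factor of two.
print(abs_rel(pred, gt), delta1(pred, gt))
```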

6. Ablation Studies and Empirical Insights

Extensive ablation studies underscore critical factors:

Guidance Inputs: Amodal-DAV2 requires both the amodal mask and observed depth, with RMSE dropping from 7.55 (base) to 3.68 (full) (Li et al., 2024).
Full vs. Partial Supervision: Supervising on all pixels, not only occluded regions, improves consistency (RMSE=3.68 vs. 3.85).
Inference Alignment: Post-hoc scale/shift alignment does not improve results, indicating the network learns robust internal scaling.
Architectural Minimalism: Nearly all domain specializations (amodal, panoramic, LiDAR-prompted) require only a few added layers or parameters, with most of the core architecture frozen.
Synthetic Data Superiority: Synthetic GT yields sharper supervision (especially for transparency, thin structures) than real labels; pseudo-labeled real images bridge domain generalization (Yang et al., 2024).
Test-Time Flexibility: Swapping backbone or refinement architecture at inference yields resource–accuracy tradeoffs without retraining (Wang et al., 15 May 2025).
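The guidance-input and minimalism findings rest on a simple property of zero-initialized branches: at the start of fine-tuning, the extra input path contributes nothing, so the network reproduces the pretrained output exactly. The sketch below shows this for a single pixel, treating the input layer as a per-pixel linear map. All weights and names are hypothetical, and a real input layer would be a spatial convolution rather than this simplification.

```python
def linear(weights, x):
    """Apply a weight matrix (list of rows) to an input vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def embed_with_guidance(rgb, guidance, w_rgb, w_guid):
    """Sum of the pretrained RGB branch and the parallel guidance branch."""
    base = linear(w_rgb, rgb)
    extra = linear(w_guid, guidance)
    return [b + e for b, e in zip(base, extra)]


# Hypothetical pretrained weights for a 3-channel input and 2 output features.
w_rgb = [[0.2, -0.1, 0.4],
         [0.5, 0.3, -0.2]]
# Guidance branch (observed depth, amodal mask) starts at zero.
w_guid = [[0.0, 0.0],
          [0.0, 0.0]]

rgb = [0.8, 0.1, 0.3]
guidance = [2.5, 1.0]
# Identical outputs: the new branch is inert until training updates w_guid.
print(embed_with_guidance(rgb, guidance, w_rgb, w_guid))
print(linear(w_rgb, rgb))
```

Starting from an inert branch means fine-tuning begins at the pretrained solution and only departs from it as the guidance weights receive gradient.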

7. Impact, Limitations, and Future Directions

Depth Anything v2 establishes a new state-of-the-art architecture for scalable, generalizable monocular depth foundation models, adapted with minimal domain-specific engineering. Its influence extends to panoramic understanding, video consistency, robotic perception, remote sensing, and human-centered scene understanding.

Noted limitations include compute overhead for massive pseudo-labeling, domain gaps in synthetic data, and residual ambiguity in unknown camera intrinsics and extreme FOVs. Directions for future work include more efficient sampling or curriculum selection, explicit handling of missing intrinsics, and extension to dynamic/temporal SLAM systems and multimodal fusion (Yang et al., 2024, Wang et al., 15 May 2025, Jiang et al., 28 Dec 2025).