Depth Anything V2 Metric-Outdoor
- Depth Anything V2 Metric-Outdoor is a monocular metric depth estimation model for outdoor scenes that leverages synthetic pretraining, massive pseudo-labeling, and targeted fine-tuning.
- It utilizes a DINOv2 vision transformer backbone with a DPT-style decoder to achieve high-resolution, low-latency inference and state-of-the-art results in key outdoor benchmarks.
- The model is applicable to various domains including autonomous driving, remote sensing, and wildlife monitoring, providing dense, accurate metric depth maps from single RGB images.
Depth Anything V2 Metric-Outdoor is a monocular metric depth estimation model optimized for outdoor scene understanding by leveraging a staged synthetic-pretraining and large-scale pseudo-labeling pipeline, followed by targeted metric fine-tuning. It represents an influential, highly efficient approach for generating dense metric depth maps from single RGB images in outdoor imagery, underpinned by the DINOv2 vision transformer backbone and DPT-style decoder. The model family covers a range of scales (from ~25M to 1.3B parameters), supporting high-resolution, low-latency inference and demonstrating robust generalization across benchmarks and applications including autonomous driving, remote sensing, and ecological monitoring.
1. Training Framework and Architectural Overview
Depth Anything V2 is constructed around a three-stage learning pipeline:
- Synthetic-Only Teacher Training: The initial teacher model is trained purely on synthetic datasets (Hypersim, Virtual KITTI 2, BlendedMVS, IRS, TartanAir; ~595K images), where ground-truth metric depth is perfectly known. The backbone is DINOv2-G (1.3B params), coupled to a DPT decoder. Two primary loss terms are used: a scale- and shift-invariant loss (L_ssi) and a gradient matching term (L_gm) to promote sharp boundaries.
- Massive Pseudo-Label Generation: The frozen teacher infers inverse-depth on 62M web-scale, unlabeled real images spanning multiple domains, producing large-scale pseudo-labels for subsequent distillation.
- Student Distillation and Metric Fine-Tuning: Lightweight student models (DINOv2-S, -B, -L, -G) are trained to regress the teacher's pseudo-labels with loss masking (top 10% high residual pixels excluded), enforcing robustness to label noise. To enable metric depth prediction for outdoor scenes, these students are then fine-tuned on Virtual KITTI 2 using a direct loss on meter-valued depth, optionally augmented with a scale-invariant penalty.
Architecturally, the model exploits a DINOv2 Transformer encoder and a DPT head without the introduction of domain-specific depth blocks or dependence on camera intrinsics at inference time (Yang et al., 2024).
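The two teacher-training loss terms can be sketched as follows. This is an illustrative NumPy version, not the official implementation: the scale/shift alignment is solved in closed form by least squares, and `pred` and `gt` are assumed to be same-shaped inverse-depth maps.

```python
import numpy as np

def ssi_loss(pred, gt):
    """Scale- and shift-invariant loss (L_ssi, illustrative): align pred to
    gt with a least-squares scale and shift, then take the mean absolute
    residual, so predictions are penalized only up to an affine ambiguity."""
    p, g = pred.ravel(), gt.ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return np.mean(np.abs(s * p + t - g))

def gradient_matching_loss(pred, gt):
    """Gradient matching term (L_gm, illustrative): penalize differences in
    horizontal/vertical spatial gradients to keep depth boundaries sharp."""
    dx = np.abs(np.diff(pred, axis=1) - np.diff(gt, axis=1)).mean()
    dy = np.abs(np.diff(pred, axis=0) - np.diff(gt, axis=0)).mean()
    return dx + dy
```

Because L_ssi is invariant to any affine transform of the prediction, a map that is a scaled and shifted copy of the ground truth incurs (numerically) zero loss.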
2. Dataset Protocols and Outdoor Benchmarking
Metric-Outdoor models are specifically adapted to outdoor domains during fine-tuning. Key properties:
- Training data: Virtual KITTI 2 is the primary dataset for metric adaptation, providing photo-realistic synthetic street scenes with ground-truth metric depth (depth up to 80 m).
- Standard input: RGB images, typically KITTI resolution (e.g., 1242×375 for autonomous driving), with no modification to intrinsics required.
- Test-time protocols: For most benchmarks, the predicted per-pixel depth map is compared against ground-truth in meters after optional depth clipping (e.g., [1, 60]~m).
The pipeline produces dense metric depth maps suitable for 3D geometric back-projection, semantic 3D understanding, and direct integration into multi-modal object detection frameworks, such as pseudo-LiDAR downstream tasks (Ajadalu, 7 Jan 2026).
3. Evaluation Metrics and Empirical Performance
Common Metrics
Depth Anything V2 Metric-Outdoor, as per canonical evaluation, is assessed on:
- Absolute Relative Error (AbsRel): (1/N) Σᵢ |d̂ᵢ − dᵢ| / dᵢ
- Root Mean Square Error (RMSE): √[(1/N) Σᵢ (d̂ᵢ − dᵢ)²]
- Log RMSE: √[(1/N) Σᵢ (log d̂ᵢ − log dᵢ)²]
- Threshold accuracy (δₙ): Fraction of pixels where max(d̂ᵢ/dᵢ, dᵢ/d̂ᵢ) < 1.25ⁿ, for thresholds n ∈ {1, 2, 3}
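These metrics can be computed as in the following minimal NumPy sketch, where `pred` and `gt` are metric depth maps in meters and the optional test-time depth clipping from Section 2 is applied as a validity mask:

```python
import numpy as np

def depth_metrics(pred, gt, d_min=1.0, d_max=60.0):
    """Standard monocular depth metrics over valid (clipped) pixels."""
    mask = (gt >= d_min) & (gt <= d_max)   # clip range, e.g. [1, 60] m
    p, g = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(p - g) / g)
    rmse = np.sqrt(np.mean((p - g) ** 2))
    log_rmse = np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2))
    # Threshold accuracies delta_n: max(pred/gt, gt/pred) < 1.25^n
    ratio = np.maximum(p / g, g / p)
    deltas = {f"delta{n}": np.mean(ratio < 1.25 ** n) for n in (1, 2, 3)}
    return {"AbsRel": abs_rel, "RMSE": rmse, "LogRMSE": log_rmse, **deltas}
```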
KITTI Outdoor (Metric Fine-tuned)
DINOv2-L student, fine-tuned on Virtual KITTI 2, yields:
| Metric | Value |
|---|---|
| AbsRel | 0.045 |
| RMSE (m) | 1.861 |
| Log RMSE | 0.067 |
| δ₁ | 0.983 |
| δ₂ | 0.998 |
| δ₃ | 1.000 |
This is state-of-the-art among foundation-model approaches for outdoor metric depth (Yang et al., 2024).
Downstream 3D Detection (Pseudo-LiDAR)
When used as the depth backbone in a PointRCNN pipeline on KITTI (val, moderate difficulty, IoU = 0.7):
| Method | AP_BEV (%) | AP_3D (%) |
|---|---|---|
| NeWCRFs + grayscale | 11.72 | 10.50 |
| DAV2 Metric-Outdoor (Base) | 11.15 | 9.79 |
NeWCRFs outperforms DAV2 by approximately 0.7 percentage points in AP_3D (Ajadalu, 7 Jan 2026). A coarse object-centric "depth correctness" measure, the percentage of boxes whose predicted depth lies within a fixed tolerance (in meters) of ground truth, shows that DAV2 achieves 33.4% for cars at 80 m, with lower rates for pedestrians and cyclists.
Wildlife Camera Trap Benchmark
In ground-truthed wildlife camera trap scenes, employing the recommended scale parameter and median-based extraction, Depth Anything V2 achieves:
| Method | MAE (m) | Corr | RelErr | RMSE (m) | Speed (s/img) |
|---|---|---|---|---|---|
| DAV2 (median) | 0.454 | 0.962 | 0.211 | 0.593 | 0.22 |
| ML Depth Pro | 1.127 | 0.931 | 0.336 | 1.387 | 0.65 |
| ZoeDepth | 3.087 | 0.625 | 1.068 | 4.038 | 0.17 |
| Metric3D v2 | 0.867 | 0.974 | 0.285 | 0.998 | 0.56 |
Median over bounding boxes is recommended to mitigate outlier errors (Niccoli et al., 6 Oct 2025).
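The median-based extraction recommended above can be implemented in a few lines. A sketch with a hypothetical helper name, assuming a per-pixel metric depth map and pixel-coordinate detection boxes:

```python
import numpy as np

def animal_distance(depth_map, box):
    """Median depth inside a detection box (x1, y1, x2, y2). The median
    suppresses outlier pixels inside the box (e.g. vegetation in front of,
    or background visible around, the animal)."""
    x1, y1, x2, y2 = box
    patch = depth_map[y1:y2, x1:x2]
    return float(np.median(patch))
```

A mean over the same patch would be pulled toward distant background pixels leaking into the box, which is why the median is preferred in vegetation-heavy scenes.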
Panoramic (360°) Depth Estimation
On the Metropolis panoramic outdoor benchmark:
| Model | AbsRel | MAE | RMSE | RMSE log | δ₁ |
|---|---|---|---|---|---|
| DAV2 (L) | 0.2374 | 6.98 | 13.24 | 0.1565 | 64.59% |
| PanDA (L) | 0.3219 | 6.99 | 11.40 | 0.2020 | 51.91% |
| DA360 (L) | 0.2011 | 5.48 | 11.40 | 0.1413 | 76.36% |
DA360, an adaptation of DAV2 with an explicit shift module and circular padding, outperforms the base model by 15.3% (AbsRel) and 11.8 pp (δ₁) (Jiang et al., 28 Dec 2025).
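The circular padding used by DA360 exploits the fact that an equirectangular panorama is periodic in the width dimension: wrapping columns across the left/right border lets convolutions see continuous content across the 360° seam. A minimal sketch of the idea (illustrative, not DA360's code):

```python
import numpy as np

def circular_pad_width(img, pad):
    """Pad an (H, W) or (H, W, C) equirectangular image horizontally by
    wrapping columns, so the 360-degree seam is treated as continuous.
    Height is NOT periodic, so only the width axis is wrapped."""
    return np.concatenate([img[:, -pad:], img, img[:, :pad]], axis=1)
```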
4. Variants, Benchmarking, and Transfer
Depth Anything V2 underpins several subsequent specialized models:
- Depth Any Canopy (DAC): Fine-tuned for tree canopy height estimation from aerial/satellite RGB (Cambrin et al., 2024). After fine-tuning, DAC-S reduces mean absolute error by 37–42% and improves IoU by 28–42% relative to SSL-H while using fewer compute resources.
- Pseudo-LiDAR applications: Direct point-cloud back-projection via the standard pinhole model, with pre-inference depth clipping and optional secondary channels (grayscale, segmentation), enables 3D object detection pipelines.
- DA360: Adapts DAV2 for equirectangular panoramic images using shift learning and circular padding, enabling robust 360° metric depth with strong outdoor generalization.
- Wildlife/Ecological Camera-trap: Validated for reliable distance tagging (sub-meter MAE) using a field-calibrated, fixed scale factor, and median aggregation to suppress vegetation outliers.
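The pseudo-LiDAR back-projection mentioned above follows the standard pinhole model. A minimal sketch, assuming known intrinsics (fx, fy, cx, cy) and a metric depth map in meters, with the pre-inference depth clipping applied as a range filter:

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy, d_max=80.0):
    """Back-project an (H, W) metric depth map to an (N, 3) camera-frame
    point cloud via the pinhole model, discarding points beyond d_max."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] <= d_max]
```

The resulting point cloud can be fed directly to LiDAR-style 3D detectors such as PointRCNN, which is the pseudo-LiDAR setting evaluated in Section 3.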
5. Analysis, Limitations, and Best Practices
Generalization and Robustness
DAV2 demonstrates high accuracy and computational efficiency (0.22 s per image at ~2K resolution on GPU), supporting real-time operation in most embedded and robotics applications. Its success arises from the staged pretraining with synthetic depth, massive real-image pseudo-labels, and lightweight, dataset-specific fine-tuning, favoring both generalization and domain-specific adaptation.
Failure Modes and Open Challenges
- Real-to-synthetic domain shift: Metric fine-tuning occurs predominantly on synthetic, photo-realistic datasets (e.g., VKITTI2), which still lack natural domain variability (crowds, weather).
- Long-range and rare outdoor geometries: Occasional failure under extreme weather and rare scene types is observed; future improvements require expanded photorealistic data and domain adaptation (Yang et al., 2024).
- Backbone sensitivity: Downstream pseudo-LiDAR 3D detection is more sensitive to backbone geometric fidelity than to auxiliary cues such as semantic masks or intensity channels (Ajadalu, 7 Jan 2026).
Practical Deployment
- Outdoor calibration: Use the default scale parameter unless per-camera recalibration is feasible, preferably via ground-truthed patterns at known distances (Niccoli et al., 6 Oct 2025).
- Robust extraction: Median pooling within semantic or geometric regions is recommended over means to minimize outlier impact (verified for vegetation-heavy scenes).
- Resource efficiency: Small and base variants (DA-S, DA-B) offer a favorable trade-off between GFLOPs and accuracy (e.g., DA-S: 24.8M params, 115 GFLOPs, zero-shot outdoor MAE ~1 m) (Cambrin et al., 2024).
6. Related Models and Comparative Landscape
Relative to contemporaneous and related approaches:
- Metric3Dv2: Employs canonical camera space transformation for zero-shot metric alignment across arbitrary cameras; achieves slightly lower AbsRel and RMSE on outdoor benchmarks at the cost of requiring explicit camera intrinsics (Hu et al., 2024).
- SMDepth: Variation-based binning and divide-and-conquer domain estimation enable consistent cross-dataset accuracy, particularly for multi-camera, multi-scene pipelines (Liu et al., 2024).
- ZoeDepth: Dual relative/metric regime with a metric bins module and automatic routing provides robust, high-accuracy zero-shot transfer but lags DAV2 in downstream real-world 3D detection and wildlife settings (Bhat et al., 2023, Niccoli et al., 6 Oct 2025).
- Metric-Solver: Sliding anchor representations support open-range metric depth, enabling dynamic adjustment to context-specific outdoor scene scale (Wen et al., 16 Apr 2025).
- Diffusion-based (MetricGold): Latent diffusion models with log-metric VAE encoding yield improved qualitative detail and scale stability, approaching DAV2’s quantitative metrics, but incur 10× slower inference and require re-training on synthetic data (Shah et al., 2024).
7. Summary Table: Outdoor Scene Performance
| Model | AbsRel (KITTI) | RMSE (KITTI, m) | Wildlife MAE (m) | Pseudo-LiDAR AP_3D (%) | Panoramic AbsRel (Metropolis) |
|---|---|---|---|---|---|
| DAV2 (L) | 0.045 | 1.86 | 0.454 | 9.79 | 0.2374 |
| ZoeDepth | 0.054 | – | 3.087 | – | – |
| Metric3D v2 | 0.119–0.201 | 2.5–7.26 | 0.867 | – | – |
| DA360 (L) | – | – | – | – | 0.2011 |
Values as reported in (Yang et al., 2024, Ajadalu, 7 Jan 2026, Niccoli et al., 6 Oct 2025, Jiang et al., 28 Dec 2025).
Depth Anything V2 Metric-Outdoor constitutes a practical, efficient, and accurate backbone for monocular metric depth estimation and its downstream applications in outdoor vision, widely validated both as a standalone metric estimator and within broader 3D and remote sensing systems. Its modularity and scale-agnostic design make it a strong reference baseline for both academic research and field deployment.