
Depth Anything V2 Metric-Outdoor

Updated 14 January 2026
  • Depth Anything V2 Metric-Outdoor is a monocular metric depth estimation model for outdoor scenes that leverages synthetic pretraining, massive pseudo-labeling, and targeted fine-tuning.
  • It utilizes a DINOv2 vision transformer backbone with a DPT-style decoder to achieve high-resolution, low-latency inference and state-of-the-art results in key outdoor benchmarks.
  • The model is applicable to various domains including autonomous driving, remote sensing, and wildlife monitoring, providing dense, accurate metric depth maps from single RGB images.

Depth Anything V2 Metric-Outdoor is a monocular metric depth estimation model optimized for outdoor scene understanding by leveraging a staged synthetic-pretraining and large-scale pseudo-labeling pipeline, followed by targeted metric fine-tuning. It represents an influential, highly efficient approach for generating dense metric depth maps from single RGB images in outdoor imagery, underpinned by the DINOv2 vision transformer backbone and DPT-style decoder. The model family covers a range of scales (from ~25M to 1.3B parameters), supporting high-resolution, low-latency inference and demonstrating robust generalization across benchmarks and applications including autonomous driving, remote sensing, and ecological monitoring.

1. Training Framework and Architectural Overview

Depth Anything V2 is constructed around a three-stage learning pipeline:

  1. Synthetic-Only Teacher Training: The initial teacher model is trained purely on synthetic datasets (Hypersim, Virtual KITTI 2, BlendedMVS, IRS, TartanAir; ~595K images), where ground-truth metric depth is perfectly known. The backbone is DINOv2-G (1.3B params), coupled to a DPT decoder. Two primary loss terms are used: a scale- and shift-invariant log error ($\mathcal{L}_{ssi}$) and a gradient matching term ($\mathcal{L}_{gm}$) that promotes sharp boundaries.
  2. Massive Pseudo-Label Generation: The frozen teacher infers inverse-depth on 62M web-scale, unlabeled real images spanning multiple domains, producing large-scale pseudo-labels for subsequent distillation.
  3. Student Distillation and Metric Fine-Tuning: Lightweight student models (DINOv2-S, -B, -L, -G) are trained to regress the teacher's pseudo-labels with loss masking (the top 10% of high-residual pixels excluded), enforcing robustness to label noise. To enable metric depth prediction for outdoor scenes, these students are then fine-tuned on Virtual KITTI 2 using a direct $L_1$ loss on meter-valued depth, optionally augmented with a scale-invariant penalty.

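The two teacher-training losses can be sketched in NumPy. This is a simplified, single-scale illustration (the published losses operate on inverse depth and at multiple scales); the function names are ours, not from the released code.

```python
import numpy as np

def align(d):
    """Normalize a map to be invariant to affine scale and shift."""
    t = np.median(d)             # shift estimate
    s = np.mean(np.abs(d - t))   # scale estimate
    return (d - t) / s

def loss_ssi(pred, gt):
    """Scale- and shift-invariant error between aligned log-depth maps."""
    return np.mean(np.abs(align(np.log(pred)) - align(np.log(gt))))

def loss_gm(pred, gt):
    """Gradient-matching term: penalizes gradients of the residual map,
    encouraging sharp, well-localized depth discontinuities."""
    r = align(np.log(pred)) - align(np.log(gt))
    return np.mean(np.abs(np.diff(r, axis=0))) + np.mean(np.abs(np.diff(r, axis=1)))
```

Note that a global rescaling of the prediction (e.g., `3.0 * gt`) leaves both losses at zero, which is exactly why a separate metric fine-tuning stage with an $L_1$ loss in meters is needed.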
Architecturally, the model exploits a DINOv2 Transformer encoder and a DPT head without the introduction of domain-specific depth blocks or dependence on camera intrinsics at inference time (Yang et al., 2024).

2. Dataset Protocols and Outdoor Benchmarking

Metric-Outdoor models are specifically adapted to outdoor domains during fine-tuning. Key properties:

  • Training data: Virtual KITTI 2 is the primary dataset for metric adaptation, providing photo-realistic synthetic street scenes with ground-truth metric depth (depth up to 80 m).
  • Standard input: RGB images, typically KITTI resolution (e.g., 1242×375 for autonomous driving), with no modification to intrinsics required.
  • Test-time protocols: For most benchmarks, the predicted per-pixel depth map $D(u,v)$ is compared against ground truth in meters after optional depth clipping (e.g., [1, 60] m).

The pipeline produces dense metric depth maps suitable for 3D geometric back-projection, semantic 3D understanding, and direct integration into multi-modal object detection frameworks, such as pseudo-LiDAR downstream tasks (Ajadalu, 7 Jan 2026).
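The pseudo-LiDAR conversion mentioned above can be sketched as a pinhole back-projection with depth clipping. The intrinsics and clip range below are illustrative values, not a prescribed configuration.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, fx, fy, cx, cy, clip=(1.0, 60.0)):
    """Back-project an (H, W) metric depth map into an (N, 3) point cloud.

    Pixels outside the clip range are discarded, mirroring the
    pre-inference depth clipping applied before 3D detection."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]                       # pixel coordinates
    mask = (depth >= clip[0]) & (depth <= clip[1])  # keep in-range depths
    z = depth[mask]
    x = (u[mask] - cx) * z / fx                     # pinhole model
    y = (v[mask] - cy) * z / fy
    return np.stack([x, y, z], axis=1)
```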

3. Evaluation Metrics and Empirical Performance

Common Metrics

Depth Anything V2 Metric-Outdoor is assessed with the canonical depth metrics:

  • Absolute Relative Error (AbsRel): $\text{AbsRel} = \frac{1}{N}\sum_i \frac{|d_i - d_i^*|}{d_i^*}$
  • Root Mean Square Error (RMSE): $\text{RMSE} = \sqrt{\frac{1}{N}\sum_i (d_i - d_i^*)^2}$
  • Log RMSE: $\text{RMSE}_{\log} = \sqrt{\frac{1}{N}\sum_i (\log d_i - \log d_i^*)^2}$
  • Threshold accuracy ($\delta_t$): fraction of pixels where $\max\left(\frac{d_i}{d_i^*}, \frac{d_i^*}{d_i}\right) < t$ for thresholds $t = 1.25, 1.25^2, 1.25^3$
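The four metrics above translate directly into NumPy; `pred` and `gt` are flat arrays of positive depths in meters:

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard monocular depth evaluation metrics."""
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(pred) - np.log(gt)) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)          # symmetric ratio
    deltas = {f"delta_{k}": np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)}
    return {"AbsRel": abs_rel, "RMSE": rmse, "RMSE_log": rmse_log, **deltas}
```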

KITTI Outdoor (Metric Fine-tuned)

DINOv2-L student, fine-tuned on Virtual KITTI 2, yields:

| Metric | Value |
|---|---|
| AbsRel | 0.045 |
| RMSE (m) | 1.861 |
| Log RMSE | 0.067 |
| $\delta_1$ | 0.983 |
| $\delta_2$ | 0.998 |
| $\delta_3$ | 1.000 |

This is state-of-the-art among foundation-model approaches for outdoor metric depth (Yang et al., 2024).

Downstream 3D Detection (Pseudo-LiDAR)

When used as the depth backbone in a PointRCNN pipeline on KITTI (val, moderate, IoU = 0.7):

| Method | $AP_{BEV}$ (%) | $AP_{3D}$ (%) |
|---|---|---|
| NeWCRFs + grayscale | 11.72 | 10.50 |
| DAV2 Metric-Outdoor (Base) | 11.15 | 9.79 |

NeWCRFs outperforms DAV2 by approximately 0.7 percentage points in $AP_{3D}$ (Ajadalu, 7 Jan 2026). A coarse object-centric "depth correctness" metric, the percentage of boxes where $|d_\text{pred} - d_\text{gt}| \leq 1.5$ m, shows that DAV2 achieves 33.4% for cars at up to 80 m, with lower values for pedestrians and cyclists.
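The depth-correctness metric is simple to compute from per-box depths; a minimal sketch (function name ours):

```python
import numpy as np

def depth_correctness(d_pred, d_gt, tol=1.5):
    """Percentage of detection boxes whose predicted object depth lies
    within `tol` meters of the ground-truth object depth."""
    d_pred, d_gt = np.asarray(d_pred), np.asarray(d_gt)
    return 100.0 * np.mean(np.abs(d_pred - d_gt) <= tol)
```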

Wildlife Camera Trap Benchmark

In ground-truthed wildlife camera trap scenes, employing the recommended scale parameter $\alpha = 20$ and median-based extraction, Depth Anything V2 achieves:

| Method | MAE (m) | Corr | RelErr | RMSE (m) | Speed (s/img) |
|---|---|---|---|---|---|
| DAV2 (median) | 0.454 | 0.962 | 0.211 | 0.593 | 0.22 |
| ML Depth Pro | 1.127 | 0.931 | 0.336 | 1.387 | 0.65 |
| ZoeDepth | 3.087 | 0.625 | 1.068 | 4.038 | 0.17 |
| Metric3D v2 | 0.867 | 0.974 | 0.285 | 0.998 | 0.56 |

Median over bounding boxes is recommended to mitigate outlier errors (Niccoli et al., 6 Oct 2025).
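The median-based extraction can be sketched as follows: aggregate the depth map over each detection box with a median rather than a mean, so outlier pixels (e.g., vegetation partially occluding the animal) do not dominate. Treating $\alpha$ as a multiplicative scale on the model's relative output is our reading of the calibration, not code from the paper.

```python
import numpy as np

def box_distance(rel_depth, box, alpha=20.0):
    """Median depth inside a bounding box, scaled to meters.

    rel_depth : (H, W) relative depth map from the model
    box       : (x1, y1, x2, y2) pixel coordinates
    alpha     : fixed, field-calibrated scale factor (assumed multiplicative)
    """
    x1, y1, x2, y2 = box
    return alpha * float(np.median(rel_depth[y1:y2, x1:x2]))
```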

Panoramic (360°) Depth Estimation

On the Metropolis panoramic outdoor benchmark:

| Model | AbsRel | MAE | RMSE | RMSE log | $\delta_1$ |
|---|---|---|---|---|---|
| DAV2 (L) | 0.2374 | 6.98 | 13.24 | 0.1565 | 64.59% |
| PanDA (L) | 0.3219 | 6.99 | 11.40 | 0.2020 | 51.91% |
| DA360 (L) | 0.2011 | 5.48 | 11.40 | 0.1413 | 76.36% |

DA360, an adaptation of DAV2 with an explicit shift module and circular padding, outperforms the base model by 15.3% (AbsRel) and 11.8 pp ($\delta_1$) (Jiang et al., 28 Dec 2025).
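Circular padding, one of DA360's two adaptations, can be illustrated with NumPy's wrap mode: the left and right edges of an equirectangular image are physically adjacent at the 0°/360° seam, so convolutions should see wrapped context rather than zeros. This illustrates the idea only, not DA360's actual implementation.

```python
import numpy as np

def circular_pad_width(img, pad):
    """Pad an (H, W) or (H, W, C) equirectangular image horizontally by
    wrapping, so the longitude seam is treated as continuous."""
    pad_spec = [(0, 0)] * img.ndim
    pad_spec[1] = (pad, pad)          # pad only the width (longitude) axis
    return np.pad(img, pad_spec, mode="wrap")
```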

4. Variants, Benchmarking, and Transfer

Depth Anything V2 underpins several subsequent specialized models:

  • Depth Any Canopy (DAC): Fine-tuned for tree canopy height estimation from aerial/satellite RGB (Cambrin et al., 2024). After fine-tuning, DAC-S reduces mean absolute error by 37–42% and improves IoU by 28–42% over SSL-H while using less compute.
  • Pseudo-LiDAR applications: Direct point-cloud back-projection via the standard pinhole model, with pre-inference depth clipping and optional secondary channels (grayscale, segmentation), enables 3D object detection pipelines.
  • DA360: Adapts DAV2 for equirectangular panoramic images using shift learning and circular padding, enabling robust 360° metric depth with strong outdoor generalization.
  • Wildlife/Ecological Camera-trap: Validated for reliable distance tagging (sub-meter MAE) using a field-calibrated, fixed scale factor, and median aggregation to suppress vegetation outliers.

5. Analysis, Limitations, and Best Practices

Generalization and Robustness

DAV2 demonstrates high accuracy and computational efficiency (≈0.22 s per image on a GPU), supporting near-real-time operation in most embedded and robotics applications. Its success arises from staged pretraining on synthetic depth, massive real-image pseudo-labels, and lightweight, dataset-specific fine-tuning, favoring both generalization and domain-specific adaptation.

Failure Modes and Open Challenges

  • Synthetic-to-real domain shift: Metric fine-tuning occurs predominantly on synthetic, photo-realistic datasets (e.g., VKITTI2), which still lack natural domain variability (crowds, weather).
  • Long-range and rare outdoor geometries: Occasional failure under extreme weather and rare scene types is observed; future improvements require expanded photorealistic data and domain adaptation (Yang et al., 2024).
  • Backbone sensitivity: Downstream pseudo-LiDAR 3D detection is more sensitive to backbone geometric fidelity than to auxiliary cues such as semantic masks or intensity channels (Ajadalu, 7 Jan 2026).

Practical Deployment

  • Outdoor calibration: Use the default scale parameter unless per-camera recalibration is feasible, preferably via ground-truthed patterns at known distances (Niccoli et al., 6 Oct 2025).
  • Robust extraction: Median pooling within semantic or geometric regions is recommended over means to minimize outlier impact (verified for vegetation-heavy scenes).
  • Resource efficiency: Small and base variants (DA-S, DA-B) offer a favorable trade-off between GFLOPs and accuracy (e.g., DA-S: 24.8M params, 115 GFLOPs, zero-shot outdoor MAE < 1 m) (Cambrin et al., 2024).

6. Comparison with Related Approaches

Relative to contemporaneous and related approaches:

  • Metric3Dv2: Employs canonical camera space transformation for zero-shot metric alignment across arbitrary cameras; achieves slightly lower AbsRel and RMSE on outdoor benchmarks at the cost of requiring explicit camera intrinsics (Hu et al., 2024).
  • SM$^4$Depth: Variation-based binning and divide-and-conquer domain estimation enable consistent cross-dataset accuracy, particularly for multi-camera, multi-scene pipelines (Liu et al., 2024).
  • ZoeDepth: Dual relative/metric regime with a metric bins module and automatic routing provides robust, high-accuracy zero-shot transfer but lags DAV2 in downstream real-world 3D detection and wildlife settings (Bhat et al., 2023, Niccoli et al., 6 Oct 2025).
  • Metric-Solver: Sliding anchor representations support open-range metric depth, enabling dynamic adjustment to context-specific outdoor scene scale (Wen et al., 16 Apr 2025).
  • Diffusion-based (MetricGold): Latent diffusion models with log-metric VAE encoding yield improved qualitative detail and scale stability, approaching DAV2’s quantitative metrics, but incur 10× slower inference and require re-training on synthetic data (Shah et al., 2024).

7. Summary Table: Outdoor Scene Performance

| Model | AbsRel (KITTI) | RMSE (KITTI, m) | Wildlife MAE (m) | Pseudo-LiDAR $AP_{3D}$ (%) | Panoramic AbsRel (Metropolis) |
|---|---|---|---|---|---|
| DAV2 (L) | 0.045 | 1.86 | 0.454 | 9.79 | 0.2374 |
| ZoeDepth | 0.054 | — | 3.087 | — | — |
| Metric3D v2 | 0.119–0.201 | 2.5–7.26 | 0.867 | — | — |
| DA360 (L) | — | — | — | — | 0.2011 |

Values as reported in (Yang et al., 2024, Ajadalu, 7 Jan 2026, Niccoli et al., 6 Oct 2025, Jiang et al., 28 Dec 2025).

Depth Anything V2 Metric-Outdoor constitutes a practical, efficient, and accurate backbone for monocular metric depth estimation and its downstream applications in outdoor vision, widely validated both as a standalone metric estimator and within broader 3D and remote sensing systems. Its modularity and scale-agnostic design make it a strong reference baseline for both academic research and field deployment.
