Depth Anything V2 Metric-Outdoor
- Depth Anything V2 Metric-Outdoor is a monocular metric depth estimation model for outdoor scenes that leverages synthetic pretraining, massive pseudo-labeling, and targeted fine-tuning.
- It utilizes a DINOv2 vision transformer backbone with a DPT-style decoder to achieve high-resolution, low-latency inference and state-of-the-art results in key outdoor benchmarks.
- The model is applicable to various domains including autonomous driving, remote sensing, and wildlife monitoring, providing dense, accurate metric depth maps from single RGB images.
Depth Anything V2 Metric-Outdoor is a monocular metric depth estimation model optimized for outdoor scene understanding by leveraging a staged synthetic-pretraining and large-scale pseudo-labeling pipeline, followed by targeted metric fine-tuning. It represents an influential, highly efficient approach for generating dense metric depth maps from single RGB images in outdoor imagery, underpinned by the DINOv2 vision transformer backbone and DPT-style decoder. The model family covers a range of scales (from ~25M to 1.3B parameters), supporting high-resolution, low-latency inference and demonstrating robust generalization across benchmarks and applications including autonomous driving, remote sensing, and ecological monitoring.
1. Training Framework and Architectural Overview
Depth Anything V2 is constructed around a three-stage learning pipeline:
- Synthetic-Only Teacher Training: The initial teacher model is trained purely on synthetic datasets (Hypersim, Virtual KITTI 2, BlendedMVS, IRS, TartanAir; ~595K images), where ground-truth metric depth is perfectly known. The backbone is DINOv2-G (1.3B params), coupled to a DPT decoder. Two primary loss terms are used: a scale- and shift-invariant loss (L_ssi) and a gradient matching term (L_gm) to promote sharp boundaries.
- Massive Pseudo-Label Generation: The frozen teacher infers inverse-depth on 62M web-scale, unlabeled real images spanning multiple domains, producing large-scale pseudo-labels for subsequent distillation.
- Student Distillation and Metric Fine-Tuning: Lightweight student models (DINOv2-S, -B, -L, -G) are trained to regress the teacher's pseudo-labels with loss masking (top 10% high residual pixels excluded), enforcing robustness to label noise. To enable metric depth prediction for outdoor scenes, these students are then fine-tuned on Virtual KITTI 2 using a direct loss on meter-valued depth, optionally augmented with a scale-invariant penalty.
Architecturally, the model exploits a DINOv2 Transformer encoder and a DPT head without the introduction of domain-specific depth blocks or dependence on camera intrinsics at inference time (Yang et al., 2024).
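The two teacher-training loss terms can be sketched as follows. This is an illustrative NumPy version, not the official implementation: the scale/shift alignment is solved in closed form by least squares, and `pred` and `gt` are assumed to be same-shaped inverse-depth maps.

```python
import numpy as np

def ssi_loss(pred, gt):
    """Scale- and shift-invariant loss (L_ssi, illustrative): align pred to
    gt with a least-squares scale and shift, then take the mean absolute
    residual, so predictions are penalized only up to an affine ambiguity."""
    p, g = pred.ravel(), gt.ravel()
    A = np.stack([p, np.ones_like(p)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, g, rcond=None)
    return np.mean(np.abs(s * p + t - g))

def gradient_matching_loss(pred, gt):
    """Gradient matching term (L_gm, illustrative): penalize differences in
    horizontal/vertical spatial gradients to keep depth boundaries sharp."""
    dx = np.abs(np.diff(pred, axis=1) - np.diff(gt, axis=1)).mean()
    dy = np.abs(np.diff(pred, axis=0) - np.diff(gt, axis=0)).mean()
    return dx + dy
```

Because L_ssi is invariant to any affine transform of the prediction, a map that is a scaled and shifted copy of the ground truth incurs (numerically) zero loss.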
2. Dataset Protocols and Outdoor Benchmarking
Metric-Outdoor models are specifically adapted to outdoor domains during fine-tuning. Key properties:
- Training data: Virtual KITTI 2 is the primary dataset for metric adaptation, providing photo-realistic synthetic street scenes with ground-truth metric depth (depth up to 80 m).
- Standard input: RGB images, typically KITTI resolution (e.g., 1242×375 for autonomous driving), with no modification to intrinsics required.
- Test-time protocols: For most benchmarks, the predicted per-pixel depth map is compared against ground-truth in meters after optional depth clipping (e.g., [1, 60]~m).
The pipeline produces dense metric depth maps suitable for 3D geometric back-projection, semantic 3D understanding, and direct integration into multi-modal object detection frameworks, such as pseudo-LiDAR downstream tasks (Ajadalu, 7 Jan 2026).
3. Evaluation Metrics and Empirical Performance
Common Metrics
Depth Anything V2 Metric-Outdoor, as per canonical evaluation, is assessed on:
- Absolute Relative Error (AbsRel): (1/N) Σᵢ |d̂ᵢ − dᵢ| / dᵢ
- Root Mean Square Error (RMSE): √[(1/N) Σᵢ (d̂ᵢ − dᵢ)²]
- Log RMSE: √[(1/N) Σᵢ (log d̂ᵢ − log dᵢ)²]
- Threshold accuracy (δₙ): Fraction of pixels where max(d̂ᵢ/dᵢ, dᵢ/d̂ᵢ) < 1.25ⁿ, for thresholds n ∈ {1, 2, 3}
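These metrics can be computed as in the following minimal NumPy sketch, where `pred` and `gt` are metric depth maps in meters and the optional test-time depth clipping from Section 2 is applied as a validity mask:

```python
import numpy as np

def depth_metrics(pred, gt, d_min=1.0, d_max=60.0):
    """Standard monocular depth metrics over valid (clipped) pixels."""
    mask = (gt >= d_min) & (gt <= d_max)   # clip range, e.g. [1, 60] m
    p, g = pred[mask], gt[mask]
    abs_rel = np.mean(np.abs(p - g) / g)
    rmse = np.sqrt(np.mean((p - g) ** 2))
    log_rmse = np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2))
    # Threshold accuracies delta_n: max(pred/gt, gt/pred) < 1.25^n
    ratio = np.maximum(p / g, g / p)
    deltas = {f"delta{n}": np.mean(ratio < 1.25 ** n) for n in (1, 2, 3)}
    return {"AbsRel": abs_rel, "RMSE": rmse, "LogRMSE": log_rmse, **deltas}
```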
KITTI Outdoor (Metric Fine-tuned)
DINOv2-L student, fine-tuned on Virtual KITTI 2, yields:
| Metric | Value |
|---|---|
| AbsRel | 0.045 |
| RMSE (m) | 1.861 |
| Log RMSE | 0.067 |
| δ₁ | 0.983 |
| δ₂ | 0.998 |
| δ₃ | 1.000 |
This is state-of-the-art among foundation-model approaches for outdoor metric depth (Yang et al., 2024).
Downstream 3D Detection (Pseudo-LiDAR)
When used as the depth backbone in a PointRCNN pipeline on KITTI (val, moderate difficulty, IoU = 0.7):
| Method | AP_BEV (%) | AP_3D (%) |
|---|---|---|
| NeWCRFs + grayscale | 11.72 | 10.50 |
| DAV2 Metric-Outdoor (Base) | 11.15 | 9.79 |
NeWCRFs outperforms DAV2 by approximately 0.7 percentage points in AP_3D (Ajadalu, 7 Jan 2026). A coarse object-centric "depth correctness" measure, the percentage of boxes whose predicted depth lies within a fixed tolerance (in meters) of ground truth, shows that DAV2 achieves 33.4% for cars at 80 m, with lower rates for pedestrians and cyclists.
Wildlife Camera Trap Benchmark
In ground-truthed wildlife camera trap scenes, employing the recommended scale parameter and median-based extraction, Depth Anything V2 achieves:
| Method | MAE (m) | Corr | RelErr | RMSE (m) | Speed (s/img) |
|---|---|---|---|---|---|
| DAV2 (median) | 0.454 | 0.962 | 0.211 | 0.593 | 0.22 |
| ML Depth Pro | 1.127 | 0.931 | 0.336 | 1.387 | 0.65 |
| ZoeDepth | 3.087 | 0.625 | 1.068 | 4.038 | 0.17 |
| Metric3D v2 | 0.867 | 0.974 | 0.285 | 0.998 | 0.56 |
Median over bounding boxes is recommended to mitigate outlier errors (Niccoli et al., 6 Oct 2025).
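The median-based extraction recommended above can be implemented in a few lines. A sketch with a hypothetical helper name, assuming a per-pixel metric depth map and pixel-coordinate detection boxes:

```python
import numpy as np

def animal_distance(depth_map, box):
    """Median depth inside a detection box (x1, y1, x2, y2). The median
    suppresses outlier pixels inside the box (e.g. vegetation in front of,
    or background visible around, the animal)."""
    x1, y1, x2, y2 = box
    patch = depth_map[y1:y2, x1:x2]
    return float(np.median(patch))
```

A mean over the same patch would be pulled toward distant background pixels leaking into the box, which is why the median is preferred in vegetation-heavy scenes.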
Panoramic (360°) Depth Estimation
On the Metropolis panoramic outdoor benchmark:
| Model | AbsRel | MAE | RMSE | RMSE log | δ₁ |
|---|---|---|---|---|---|
| DAV2 (L) | 0.2374 | 6.98 | 13.24 | 0.1565 | 64.59% |
| PanDA (L) | 0.3219 | 6.99 | 11.40 | 0.2020 | 51.91% |
| DA360 (L) | 0.2011 | 5.48 | 11.40 | 0.1413 | 76.36% |
DA360, an adaptation of DAV2 with an explicit shift module and circular padding, outperforms the base model by 15.3% (AbsRel) and 11.8 pp (δ₁) (Jiang et al., 28 Dec 2025).
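The circular padding used by DA360 exploits the fact that an equirectangular panorama is periodic in the width dimension: wrapping columns across the left/right border lets convolutions see continuous content across the 360° seam. A minimal sketch of the idea (illustrative, not DA360's code):

```python
import numpy as np

def circular_pad_width(img, pad):
    """Pad an (H, W) or (H, W, C) equirectangular image horizontally by
    wrapping columns, so the 360-degree seam is treated as continuous.
    Height is NOT periodic, so only the width axis is wrapped."""
    return np.concatenate([img[:, -pad:], img, img[:, :pad]], axis=1)
```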
4. Variants, Benchmarking, and Transfer
Depth Anything V2 underpins several subsequent specialized models:
- Depth Any Canopy (DAC): Fine-tuned for tree canopy height estimation from aerial/satellite RGB (Cambrin et al., 2024). After fine-tuning, DAC-S reduces mean absolute error by 37–42% and improves IoU by 28–42% relative to SSL-H while using fewer compute resources.
- Pseudo-LiDAR applications: Direct point-cloud back-projection via the standard pinhole model, with pre-inference depth clipping and optional secondary channels (grayscale, segmentation), enables 3D object detection pipelines.
- DA360: Adapts DAV2 for equirectangular panoramic images using shift learning and circular padding, enabling robust 360° metric depth with strong outdoor generalization.
- Wildlife/Ecological Camera-trap: Validated for reliable distance tagging (sub-meter MAE) using a field-calibrated, fixed scale factor, and median aggregation to suppress vegetation outliers.
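The pseudo-LiDAR back-projection mentioned above follows the standard pinhole model. A minimal sketch, assuming known intrinsics (fx, fy, cx, cy) and a metric depth map in meters, with the pre-inference depth clipping applied as a range filter:

```python
import numpy as np

def depth_to_pointcloud(depth, fx, fy, cx, cy, d_max=80.0):
    """Back-project an (H, W) metric depth map to an (N, 3) camera-frame
    point cloud via the pinhole model, discarding points beyond d_max."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return pts[pts[:, 2] <= d_max]
```

The resulting point cloud can be fed directly to LiDAR-style 3D detectors such as PointRCNN, which is the pseudo-LiDAR setting evaluated in Section 3.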
5. Analysis, Limitations, and Best Practices
Generalization and Robustness
DAV2 demonstrates high accuracy and computational efficiency (0.22 s per image at ~2K resolution on GPU), supporting real-time operation in most embedded and robotics applications. Its success arises from the staged pretraining with synthetic depth, massive real-image pseudo-labels, and lightweight, dataset-specific fine-tuning, favoring both generalization and domain-specific adaptation.
Failure Modes and Open Challenges
- Real-to-synthetic domain shift: Metric fine-tuning occurs predominantly on synthetic, photo-realistic datasets (e.g., VKITTI2), which still lack natural domain variability (crowds, weather).
- Long-range and rare outdoor geometries: Occasional failure under extreme weather and rare scene types is observed; future improvements require expanded photorealistic data and domain adaptation (Yang et al., 2024).
- Backbone sensitivity: Downstream pseudo-LiDAR 3D detection is more sensitive to backbone geometric fidelity than to auxiliary cues such as semantic masks or intensity channels (Ajadalu, 7 Jan 2026).
Practical Deployment
- Outdoor calibration: Use the default scale parameter unless per-camera recalibration is feasible, preferably via ground-truthed patterns at known distances (Niccoli et al., 6 Oct 2025).
- Robust extraction: Median pooling within semantic or geometric regions is recommended over means to minimize outlier impact (verified for vegetation-heavy scenes).
- Resource efficiency: Small and base variants (DA-S, DA-B) offer a favorable trade-off between GFLOPs and accuracy (e.g., DA-S: 24.8M params, 115 GFLOPs, zero-shot outdoor MAE ~1 m) (Cambrin et al., 2024).
6. Related Models and Comparative Landscape
Relative to contemporaneous and related approaches:
- Metric3Dv2: Employs canonical camera space transformation for zero-shot metric alignment across arbitrary cameras; achieves slightly lower AbsRel and RMSE on outdoor benchmarks at the cost of requiring explicit camera intrinsics (Hu et al., 2024).
- SMDepth: Variation-based binning and divide-and-conquer domain estimation enable consistent cross-dataset accuracy, particularly for multi-camera, multi-scene pipelines (Liu et al., 2024).
- ZoeDepth: Dual relative/metric regime with a metric bins module and automatic routing provides robust, high-accuracy zero-shot transfer but lags DAV2 in downstream real-world 3D detection and wildlife settings (Bhat et al., 2023, Niccoli et al., 6 Oct 2025).
- Metric-Solver: Sliding anchor representations support open-range metric depth, enabling dynamic adjustment to context-specific outdoor scene scale (Wen et al., 16 Apr 2025).
- Diffusion-based (MetricGold): Latent diffusion models with log-metric VAE encoding yield improved qualitative detail and scale stability, approaching DAV2’s quantitative metrics, but incur 10× slower inference and require re-training on synthetic data (Shah et al., 2024).
7. Summary Table: Outdoor Scene Performance
| Model | AbsRel (KITTI) | RMSE (KITTI, m) | Wildlife MAE (m) | Pseudo-LiDAR AP_3D (%) | Panoramic AbsRel (Metropolis) |
|---|---|---|---|---|---|
| DAV2 (L) | 0.045 | 1.86 | 0.454 | 9.79 | 0.2374 |
| ZoeDepth | 0.054 | – | 3.087 | – | – |
| Metric3D v2 | 0.119–0.201 | 2.5–7.26 | 0.867 | – | – |
| DA360 (L) | – | – | – | – | 0.2011 |
Values as reported in (Yang et al., 2024, Ajadalu, 7 Jan 2026, Niccoli et al., 6 Oct 2025, Jiang et al., 28 Dec 2025).
Depth Anything V2 Metric-Outdoor constitutes a practical, efficient, and accurate backbone for monocular metric depth estimation and its downstream applications in outdoor vision, widely validated both as a standalone metric estimator and within broader 3D and remote sensing systems. Its modularity and scale-agnostic design make it a strong reference baseline for both academic research and field deployment.