Depth Anything 3: Advanced Depth Estimation Model

Updated 14 November 2025
  • Depth Anything 3 is an advanced vision transformer–based framework that unifies any-view geometry prediction with prompt-based metric adaptation.
  • It integrates a dual-prediction head to output dense depth and ray maps, employing token reordering for efficient cross-view reasoning.
  • The model uses a teacher–student curriculum with lightweight adapters, achieving state-of-the-art results across diverse fusion and reconstruction benchmarks.

Depth Anything 3 (DA3) refers to advanced vision transformer–based depth estimation models and associated frameworks that generalize the "Depth Anything" concept across challenging regimes—including any-view geometry prediction, metric depth via prompting, and unsupervised monocular learning in specialized domains. DA3 models unify advances in minimal transformer architectures, prompt-based metric adaptation, and fine-grained local feature refinement for highly robust spatial inference, achieving state-of-the-art performance over a spectrum of visual geometry benchmarks, specialized real-world datasets, and application domains.

1. Model Architecture and Minimal Modeling Principle

The DA3 model family is grounded in the assertion that a single plain Vision Transformer (ViT), typically DINOv2-pretrained, suffices as a universal backbone for visual geometry without architectural specialization. Unlike previous approaches that rely on cost volumes, multi-stage fusions, or complex attention schemas, DA3 leverages a "token reordering" paradigm:

  • For a network of $L$ transformer blocks, the initial $s$ blocks apply standard self-attention independently to each input view's patch tokens.
  • The remaining $g$ blocks alternate between within-view and cross-view self-attention. By strategically reshaping the token tensor across blocks, cross-view reasoning is implemented without architectural modification (a minimal sketch follows this list).
  • When camera poses are provided, a learned "camera token"—generated from intrinsic and extrinsic parameters via a shallow MLP—is prepended to view tokens and included in all attention operations, providing pose conditioning with minimal overhead.
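
The reordering can be made concrete with a short PyTorch-style sketch (a minimal illustration under assumed tensor shapes and module names, not the released implementation): cross-view attention is obtained simply by folding the view axis into the sequence axis before a standard self-attention block, and within-view attention by folding it into the batch axis.

```python
import torch
import torch.nn as nn

class AlternatingAttention(nn.Module):
    """Sketch of token reordering: one plain self-attention block serves both
    within-view and cross-view reasoning purely by reshaping the token tensor."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, cross_view: bool) -> torch.Tensor:
        # tokens: (B, V, N, C) = batch, views, patch tokens per view, channels
        B, V, N, C = tokens.shape
        if cross_view:
            x = tokens.reshape(B, V * N, C)      # all views share one sequence
        else:
            x = tokens.reshape(B * V, N, C)      # each view attends only to itself
        xn = self.norm(x)
        y, _ = self.attn(xn, xn, xn)
        return tokens + y.reshape(B, V, N, C)    # residual connection
```

In the full model, the first $s$ blocks would run with cross_view=False and the later $g$ blocks would alternate the flag; per-view camera tokens, when available, are simply prepended along the token axis before attention.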

The DA3 head consists of a reassembly module (as in DPT) and bifurcated "Dual-DPT" fusion heads that output both:

  • A dense depth map $\hat D \in \mathbb{R}^{H \times W}$
  • A dense ray map $\hat M \in \mathbb{R}^{H \times W \times 6}$, thereby enabling unified geometry representation with maximal parameter sharing (a minimal sketch of such a head follows).
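
A simplified sketch of such a dual head (assumed layer choices and names, not the paper's exact DPT configuration) shows the idea: a shared fused feature map feeds two lightweight branches, one producing the 1-channel depth map and one the 6-channel ray map.

```python
import torch.nn as nn

class DualHead(nn.Module):
    """Sketch of a Dual-DPT-style head: shared fused features feed
    two branches (depth: 1 channel, ray map: 6 channels)."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
            nn.ReLU(inplace=True))
        self.depth_branch = nn.Conv2d(feat_dim, 1, 1)  # \hat D: (B, 1, H, W)
        self.ray_branch = nn.Conv2d(feat_dim, 6, 1)    # \hat M: (B, 6, H, W)

    def forward(self, fused):                          # fused: (B, feat_dim, H, W)
        x = self.shared(fused)
        return self.depth_branch(x), self.ray_branch(x)
```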

2. Depth-Ray Prediction, Supervision, and Loss Design

DA3 employs a depth-ray joint prediction target, wherein each input frame $I_i$ (single or multi-view) yields per-pixel rays $r_i(u,v) = (t_i, d_i(u,v))$, with $t_i$ the camera origin and $d_i(u,v) = R_i K_i^{-1} p$ the ray direction ($p$ being the pixel in homogeneous coordinates). The predicted output comprises:

  • Depth map $\hat D$
  • Ray map $\hat M$

The 3D reconstruction for pixel $(u,v)$ in image $i$ is
$$P_i(u,v) = t_i + D_i(u,v)\, d_i(u,v)$$
(a minimal sketch of this unprojection follows the loss summary below). The total loss aggregates depth, ray, and geometric consistency terms,
$$L = L_{\rm depth} + L_{\rm ray} + L_{\rm pcd} + L_{\rm pose},$$
with each term promoting fidelity in its respective modality:

  • $L_{\rm depth}$: scale-normalizing robust loss, plus gradient matching.
  • $L_{\rm ray}$: $\ell_1$-norm error to ray map supervision.
  • $L_{\rm pcd}$: $\ell_1$ error in 3D point cloud positions (reconstituted from depth and rays).
  • $L_{\rm pose}$: pose loss (when applicable).

All loss terms are weighted equally, and the formulation supports training on both ground-truth and teacher-generated pseudo-labels.
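
As referenced above, the per-pixel unprojection can be written as a short NumPy sketch (array shapes and names are assumptions for illustration):

```python
import numpy as np

def unproject(depth, ray_map):
    """Recover a 3D point map from predicted depth and rays.

    depth:   (H, W)    per-pixel depth D_i(u, v)
    ray_map: (H, W, 6) per-pixel ray: origin t_i in [..., :3],
             direction d_i(u, v) in [..., 3:]
    returns: (H, W, 3) point map P_i(u, v) = t_i + D_i(u, v) * d_i(u, v)
    """
    origin, direction = ray_map[..., :3], ray_map[..., 3:]
    return origin + depth[..., None] * direction

# Example on random predictions for a 4x5 image
points = unproject(np.random.rand(4, 5), np.random.rand(4, 5, 6))
assert points.shape == (4, 5, 3)
```

The point-cloud loss $L_{\rm pcd}$ then compares such reconstituted points against reference points with an $\ell_1$ penalty.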

3. Training Paradigms and Data Strategy

DA3 training utilizes a teacher-student curriculum. The teacher (DA3-Teacher) shares the architecture of the student but is trained exclusively on large-scale synthetic depth datasets, emphasizing scale-shift-invariant exponential depth. The teacher loss combines gradient, alignment, normal, and semantic (sky/object) terms.

For transfer to real-world data (LiDAR, COLMAP reconstructions), the teacher generates dense relative depth $\tilde D$, which is anchored to the available metric reference $D$ via least-squares scale–shift alignment:
$$(\hat s, \hat t) = \arg\min_{s>0,\,t}\ \sum_{p} m_p\,\bigl(s\,\tilde D_p + t - D_p\bigr)^2,$$
where $m_p$ masks pixels with valid reference depth. The aligned depth $D^{T\to M} = \hat s\,\tilde D + \hat t$ serves as pseudo-labels for the student, which is in turn trained on a mixture of synthetic ground truth and these real pseudo-labels. Pose conditioning is toggled stochastically to promote generalization across usage scenarios.
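
This weighted least-squares problem has a closed-form solution via a 2×2 normal-equation system; a NumPy sketch (variable names assumed for illustration):

```python
import numpy as np

def align_scale_shift(rel_depth, metric_depth, mask):
    """Fit (s, t) so that s * rel_depth + t ~= metric_depth over valid pixels,
    then return the aligned pseudo-label."""
    x = rel_depth[mask].ravel()
    y = metric_depth[mask].ravel()
    # Normal equations: [sum(x^2) sum(x); sum(x) n] [s; t] = [sum(x*y); sum(y)]
    A = np.array([[np.dot(x, x), x.sum()],
                  [x.sum(), x.size]])
    b = np.array([np.dot(x, y), y.sum()])
    s, t = np.linalg.solve(A, b)
    # The s > 0 constraint is assumed to hold for well-behaved teacher depth;
    # a fallback (e.g., clipping) would be needed otherwise.
    return s * rel_depth + t, (s, t)

# Example with synthetic data: true scale 2.0, shift 0.5
rel = np.random.rand(64, 64)
metric = 2.0 * rel + 0.5
aligned, (s, t) = align_scale_shift(rel, metric, metric > 0)
```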

Training is conducted on a composite of purely public academic datasets, incorporating synthetic, structure-from-motion, and direct depth measurement modalities.

4. Prompting and Metric Depth via External Sensors

A distinctive DA3 paradigm (Lin et al., 18 Dec 2024) extends the foundation model using external metric prompts (e.g., LiDAR). Rather than solely fine-tuning for scale, this approach fuses a sparse, metrically reliable depth map $L$ (e.g., from commodity LiDAR) into the decoder stages of a ViT+DPT foundation.

Multi-scale prompt fusion is realized as follows, per decoder stage $i$ (a minimal sketch follows the list):

  • Bilinear resizing of $L$ to match the stage resolution, followed by a two-layer convolutional transformation $g(\cdot)$ whose final 1×1 convolution is zero-initialized, yielding the prompt feature $P_i$.
  • Either concatenation-plus-convolution with the image feature $F_i$:

$$F_i' = \sigma(W_f [F_i; P_i] + b_f)$$

or simple additive fusion:

$$F_i' = F_i + W_p P_i$$

  • All prompt fusion blocks are lightweight, incurring ~5.7% compute overhead.
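
A minimal PyTorch sketch of one such fusion block (illustrative names and layer choices, not the released code); the zero-initialized final convolution makes the block an identity mapping at the start of fine-tuning, so the pretrained foundation's behavior is preserved until the prompt pathway is learned:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptFusionBlock(nn.Module):
    """Fuse a sparse metric depth prompt into one decoder stage's features."""
    def __init__(self, feat_dim: int, hidden: int = 32, additive: bool = True):
        super().__init__()
        self.additive = additive
        # g(.): two-layer transform of the resized prompt, final 1x1 conv zero-init
        self.g = nn.Sequential(
            nn.Conv2d(1, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, feat_dim, 1))
        nn.init.zeros_(self.g[-1].weight)
        nn.init.zeros_(self.g[-1].bias)
        if not additive:
            self.fuse = nn.Conv2d(2 * feat_dim, feat_dim, 3, padding=1)

    def forward(self, feat, prompt_depth):
        # feat: (B, C, H_i, W_i); prompt_depth: (B, 1, H, W) metric depth prompt L
        p = F.interpolate(prompt_depth, size=feat.shape[-2:],
                          mode="bilinear", align_corners=False)
        p = self.g(p)                                   # prompt feature P_i
        if self.additive:
            return feat + p                             # F_i' = F_i + W_p P_i
        # concat-then-conv variant, with an assumed ReLU as the nonlinearity sigma
        return torch.relu(self.fuse(torch.cat([feat, p], dim=1)))
```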

Training uses both synthetic and real data, including simulated LiDAR and pseudo-ground-truth generated from high-fidelity neural rendering (e.g., Zip-NeRF), plus precise (planar) ground-truth from hardware sensors (FARO scanner). The loss function supervises predictions against both metric accuracy (from FARO) and high-frequency detail (from Zip-NeRF), using an edge-aware $L_1$ loss and gradient-based regularization.

This design enables DA3 to generate metric-accurate, 4K-resolution depth maps with minimal retraining and high-speed inference (e.g., ~20 FPS at 768×1024 for ViT-L).

5. Lightweight Adaptation and Fine-Tuning Strategies

Domain adaptation for specialized tasks (e.g., endoscopic unsupervised monocular depth estimation, UMDE) employs highly parameter-efficient modules:

  • RVLoRA adapters: Each transformer block adds two "Random-Vector Low-Rank Adaptation" (RVLoRA) modules, which freeze all pretrained weights and instead learn a rank-$r$ update via two small trainable matrices ($A$, $B$) and two frozen random vectors ($a$, $b$), such that:

$$h = W_0 x + \Lambda_b B \Lambda_a A x$$

with only $A$ and $B$ optimized, providing scale adaptability in challenging, unseen depth domains (a sketch follows this list).

  • Res-DSC blocks: Four residual convolutional blocks with depthwise separable convolutions inserted into the transformer hierarchy recover high-frequency local detail omitted by global transformers, restoring edge and texture fidelity without significant parameter count increase.
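
A minimal PyTorch sketch of the RVLoRA update, under the assumption that $\Lambda_a$ and $\Lambda_b$ are diagonal matrices built from the frozen random vectors $a$ and $b$ (this interpretation, and all names below, are illustrative rather than taken from the released code):

```python
import torch
import torch.nn as nn

class RVLoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a rank-r update
    h = W0 x + diag(b) B diag(a) A x, with only A and B trainable."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # freeze pretrained W0
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # trainable
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # trainable, zero-init
        self.a = nn.Parameter(torch.randn(rank), requires_grad=False)   # frozen random
        self.b = nn.Parameter(torch.randn(d_out), requires_grad=False)  # frozen random

    def forward(self, x):
        z = x @ self.A.t()        # A x
        z = z * self.a            # Lambda_a (A x)
        z = z @ self.B.t()        # B Lambda_a A x
        z = z * self.b            # Lambda_b B Lambda_a A x
        return self.base(x) + z
```

Wrapping, say, the attention projections of each block this way leaves the pretrained backbone untouched and trains only the small $A$, $B$ matrices, consistent with the small trainable-parameter counts reported in Section 6.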

For endoscopic scenes, DA3 integrates with an intrinsic-image–based UMDE pipeline (IID-SfMLearner), using composite photometric, structural, and reflectance-based losses.

6. Benchmarks, Quantitative Results, and Application Domains

DA3 establishes state-of-the-art performance on a variety of geometry and depth estimation benchmarks. Notable metrics and results include:

Model          | Abs Rel ↓ | RMSE (mm) ↓ | δ ↑   | Params (M) | Trainable (M)
SfMLearner     | 0.086     | 7.553       | 0.925 | 31.6       | 31.6
Monodepth2     | 0.066     | 5.781       | 0.961 | 14.8       | 14.8
IID-SfM        | 0.058     | 4.820       | 0.969 | 14.8       | 14.8
DepthAnything  | 0.084     | 6.711       | 0.930 | 97.5       | 97.5
EndoDAC        | 0.052     | 4.464       | 0.979 | 99.1       | 1.66
DA3            | 0.048     | 4.172       | 0.982 | 98.8       | 1.38
  • On SCARED endoscopic data, DA3 achieves Abs Rel = 0.048, RMSE = 4.172 mm, δ = 0.982, while training only 1.38M parameters (≈1.4% of total).
  • On general benchmarks (HiRoom, ETH3D, DTU, etc.), DA3 surpasses prior SOTA VGGT by 44.3% (pose) and 25.1% (geometry), and outperforms DA2 in monocular depth estimates across most datasets.
  • 4K-resolution DA3 with metric prompts yields L1 ≈ 0.0132 and RMSE ≈ 0.0315 on ARKitScenes and improves the TSDF-fusion F-score from <0.66 (monocular) to 0.76 (the standard depth metrics are sketched below).
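
For reference, the standard monocular depth metrics reported in this section can be computed as follows (a NumPy sketch; δ is taken as the usual δ < 1.25 accuracy):

```python
import numpy as np

def depth_metrics(pred, gt, threshold=1.25):
    """Standard depth-estimation metrics: Abs Rel, RMSE, and delta accuracy."""
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    abs_rel = float(np.mean(np.abs(pred - gt) / gt))
    rmse = float(np.sqrt(np.mean((pred - gt) ** 2)))
    delta = float(np.mean(np.maximum(pred / gt, gt / pred) < threshold))
    return abs_rel, rmse, delta
```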

Downstream application domains include:

  • 3D scene reconstruction (TSDF fusion with high geometric fidelity)
  • Robotic grasping (policies using DA3-predicted depth yield 80–90% grasp success on novel objects)
  • Augmented reality endoscopy and navigation (improving edge definition, pose stability, and spatial awareness in surgery)

7. Limitations, Robustness, and Future Directions

DA3's minimal transformer approach scales across view counts (handling 1 to 4000+ input views, depending on backbone and hardware) and domains (indoor/outdoor, objects to city scale), but several open challenges remain:

  • Sensor limitations: Prompting with commodity LiDAR degrades at >2 m range; temporal depth flicker arises from sensor instability.
  • Model limitations: Despite structural simplicity, monocular DA3 occasionally underperforms DA2 in specific low-texture or dynamic scenes.
  • Directions: Ongoing work targets temporal prompt stability (cross-frame filtering/attention), incorporation of multimodal prompts (intrinsics, IMU), and fusion with generative/diffusion-style geometric heads for high-frequency detail and uncertainty quantification.

A plausible implication is that the minimal modeling principle and prompt–foundation fusion established in DA3 may generalize further, enabling unified visual geometry modeling across most downstream computer vision and robotics tasks using a single, versatile architecture.
