Depth Anything 3: Advanced Depth Estimation Model

Updated 14 November 2025
  • Depth Anything 3 is an advanced vision transformer–based framework that unifies any-view geometry prediction with prompt-based metric adaptation.
  • It integrates a dual-prediction head to output dense depth and ray maps, employing token reordering for efficient cross-view reasoning.
  • The model uses a teacher–student curriculum with lightweight adapters, achieving state-of-the-art results across diverse fusion and reconstruction benchmarks.

Depth Anything 3 (DA3) refers to advanced vision transformer–based depth estimation models and associated frameworks that generalize the "Depth Anything" concept across challenging regimes—including any-view geometry prediction, metric depth via prompting, and unsupervised monocular learning in specialized domains. DA3 models unify advances in minimal transformer architectures, prompt-based metric adaptation, and fine-grained local feature refinement for highly robust spatial inference, achieving state-of-the-art performance over a spectrum of visual geometry benchmarks, specialized real-world datasets, and application domains.

1. Model Architecture and Minimal Modeling Principle

The DA3 model family is grounded in the assertion that a single plain Vision Transformer (ViT), typically DINOv2-pretrained, suffices as a universal backbone for visual geometry without architectural specialization. Unlike previous approaches that rely on cost volumes, multi-stage fusions, or complex attention schemas, DA3 leverages a "token reordering" paradigm:

  • For a network of $L$ transformer blocks, the initial $s$ blocks apply standard self-attention independently to each input view's patch tokens.
  • The remaining $g$ blocks alternate between within-view and cross-view self-attention. By strategically reshaping the token tensor across blocks, cross-view reasoning is implemented without architectural modification (a minimal sketch follows this list).
  • When camera poses are provided, a learned "camera token"—generated from intrinsic and extrinsic parameters via a shallow MLP—is prepended to view tokens and included in all attention operations, providing pose conditioning with minimal overhead.
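
The reordering can be made concrete with a short PyTorch-style sketch (a minimal illustration under assumed tensor shapes and module names, not the released implementation): cross-view attention is obtained simply by folding the view axis into the sequence axis before a standard self-attention block, and within-view attention by folding it into the batch axis.

```python
import torch
import torch.nn as nn

class AlternatingAttention(nn.Module):
    """Sketch of token reordering: one plain self-attention block serves both
    within-view and cross-view reasoning purely by reshaping the token tensor."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, cross_view: bool) -> torch.Tensor:
        # tokens: (B, V, N, C) = batch, views, patch tokens per view, channels
        B, V, N, C = tokens.shape
        if cross_view:
            x = tokens.reshape(B, V * N, C)      # all views share one sequence
        else:
            x = tokens.reshape(B * V, N, C)      # each view attends only to itself
        xn = self.norm(x)
        y, _ = self.attn(xn, xn, xn)
        return tokens + y.reshape(B, V, N, C)    # residual connection
```

In the full model, the first $s$ blocks would run with cross_view=False and the later $g$ blocks would alternate the flag; per-view camera tokens, when available, are simply prepended along the token axis before attention.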

The DA3 head consists of a reassembly module (as in DPT) and bifurcated "Dual-DPT" fusion heads that output both:

  • A dense depth map $\hat D \in \mathbb{R}^{H \times W}$
  • A dense ray map $\hat M \in \mathbb{R}^{H \times W \times 6}$, thereby enabling unified geometry representation with maximal parameter sharing (a minimal sketch of such a head follows).
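
A simplified sketch of such a dual head (assumed layer choices and names, not the paper's exact DPT configuration) shows the idea: a shared fused feature map feeds two lightweight branches, one producing the 1-channel depth map and one the 6-channel ray map.

```python
import torch.nn as nn

class DualHead(nn.Module):
    """Sketch of a Dual-DPT-style head: shared fused features feed
    two branches (depth: 1 channel, ray map: 6 channels)."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.shared = nn.Sequential(
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
            nn.ReLU(inplace=True))
        self.depth_branch = nn.Conv2d(feat_dim, 1, 1)  # \hat D: (B, 1, H, W)
        self.ray_branch = nn.Conv2d(feat_dim, 6, 1)    # \hat M: (B, 6, H, W)

    def forward(self, fused):                          # fused: (B, feat_dim, H, W)
        x = self.shared(fused)
        return self.depth_branch(x), self.ray_branch(x)
```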

2. Depth-Ray Prediction, Supervision, and Loss Design

DA3 employs a depth-ray joint prediction target, wherein each input frame $I_i$ (single or multi-view) yields per-pixel rays $r_i(u,v) = (t_i, d_i(u,v))$, with $t_i$ the camera origin and $d_i(u,v) = R_i K_i^{-1} p$ the ray direction ($p$ being the pixel in homogeneous coordinates). The predicted output comprises:

  • Depth map $\hat D$
  • Ray map $\hat M$

The 3D reconstruction for pixel $(u,v)$ in image $i$ is
$$P_i(u,v) = t_i + D_i(u,v)\, d_i(u,v)$$
(a minimal sketch of this unprojection follows the loss summary below). The total loss aggregates depth, ray, and geometric consistency terms,
$$L = L_{\rm depth} + L_{\rm ray} + L_{\rm pcd} + L_{\rm pose},$$
with each term promoting fidelity in its respective modality:

  • $L_{\rm depth}$: scale-normalizing robust loss, plus gradient matching.
  • $L_{\rm ray}$: $\ell_1$-norm error to ray map supervision.
  • $L_{\rm pcd}$: $\ell_1$ error in 3D point cloud positions (reconstituted from depth and rays).
  • $L_{\rm pose}$: pose loss (when applicable).

All loss terms are weighted equally, and the formulation supports training on both ground-truth and teacher-generated pseudo-labels.
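
As referenced above, the per-pixel unprojection can be written as a short NumPy sketch (array shapes and names are assumptions for illustration):

```python
import numpy as np

def unproject(depth, ray_map):
    """Recover a 3D point map from predicted depth and rays.

    depth:   (H, W)    per-pixel depth D_i(u, v)
    ray_map: (H, W, 6) per-pixel ray: origin t_i in [..., :3],
             direction d_i(u, v) in [..., 3:]
    returns: (H, W, 3) point map P_i(u, v) = t_i + D_i(u, v) * d_i(u, v)
    """
    origin, direction = ray_map[..., :3], ray_map[..., 3:]
    return origin + depth[..., None] * direction

# Example on random predictions for a 4x5 image
points = unproject(np.random.rand(4, 5), np.random.rand(4, 5, 6))
assert points.shape == (4, 5, 3)
```

The point-cloud loss $L_{\rm pcd}$ then compares such reconstituted points against reference points with an $\ell_1$ penalty.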

3. Training Paradigms and Data Strategy

DA3 training utilizes a teacher-student curriculum. The teacher (DA3-Teacher) shares the architecture of the student but is trained exclusively on large-scale synthetic depth datasets, emphasizing scale-shift-invariant exponential depth. The teacher loss combines gradient, alignment, normal, and semantic (sky/object) terms.

For transfer to real-world data (LiDAR, COLMAP reconstructions), the teacher generates dense relative depth $\tilde D$, which is anchored to the available metric reference $D$ via least-squares scale–shift alignment:
$$(\hat s, \hat t) = \arg\min_{s>0,\,t}\ \sum_{p} m_p\,\bigl(s\,\tilde D_p + t - D_p\bigr)^2,$$
where $m_p$ masks pixels with valid reference depth. The aligned depth $D^{T\to M} = \hat s\,\tilde D + \hat t$ serves as pseudo-labels for the student, which is in turn trained on a mixture of synthetic ground truth and these real pseudo-labels. Pose conditioning is toggled stochastically to promote generalization across usage scenarios.
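
This weighted least-squares problem has a closed-form solution via a 2×2 normal-equation system; a NumPy sketch (variable names assumed for illustration):

```python
import numpy as np

def align_scale_shift(rel_depth, metric_depth, mask):
    """Fit (s, t) so that s * rel_depth + t ~= metric_depth over valid pixels,
    then return the aligned pseudo-label."""
    x = rel_depth[mask].ravel()
    y = metric_depth[mask].ravel()
    # Normal equations: [sum(x^2) sum(x); sum(x) n] [s; t] = [sum(x*y); sum(y)]
    A = np.array([[np.dot(x, x), x.sum()],
                  [x.sum(), x.size]])
    b = np.array([np.dot(x, y), y.sum()])
    s, t = np.linalg.solve(A, b)
    # The s > 0 constraint is assumed to hold for well-behaved teacher depth;
    # a fallback (e.g., clipping) would be needed otherwise.
    return s * rel_depth + t, (s, t)

# Example with synthetic data: true scale 2.0, shift 0.5
rel = np.random.rand(64, 64)
metric = 2.0 * rel + 0.5
aligned, (s, t) = align_scale_shift(rel, metric, metric > 0)
```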

Training is conducted on a composite of purely public academic datasets, incorporating synthetic, structure-from-motion, and direct depth measurement modalities.

4. Prompting and Metric Depth via External Sensors

A distinctive DA3 paradigm (Lin et al., 18 Dec 2024) extends the foundation model using external metric prompts (e.g., LiDAR). Rather than solely fine-tuning for scale, this approach fuses a sparse, metrically reliable depth map $L$ (e.g., from commodity LiDAR) into the decoder stages of a ViT+DPT foundation.

Multi-scale prompt fusion is realized as follows, per decoder stage $i$ (a minimal sketch follows the list):

  • Bilinear resizing of $L$ to match the stage resolution, followed by a two-layer convolutional transformation $g(\cdot)$ whose final 1×1 convolution is zero-initialized, yielding the prompt feature $P_i$.
  • Either concatenation-plus-convolution with the image feature $F_i$:

$$F_i' = \sigma(W_f [F_i; P_i] + b_f)$$

or simple additive fusion:

$$F_i' = F_i + W_p P_i$$

  • All prompt fusion blocks are lightweight, incurring ~5.7% compute overhead.
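
A minimal PyTorch sketch of one such fusion block (illustrative names and layer choices, not the released code); the zero-initialized final convolution makes the block an identity mapping at the start of fine-tuning, so the pretrained foundation's behavior is preserved until the prompt pathway is learned:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptFusionBlock(nn.Module):
    """Fuse a sparse metric depth prompt into one decoder stage's features."""
    def __init__(self, feat_dim: int, hidden: int = 32, additive: bool = True):
        super().__init__()
        self.additive = additive
        # g(.): two-layer transform of the resized prompt, final 1x1 conv zero-init
        self.g = nn.Sequential(
            nn.Conv2d(1, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, feat_dim, 1))
        nn.init.zeros_(self.g[-1].weight)
        nn.init.zeros_(self.g[-1].bias)
        if not additive:
            self.fuse = nn.Conv2d(2 * feat_dim, feat_dim, 3, padding=1)

    def forward(self, feat, prompt_depth):
        # feat: (B, C, H_i, W_i); prompt_depth: (B, 1, H, W) metric depth prompt L
        p = F.interpolate(prompt_depth, size=feat.shape[-2:],
                          mode="bilinear", align_corners=False)
        p = self.g(p)                                   # prompt feature P_i
        if self.additive:
            return feat + p                             # F_i' = F_i + W_p P_i
        # concat-then-conv variant, with an assumed ReLU as the nonlinearity sigma
        return torch.relu(self.fuse(torch.cat([feat, p], dim=1)))
```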

Training uses both synthetic and real data, including simulated LiDAR and pseudo-ground-truth generated from high-fidelity neural rendering (e.g., Zip-NeRF), plus precise (planar) ground-truth from hardware sensors (FARO scanner). The loss function supervises predictions against both metric accuracy (from FARO) and high-frequency detail (from Zip-NeRF), using an edge-aware $L_1$ loss and gradient-based regularization.

This design enables DA3 to generate metric-accurate, 4K-resolution depth maps with minimal retraining and high-speed inference (e.g., ~20 FPS at 768×1024 for ViT-L).

5. Lightweight Adaptation and Fine-Tuning Strategies

Domain adaptation for specialized tasks (e.g., endoscopic unsupervised monocular depth estimation, UMDE) employs highly parameter-efficient modules:

  • RVLoRA adapters: Each transformer block adds two "Random-Vector Low-Rank Adaptation" (RVLoRA) modules, which freeze all pretrained weights and instead learn a rank-$r$ update via two small trainable matrices ($A$, $B$) and two frozen random vectors ($a$, $b$), such that:

$$h = W_0 x + \Lambda_b B \Lambda_a A x$$

with only $A$ and $B$ optimized, providing scale adaptability in challenging, unseen depth domains (a sketch follows this list).

  • Res-DSC blocks: Four residual convolutional blocks with depthwise separable convolutions inserted into the transformer hierarchy recover high-frequency local detail omitted by global transformers, restoring edge and texture fidelity without significant parameter count increase.
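
A minimal PyTorch sketch of the RVLoRA update, under the assumption that $\Lambda_a$ and $\Lambda_b$ are diagonal matrices built from the frozen random vectors $a$ and $b$ (this interpretation, and all names below, are illustrative rather than taken from the released code):

```python
import torch
import torch.nn as nn

class RVLoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a rank-r update
    h = W0 x + diag(b) B diag(a) A x, with only A and B trainable."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                    # freeze pretrained W0
        d_in, d_out = base.in_features, base.out_features
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)  # trainable
        self.B = nn.Parameter(torch.zeros(d_out, rank))        # trainable, zero-init
        self.a = nn.Parameter(torch.randn(rank), requires_grad=False)   # frozen random
        self.b = nn.Parameter(torch.randn(d_out), requires_grad=False)  # frozen random

    def forward(self, x):
        z = x @ self.A.t()        # A x
        z = z * self.a            # Lambda_a (A x)
        z = z @ self.B.t()        # B Lambda_a A x
        z = z * self.b            # Lambda_b B Lambda_a A x
        return self.base(x) + z
```

Wrapping, say, the attention projections of each block this way leaves the pretrained backbone untouched and trains only the small $A$, $B$ matrices, consistent with the small trainable-parameter counts reported in Section 6.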

For endoscopic scenes, DA3 integrates with an intrinsic-image–based UMDE pipeline (IID-SfMLearner), using composite photometric, structural, and reflectance-based losses.

6. Benchmarks, Quantitative Results, and Application Domains

DA3 establishes state-of-the-art performance on a variety of geometry and depth estimation benchmarks. Notable metrics and results include:

Model          | Abs Rel ↓ | RMSE (mm) ↓ | δ ↑   | Params (M) | Trainable (M)
SfMLearner     | 0.086     | 7.553       | 0.925 | 31.6       | 31.6
Monodepth2     | 0.066     | 5.781       | 0.961 | 14.8       | 14.8
IID-SfM        | 0.058     | 4.820       | 0.969 | 14.8       | 14.8
DepthAnything  | 0.084     | 6.711       | 0.930 | 97.5       | 97.5
EndoDAC        | 0.052     | 4.464       | 0.979 | 99.1       | 1.66
DA3            | 0.048     | 4.172       | 0.982 | 98.8       | 1.38
  • On SCARED endoscopic data, DA3 achieves Abs Rel = 0.048, RMSE = 4.172 mm, δ = 0.982, while training only 1.38M parameters (≈1.4% of total).
  • On general benchmarks (HiRoom, ETH3D, DTU, etc.), DA3 surpasses prior SOTA VGGT by 44.3% (pose) and 25.1% (geometry), and outperforms DA2 in monocular depth estimates across most datasets.
  • 4K-resolution DA3 with metric prompts yields L1 ≈ 0.0132 and RMSE ≈ 0.0315 on ARKitScenes and improves the TSDF-fusion F-score from <0.66 (monocular) to 0.76 (the standard depth metrics are sketched below).
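
For reference, the standard monocular depth metrics reported in this section can be computed as follows (a NumPy sketch; δ is taken as the usual δ < 1.25 accuracy):

```python
import numpy as np

def depth_metrics(pred, gt, threshold=1.25):
    """Standard depth-estimation metrics: Abs Rel, RMSE, and delta accuracy."""
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]
    abs_rel = float(np.mean(np.abs(pred - gt) / gt))
    rmse = float(np.sqrt(np.mean((pred - gt) ** 2)))
    delta = float(np.mean(np.maximum(pred / gt, gt / pred) < threshold))
    return abs_rel, rmse, delta
```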

Downstream application domains include:

  • 3D scene reconstruction (TSDF fusion with high geometric fidelity)
  • Robotic grasping (policies using DA3-predicted depth yield 80–90% grasp success on novel objects)
  • Augmented reality endoscopy and navigation (improving edge definition, pose stability, and spatial awareness in surgery)

7. Limitations, Robustness, and Future Directions

DA3's minimal transformer approach scales across view counts (handling 1 to 4000+ input views, depending on backbone and hardware) and domains (indoor/outdoor, objects to city scale), but several open challenges remain:

  • Sensor limitations: Prompting with commodity LiDAR degrades at >2 m range; temporal depth flicker arises from sensor instability.
  • Model limitations: Despite structural simplicity, monocular DA3 occasionally underperforms DA2 in specific low-texture or dynamic scenes.
  • Directions: Ongoing work targets temporal prompt stability (cross-frame filtering/attention), incorporation of multimodal prompts (intrinsics, IMU), and fusion with generative/diffusion-style geometric heads for high-frequency detail and uncertainty quantification.

A plausible implication is that the minimal modeling principle and prompt–foundation fusion established in DA3 may generalize further, enabling unified visual geometry modeling across most downstream computer vision and robotics tasks using a single, versatile architecture.
