Depth Pro: Fast, Sharp Depth Estimation

Updated 15 November 2025
  • Depth Pro is a family of vision models that delivers fast, metric-accurate monocular depth estimation using a multi-scale Vision Transformer and self-calibrating focal-length prediction.
  • It employs a two-stage training curriculum on both real and synthetic data to enhance boundary precision and scale calibration without relying on camera intrinsics.
  • Quantitative results demonstrate state-of-the-art performance in runtime and boundary accuracy, making it ideal for deployment in robotics, AR/VR, and computer vision applications.

Depth Pro refers to a family of vision algorithms and models designed to deliver sharp, metric-accurate monocular depth estimation with high boundary fidelity, fast inference, and robust generalization to diverse inputs, without reliance on camera intrinsic parameters. The most recent foundation model under this name, “Depth Pro” (Bochkovskii et al., 2 Oct 2024), combines a multi-scale Vision Transformer architecture, a two-stage real-and-synthetic training curriculum, and a focal-length estimator to synthesize dense depth maps with high-frequency detail and absolute metric scale from a single RGB image. This methodology is motivated by the need for zero-shot, high-resolution, and metric depth prediction that is operationally deployable in settings where camera metadata may be missing or unreliable.

1. Multi-Scale Vision Transformer Architecture

Depth Pro processes input imagery at native high resolution (1536×1536 pixels) via a computationally efficient multi-scale Vision Transformer (ViT) backbone (ViT-L DINOv2). The architecture consists of:

  • Patch Encoder: The image is split into overlapping 384×384 patches at multiple scales. All patches are processed using shared ViT weights, which maximizes parallelism and model consistency.
  • Global Image Encoder: Applied to the whole image (downsampled to 384×384), this branch extracts holistic scene context to supplement patch-local information.
  • Decoder: Patch and global features are reassembled into a high-resolution “feature volume” and decoded using a DPT-style head, leveraging convolutional upsampling layers and skip connections for precise spatial synthesis.

This design achieves substantial speedups compared to scaling a single ViT to megapixel resolutions and enables sub-second inference (341 ms for 2.36 MP, 504 M parameters, 4.37 TFLOPs@HD).
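
To make the patch-splitting concrete, here is a minimal PyTorch sketch of multi-scale, overlapping-patch encoding with a shared backbone. The function names, the 25% overlap, and the exact scale pyramid are illustrative assumptions rather than the published configuration, and the feature fusion/decoding stage is omitted.

```python
import torch
import torch.nn.functional as F

def split_into_patches(x, patch_size=384, overlap=0.25):
    """Split a (B, C, H, W) map into overlapping patch_size x patch_size tiles."""
    stride = int(patch_size * (1.0 - overlap))
    tiles = x.unfold(2, patch_size, stride).unfold(3, patch_size, stride)
    b, c, nh, nw, ph, pw = tiles.shape
    # Flatten the tile grid so every patch goes through the encoder as one batch.
    return tiles.permute(0, 2, 3, 1, 4, 5).reshape(b * nh * nw, c, ph, pw)

def multiscale_encode(image, encoder, scales=(1536, 768, 384)):
    """Encode one image at several scales with a single shared encoder.

    image: (B, 3, 1536, 1536); encoder: maps (N, 3, 384, 384) -> features.
    The 384-pixel scale doubles as the global branch (whole image, downsampled).
    """
    features = []
    for s in scales:
        resized = F.interpolate(image, size=(s, s), mode="bilinear",
                                align_corners=False)
        batch = resized if s == 384 else split_into_patches(resized)
        features.append(encoder(batch))
    return features
```

Because every scale shares one set of ViT weights, the patches can be batched and processed in parallel, which is what keeps megapixel inference in the sub-second range.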

2. Metric Depth Recovery and Focal-Length Estimation

Depth Pro’s output is a canonical inverse depth map $C$ (in units of disparity scaled to focal length). Metric depth is computed as

$$D_{\mathrm{m}} = \frac{f_{\mathrm{px}}}{w \, C}$$

where $f_{\mathrm{px}}$ is the horizontal focal length in pixels (predicted by a dedicated “FOV head”) and $w$ is the image width (1536 px). The FOV head, trained independently after depth prediction, is critical for ensuring scale alignment without explicit camera intrinsics. It uses intermediate Depth Pro features (frozen) and additional ViT-based features, optimized by $\ell_2$ regression on focal length.

A plausible implication is that this decoupled focal length prediction effectively normalizes scale and allows direct metric interpretation of depth across disparate datasets and sources.
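
A minimal NumPy sketch of this recovery, assuming the FOV head outputs a horizontal field of view in degrees (the function names are illustrative; the repository's actual API differs):

```python
import numpy as np

def focal_px_from_fov(fov_deg, width_px=1536):
    """Horizontal focal length in pixels via the standard pinhole relation
    f_px = (w / 2) / tan(FOV_h / 2)."""
    return (width_px / 2.0) / np.tan(np.deg2rad(fov_deg) / 2.0)

def metric_depth(canonical_inverse_depth, fov_deg, width_px=1536, eps=1e-6):
    """Convert canonical inverse depth C to metric depth D_m = f_px / (w * C)."""
    f_px = focal_px_from_fov(fov_deg, width_px)
    return f_px / (width_px * np.maximum(canonical_inverse_depth, eps))
```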

3. Training Losses and Curriculum

Depth Pro employs a two-stage training protocol:

  • Stage 1 (Generalization): Real and synthetic datasets are mixed. Losses include per-pixel MAE (with 20% worst-pixel trimming) on metric sets and scale-shift-invariant MAE on non-metric sets. Additionally, scale-shift-invariant gradient losses ($\mathcal{L}_{\text{MAGE}}$, using Scharr gradients) are applied on synthetic data to encourage sharp spatial derivatives.
  • Stage 2 (Detail Refinement): High-quality synthetic datasets only. Losses combine MAE, MSE, and first- and second-order multi-scale derivative terms ($\mathcal{L}_{\text{MAGE}}$, $\mathcal{L}_{\text{MALE}}$, $\mathcal{L}_{\text{MSGE}}$) to sharpen boundary predictions and refine structural fidelity.

This curriculum improves boundary sharpness and the retention of fine structural detail relative to single-stage or reverse-ordered training, and the optimal settings are validated empirically in the paper's extended ablation tables (Tables 15–17).
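
A minimal PyTorch sketch of two of these objectives follows, under stated assumptions: the trimming fraction, Scharr normalization, and single-scale gradient term are simplifications of the paper's multi-scale, scale-shift-aligned formulation.

```python
import torch
import torch.nn.functional as F

def trimmed_mae(pred, target, trim=0.2):
    """Mean absolute error after discarding the worst `trim` fraction of pixels."""
    err = (pred - target).abs().flatten(1)              # (B, N)
    k = int(err.shape[1] * (1.0 - trim))
    kept, _ = torch.topk(err, k, dim=1, largest=False)  # keep the best 80%
    return kept.mean()

def scharr_gradients(x):
    """Scharr x/y gradients of a (B, 1, H, W) depth map."""
    kx = torch.tensor([[3., 0., -3.], [10., 0., -10.], [3., 0., -3.]],
                      device=x.device).view(1, 1, 3, 3) / 16.0
    ky = kx.transpose(2, 3)
    return F.conv2d(x, kx, padding=1), F.conv2d(x, ky, padding=1)

def gradient_mae(pred, target):
    """First-order gradient-matching loss (one scale shown)."""
    pgx, pgy = scharr_gradients(pred)
    tgx, tgy = scharr_gradients(target)
    return (pgx - tgx).abs().mean() + (pgy - tgy).abs().mean()
```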

4. Boundary Accuracy Evaluation

Distinct from prior work, Depth Pro defines explicit boundary-sharpness metrics that treat matting and segmentation masks as proxy ground-truth contours. For occlusion boundary evaluation:

  • Depth-derived contours: For neighboring pixel pairs $(i, j)$, a contour is declared at threshold $t\%$ if $d(j)/d(i) > 1 + t/100$.
  • Binary mask-derived contours: Recalls are computed for predicted depth contours against known mask boundaries.
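
A NumPy sketch of the contour criterion and recall computation follows; it checks one direction per axis and computes a simple recall, whereas the full benchmark also evaluates precision and F1:

```python
import numpy as np

def depth_contours(depth, t=5.0):
    """Boolean map of occluding contours: neighbor pairs whose depth ratio
    exceeds 1 + t/100 (t is the threshold in percent)."""
    ratio = 1.0 + t / 100.0
    c = np.zeros(depth.shape, dtype=bool)
    c[:, :-1] |= depth[:, 1:] / depth[:, :-1] > ratio  # horizontal neighbors
    c[:-1, :] |= depth[1:, :] / depth[:-1, :] > ratio  # vertical neighbors
    return c

def boundary_recall(pred_depth, gt_contours, t=5.0):
    """Fraction of ground-truth contour pixels recovered by predicted contours."""
    pred = depth_contours(pred_depth, t)
    return (pred & gt_contours).sum() / max(gt_contours.sum(), 1)
```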

Depth Pro achieves state-of-the-art F1 and recall scores across all tested boundary benchmarks (e.g., F1=0.409 on Sintel vs. 0.321 for previous best; AM-2k recall=0.173 vs. 0.107 for prior).

5. Quantitative Results and Comparative Analysis

Depth Pro systematically outperforms prior metric monocular depth baselines on unseen datasets in both overall accuracy and boundary metrics. Overall accuracy is reported as $\delta_1$: the percentage of pixels whose predicted depth is within 25% of ground truth, i.e. $\max(d/d^*, d^*/d) < 1.25$. Key numbers ($\delta_1$, higher is better; lower average rank is better):

| Method | Booster | ETH3D | Middlebury | nuScenes | Sintel | Sun-RGBD | Avg. Rank |
|---|---|---|---|---|---|---|---|
| UniDepth | 27.6 | 25.3 | 31.9 | 83.6 | 16.5 | 95.8 | 4.2 |
| PatchFusion | 22.6 | 51.8 | 49.9 | 20.4 | 14.0 | 53.6 | 5.2 |
| Metric3D v2 | 39.4 | 87.7 | 29.9 | 82.6 | 38.3 | 75.6 | 3.7 |
| Depth Pro | 46.6 | 41.5 | 60.5 | 49.1 | 40.0 | 89.0 | 2.5 |

Boundary metrics similarly favor Depth Pro by a substantial margin. In terms of runtime, Depth Pro yields the highest native resolution and sharpest outputs at the fastest inference speed.
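
For reference, a minimal implementation of the $\delta_1$ metric reported above; this is the standard definition, not code from the Depth Pro repository:

```python
import numpy as np

def delta1(pred, gt):
    """delta_1 accuracy: share of valid pixels with max(pred/gt, gt/pred) < 1.25."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    valid = gt > 0  # ignore pixels without ground-truth depth
    r = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float((r < 1.25).mean())
```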

6. Ablation Studies and Design Decisions

Analysis of backbone selection, decoder choices, loss strategies, and curriculum scheduling indicates that:

  • ViT-L DINOv2 offers optimal tradeoff between absolute error and boundary fidelity.
  • Direct scaling of ViT to megapixel images increases latency and degrades boundary accuracy, confirming the efficacy of the patch-based multi-scale approach.
  • Exclusion of first/second-order derivative losses or reversal of training curriculum (synthetic before real) harms convergence and result quality.
  • The parallel DPT+ViT FOV head significantly surpasses serial or single-branch alternatives for focal-length prediction (~14% improvement).

This suggests that careful architectural modularity and loss scheduling are essential for generalization, sharpness, and scale calibration.

7. Practical Usage and Extension

Depth Pro supports immediate zero-shot deployment for metric depth estimation in computer vision, robotics, and AR/VR applications. Inputs are unconstrained RGB; outputs are dense, sharp metric depth maps and predicted focal lengths. No camera intrinsics, external calibration, or fine-tuning is required at test time. The open-source implementation at https://github.com/apple/ml-depth-pro enables adoption in research and production pipelines.
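
A usage sketch adapted from the repository's README (the API may change between versions; `example.jpg` is a placeholder path):

```python
import depth_pro

# Load the pretrained model and its preprocessing transform.
model, transform = depth_pro.create_model_and_transforms()
model.eval()

# Load an image; f_px is taken from EXIF metadata when available, else None.
image, _, f_px = depth_pro.load_rgb("example.jpg")
image = transform(image)

# Run inference; the focal length is estimated when f_px is None.
prediction = model.infer(image, f_px=f_px)
depth = prediction["depth"]                    # metric depth in meters
focallength_px = prediction["focallength_px"]  # estimated focal length in pixels
```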

A plausible implication is that the Depth Pro methodology can be extended to fuse with semantic segmentation, 3D reconstruction, and spatial reasoning frameworks, given its plug-and-play architecture and ability to yield metric, boundary-precise depth at operational framerates.

Depth Pro should be distinguished from an earlier approach of the same name (Kaneko et al., 2019), which centered on triangular patch-based representations and CNN–MLP hybrid architectures. That variant achieved parameter efficiency but lacks the zero-shot, metric, and boundary-sharp capabilities of the modern ViT-based framework, and is superseded in performance, sharpness, and generalization by the multi-scale transformer design of the 2024–2025 version.

In summary, Depth Pro denotes a modern foundation model for fast, sharp, metric monocular depth estimation, integrating multi-scale transformer encoding, edge-preserving losses, and self-calibrating focal length prediction. It advances state-of-the-art accuracy and boundary metrics while enabling practical deployment without reliance on camera metadata.
