Depth Pro: Fast, Sharp Depth Estimation
- Depth Pro is a family of advanced vision algorithms that deliver fast, metric-accurate monocular depth estimation using a multi-scale Vision Transformer and self-calibrating focal-length prediction.
- It employs a two-stage training curriculum on both real and synthetic data to enhance boundary precision and scale calibration without relying on camera intrinsics.
- Quantitative results demonstrate state-of-the-art performance in runtime and boundary accuracy, making it ideal for deployment in robotics, AR/VR, and computer vision applications.
Depth Pro refers to a family of vision algorithms and models designed to deliver sharp, metric-accurate monocular depth estimation with high boundary fidelity, fast inference, and robust generalization to diverse inputs, without reliance on camera intrinsic parameters. The most recent foundation model under this name, “Depth Pro” (Bochkovskii et al., 2024), combines a multi-scale Vision Transformer architecture, a two-stage real-and-synthetic training curriculum, and a focal-length estimator to synthesize dense depth maps with high-frequency detail and absolute metric scale from a single RGB image. This methodology is motivated by the need for zero-shot, high-resolution, and metric depth prediction that is operationally deployable in settings where camera metadata may be missing or unreliable.
1. Multi-Scale Vision Transformer Architecture
Depth Pro processes input imagery at native high resolution (1536×1536 pixels) via a computationally efficient multi-scale Vision Transformer (ViT) backbone (ViT-L DINOv2). The architecture consists of:
- Patch Encoder: The image is split into overlapping 384×384 patches at multiple scales. All patches are processed using shared ViT weights, which maximizes parallelism and model consistency.
- Global Image Encoder: Applied to the whole image (downsampled to 384×384), this branch extracts holistic scene context to supplement patch-local information.
- Decoder: Patch and global features are reassembled into a high-resolution “feature volume” and decoded using a DPT-style head, leveraging convolutional upsampling layers and skip connections for precise spatial synthesis.
This design achieves substantial speedups compared to scaling a single ViT to megapixel resolutions and enables sub-second inference (341 ms for a 2.36 MP image; 504 M parameters; 4.37 TFLOPs at HD resolution).
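As a rough illustration of the patch-splitting idea, the sketch below shows how overlapping 384×384 tiles and a downsampled global view could be produced from a 1536×1536 input. Helper names and the overlap fraction are hypothetical; the real encoder additionally runs every tile through shared ViT-L weights and merges the resulting features in the DPT-style decoder.

```python
import torch
import torch.nn.functional as F

def extract_patches(image: torch.Tensor, patch: int = 384, overlap: float = 0.25):
    """Split a (B, 3, H, W) image into overlapping patch x patch tiles.

    Illustrative only: the actual model builds a multi-scale pyramid and
    feeds every tile through a shared-weight ViT patch encoder.
    """
    stride = int(patch * (1 - overlap))
    tiles = image.unfold(2, patch, stride).unfold(3, patch, stride)
    b, c, nh, nw, ph, pw = tiles.shape
    # (B, 3, nH, nW, patch, patch) -> (B * nH * nW, 3, patch, patch)
    return tiles.permute(0, 2, 3, 1, 4, 5).reshape(-1, c, ph, pw)

image = torch.randn(1, 3, 1536, 1536)  # native input resolution
pyramid = [image, F.interpolate(image, scale_factor=0.5, mode="bilinear")]
patches = [extract_patches(level) for level in pyramid]  # input to the shared patch encoder
global_view = F.interpolate(image, size=(384, 384), mode="bilinear")  # input to the global image encoder
```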
2. Metric Depth Recovery and Focal-Length Estimation
Depth Pro’s output is a canonical inverse depth map $C$ (disparity normalized with respect to focal length). Metric depth is computed as

$$D_m = \frac{f_{px}}{w\,C},$$

where $f_{px}$ is the horizontal focal length in pixels (predicted by a dedicated “FOV head”) and $w$ is the image width (1536 px). The FOV head, trained independently after depth prediction, is critical for ensuring scale alignment without explicit camera intrinsics. It uses frozen intermediate Depth Pro features and additional ViT-based features, optimized by regression on the focal length.
A plausible implication is that this decoupled focal length prediction effectively normalizes scale and allows direct metric interpretation of depth across disparate datasets and sources.
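A minimal sketch of this conversion, assuming the network exposes its canonical inverse depth map and the FOV head’s focal-length estimate (variable and function names are hypothetical):

```python
import torch

def metric_depth(canonical_inverse_depth: torch.Tensor,
                 f_px: torch.Tensor,
                 width: int = 1536,
                 eps: float = 1e-6) -> torch.Tensor:
    """Convert canonical inverse depth C to metric depth via D_m = f_px / (w * C).

    Dividing by the image width w undoes the canonical scaling; multiplying by
    the predicted focal length f_px restores absolute metric scale without
    requiring camera intrinsics.
    """
    return f_px / (width * canonical_inverse_depth.clamp(min=eps))

C = torch.rand(1, 1536, 1536) + 0.1  # stand-in canonical inverse depth
f_px = torch.tensor(1200.0)          # focal length predicted by the FOV head
depth_m = metric_depth(C, f_px)      # dense metric depth map in meters
```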
3. Training Losses and Curriculum
Depth Pro employs a two-stage training protocol:
- Stage 1 (Generalization): Real and synthetic datasets are mixed. Losses include per-pixel MAE (with the worst 20% of pixels trimmed) on metric datasets and scale-and-shift-invariant MAE on non-metric datasets. Additionally, scale-and-shift-invariant gradient losses (computed with Scharr gradients) are applied on synthetic data to encourage sharp spatial derivatives.
- Stage 2 (Detail Refinement): High-quality synthetic datasets only. Losses combine MAE, MSE, and first- and second-order multi-scale derivative terms to sharpen boundary predictions and refine structural fidelity.
This curriculum improves boundary sharpness and detail retention relative to single-stage or non-curriculum training, and the chosen settings are empirically validated in extended ablation tables (Tables 15–17 of the source).
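The following sketch conveys the flavor of two of these terms, a trimmed per-pixel MAE and a Scharr-gradient loss. It omits the scale-and-shift alignment and multi-scale weighting of the actual training objective, and all function names are hypothetical.

```python
import torch
import torch.nn.functional as F

def trimmed_mae(pred, target, trim=0.2):
    """Per-pixel MAE with the worst `trim` fraction of pixels discarded."""
    err = (pred - target).abs().flatten()
    k = int(err.numel() * (1 - trim))
    return err.topk(k, largest=False).values.mean()

def scharr_grads(x):
    """First-order spatial derivatives via 3x3 Scharr kernels."""
    kx = torch.tensor([[3., 0., -3.], [10., 0., -10.], [3., 0., -3.]]) / 32
    ky = kx.t()
    w = torch.stack([kx, ky]).unsqueeze(1).to(x)  # (2, 1, 3, 3)
    return F.conv2d(x, w, padding=1)              # (B, 2, H, W)

def gradient_loss(pred, target):
    """MAE on Scharr gradients, encouraging sharp depth discontinuities."""
    return (scharr_grads(pred) - scharr_grads(target)).abs().mean()

pred = torch.rand(2, 1, 384, 384)
gt = torch.rand(2, 1, 384, 384)
loss = trimmed_mae(pred, gt) + gradient_loss(pred, gt)
```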
4. Boundary Accuracy Evaluation
Distinct from prior work, Depth Pro defines explicit boundary-sharpness metrics by treating matting and segmentation masks as proxy ground-truth contours. For occlusion-boundary evaluation:
- Depth-derived contours: For neighboring pixels $i$ and $j$, a contour is declared at threshold $t$ if $d(j)/d(i) > 1 + t/100$.
- Binary mask-derived contours: Recalls are computed for predicted depth contours against known mask boundaries.
Depth Pro achieves state-of-the-art F1 and recall scores across all tested boundary benchmarks (e.g., F1=0.409 on Sintel vs. 0.321 for previous best; AM-2k recall=0.173 vs. 0.107 for prior).
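A minimal sketch of the pairwise contour definition and a resulting F1 score, assuming dense positive ground-truth depth (the mask-based variant instead computes recall against binary mask boundaries); function names are hypothetical:

```python
import torch

def depth_contours(d: torch.Tensor, t: float = 5.0) -> torch.Tensor:
    """Occluding-contour map for a (H, W) depth map d.

    A pixel is flagged as a contour if any 4-neighbor is more than t percent
    farther away, i.e. the depth ratio exceeds 1 + t/100.
    """
    ratio = 1.0 + t / 100.0
    c = torch.zeros_like(d, dtype=torch.bool)
    c[:, :-1] |= d[:, 1:] / d[:, :-1] > ratio  # right neighbor farther
    c[:, 1:]  |= d[:, :-1] / d[:, 1:] > ratio  # left neighbor farther
    c[:-1, :] |= d[1:, :] / d[:-1, :] > ratio  # bottom neighbor farther
    c[1:, :]  |= d[:-1, :] / d[1:, :] > ratio  # top neighbor farther
    return c

def boundary_f1(pred_d, gt_d, t=5.0):
    """F1 between contours derived from predicted and ground-truth depth."""
    cp, cg = depth_contours(pred_d, t), depth_contours(gt_d, t)
    tp = (cp & cg).sum()
    precision = tp / cp.sum().clamp(min=1)
    recall = tp / cg.sum().clamp(min=1)
    return 2 * precision * recall / (precision + recall + 1e-8)

pred = torch.rand(240, 320) + 0.5
gt = torch.rand(240, 320) + 0.5
print(boundary_f1(pred, gt))
```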
5. Quantitative Results and Comparative Analysis
Depth Pro systematically outperforms prior “metric monocular depth” baselines on unseen datasets for both overall accuracy ($\delta_1$: the percentage of pixels within 25% of ground truth) and boundary metrics. Key numbers include:
| Method | Booster | ETH3D | Middlebury | nuScenes | Sintel | Sun-RGBD | Avg Rank |
|---|---|---|---|---|---|---|---|
| UniDepth | 27.6 | 25.3 | 31.9 | 83.6 | 16.5 | 95.8 | 4.2 |
| PatchFusion | 22.6 | 51.8 | 49.9 | 20.4 | 14.0 | 53.6 | 5.2 |
| Metric3D v2 | 39.4 | 87.7 | 29.9 | 82.6 | 38.3 | 75.6 | 3.7 |
| Depth Pro | 46.6 | 41.5 | 60.5 | 49.1 | 40.0 | 89.0 | 2.5 |
Boundary metrics similarly favor Depth Pro by a substantial margin. In terms of runtime, Depth Pro produces the sharpest outputs at the highest native resolution while maintaining the fastest inference speed among the compared methods.
6. Ablation Studies and Design Decisions
Analysis of backbone selection, decoder design, loss strategies, and curriculum scheduling shows that:
- ViT-L DINOv2 offers optimal tradeoff between absolute error and boundary fidelity.
- Direct scaling of ViT to megapixel images increases latency and degrades boundary accuracy, confirming the efficacy of the patch-based multi-scale approach.
- Exclusion of first/second-order derivative losses or reversal of training curriculum (synthetic before real) harms convergence and result quality.
- The parallel DPT+ViT FOV head significantly surpasses serial or single-branch alternatives for focal-length prediction (~14% improvement).
This suggests that careful architectural modularity and loss scheduling are essential for generalization, sharpness, and scale calibration.
7. Practical Usage and Extension
Depth Pro supports immediate zero-shot deployment for metric depth estimation in computer vision, robotics, and AR/VR applications. Inputs are unconstrained RGB; outputs are dense, sharp metric depth maps and predicted focal lengths. No camera intrinsics, external calibration, or fine-tuning is required at test time. The open-source implementation at https://github.com/apple/ml-depth-pro enables adoption in research and production pipelines.
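A usage sketch following the repository’s README (verify against the current release, as the interface may change across versions):

```python
import depth_pro

# Load the pretrained model and its preprocessing transform.
model, transform = depth_pro.create_model_and_transforms()
model.eval()

# Load an RGB image; f_px is read from EXIF metadata when available,
# otherwise the model falls back to its own focal-length prediction.
image, _, f_px = depth_pro.load_rgb("example.jpg")

prediction = model.infer(transform(image), f_px=f_px)
depth = prediction["depth"]                    # metric depth in meters
focallength_px = prediction["focallength_px"]  # predicted focal length in pixels
```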
A plausible implication is that the Depth Pro methodology can be extended to fuse with semantic segmentation, 3D reconstruction, and spatial reasoning frameworks, given its plug-and-play architecture and ability to yield metric, boundary-precise depth at operational framerates.
8. Context and Related Methods
Depth Pro should be distinguished from earlier “Depth Pro” approaches (Kaneko et al., 2019) centered on triangular patch-based representations and CNN–MLP hybrid architectures, which achieved parameter efficiency but lacked the zero-shot, metric, and boundary-sharp capabilities of the modern ViT-based framework. These earlier variants are superseded in performance, sharpness, and generalization by the multi-scale transformer design of the 2024–2025 version.
In summary, Depth Pro denotes a modern foundation model for fast, sharp, metric monocular depth estimation, integrating multi-scale transformer encoding, edge-preserving losses, and self-calibrating focal length prediction. It advances state-of-the-art accuracy and boundary metrics while enabling practical deployment without reliance on camera metadata.