UniDepth v2: Universal Depth Estimation
- UniDepth v2 is a framework for monocular metric depth estimation that infers 3D scene geometry from single RGB images without the need for explicit camera calibration.
- It uses a vision transformer with a self-promptable camera module and efficient lateral connections to fuse multi-scale features for accurate depth prediction.
- Innovations such as edge-guided loss and per-pixel uncertainty estimation enable state-of-the-art zero-shot generalization across ten diverse benchmarks.
UniDepth v2 is a universal framework for monocular metric depth estimation that directly predicts 3D scene geometry from single RGB images across diverse domains, without requiring explicit camera calibration or extrinsic parameters. Building upon its predecessor’s self-promptable camera module, pseudo-spherical output representation, and geometric invariance loss, UniDepth v2 incorporates additional innovations including a simplified, more efficient architecture, an edge-guided loss for sharp boundary prediction, and per-pixel uncertainty estimation. The framework achieves state-of-the-art zero-shot generalization on ten benchmark datasets and delivers robust, confidence-aware depth maps suitable for a spectrum of 3D perception and robotics tasks.
1. Model Architecture and Self-Promptable Camera Module
The backbone of UniDepth v2 is a vision transformer encoder—pretrained with self-supervised protocols—tasked with extracting multi-scale features from input images. Unlike conventional metric depth estimators relying on externally supplied camera intrinsics, UniDepth v2 incorporates a self-promptable camera module that infers dense camera representations directly from image features.
For each image, the module regresses per-image pinhole camera parameters (focal lengths $f_x$, $f_y$; principal point offsets $c_x$, $c_y$) by computing multiplicative residuals over a default initialization (typically tied to the image width and height). After predicting these, the module back-projects each pixel coordinate to a ray direction, from which the azimuth ($\theta$) and elevation ($\phi$) angles are computed for all pixels. These two channels undergo sine encoding and are injected as prompts via cross-attention modules into the depth prediction head.
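The back-projection and angular prompt construction can be sketched as follows. This is a minimal illustration assuming a standard pinhole model; the function names (`ray_angles`, `sine_embedding`) and the frequency schedule are hypothetical, not the released API.

```python
import torch

def ray_angles(fx, fy, cx, cy, H, W):
    """Back-project pixel centers to rays under a pinhole model and
    return per-pixel azimuth/elevation angles (illustrative sketch)."""
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32) + 0.5,
        torch.arange(W, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    # Ray direction r = K^-1 [u, v, 1]^T for each pixel.
    rx = (xs - cx) / fx
    ry = (ys - cy) / fy
    rz = torch.ones_like(rx)
    azimuth = torch.atan2(rx, rz)                            # angle in the x-z plane
    elevation = torch.atan2(ry, torch.sqrt(rx**2 + rz**2))   # angle w.r.t. the x-z plane
    return azimuth, elevation

def sine_embedding(angles, num_freqs=16):
    """Sinusoidal encoding of an angle channel before it is injected as a
    prompt via cross-attention (frequency schedule is an assumption)."""
    freqs = 2.0 ** torch.arange(num_freqs, dtype=torch.float32)   # (F,)
    x = angles.unsqueeze(-1) * freqs                              # (..., F)
    return torch.cat([torch.sin(x), torch.cos(x)], dim=-1)        # (..., 2F)
```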
The depth module itself follows a feature pyramid structure incorporating lateral connections and utilizes computationally efficient blocks (e.g., ResNet-type convolutions) to merge global context with local geometric detail, conditioned at multiple scales on the dense camera prompts.
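As a rough sketch of one such decoder stage (layer choices, sizes, and the `CameraConditionedBlock` name are assumptions, not the published architecture), lateral fusion followed by camera-prompt cross-attention might look like:

```python
import torch
from torch import nn

class CameraConditionedBlock(nn.Module):
    """One decoder stage: fuse a lateral encoder feature with the upsampled
    coarser feature, then condition on the dense camera prompt via
    cross-attention. Structure is illustrative only."""
    def __init__(self, dim, prompt_dim, num_heads=8):
        super().__init__()
        self.lateral = nn.Conv2d(dim, dim, kernel_size=1)
        self.fuse = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.attn = nn.MultiheadAttention(dim, num_heads, kdim=prompt_dim,
                                          vdim=prompt_dim, batch_first=True)

    def forward(self, skip, coarse, camera_prompt):
        # skip: (B, C, H, W) encoder feature; coarse: (B, C, H/2, W/2);
        # camera_prompt: (B, N, prompt_dim) sine-encoded ray angles.
        x = self.lateral(skip) + nn.functional.interpolate(
            coarse, size=skip.shape[-2:], mode="bilinear", align_corners=False)
        x = self.fuse(x)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)                       # (B, H*W, C)
        tokens = tokens + self.attn(tokens, camera_prompt, camera_prompt)[0]
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```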
2. Pseudo-Spherical Output Representation and Metric Prediction
UniDepth v2 decouples camera calibration from depth by formulating the output space in pseudo-spherical coordinates:

$$\mathbf{o} = (\theta, \phi, z_{\log}), \qquad z_{\log} = \log d .$$

Here, $(\theta, \phi)$ encode the ray direction, and $z_{\log}$ encodes the logarithm of metric depth. This representation eliminates the entanglement of scene scale with camera parameters and enables the model to generalize to images acquired under unknown or varying intrinsic settings. The decoder predicts $z_{\log}$, which is exponentiated to retrieve the metric depth $d = \exp(z_{\log})$.
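A minimal conversion from the pseudo-spherical output to metric 3D points could look like the following; treating the exponentiated value as the radial distance along the unit ray is an assumption made for illustration, as is the z-forward axis convention.

```python
import torch

def pseudo_spherical_to_points(azimuth, elevation, log_depth):
    """Convert (azimuth, elevation, log-depth) predictions to metric 3D
    points; a sketch of the decoupled output representation."""
    depth = torch.exp(log_depth)                 # metric depth along the ray (assumed radial)
    # Unit ray direction from the two angles (z-forward convention assumed).
    x = torch.sin(azimuth) * torch.cos(elevation)
    y = torch.sin(elevation)
    z = torch.cos(azimuth) * torch.cos(elevation)
    rays = torch.stack([x, y, z], dim=-1)        # (..., 3) unit rays
    return depth.unsqueeze(-1) * rays            # (..., 3) metric point cloud
```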
The output space enables loss formulations that separately address angular and radial errors. Specifically, the model's re-formulated MSE in pseudo-spherical coordinates is:

$$\mathcal{L}_{\text{MSE}} = \sum_{c \in \{\theta, \phi, z_{\log}\}} \lambda_c \left( \sigma_c^2 + \mu_c^2 \right),$$

where the error $\boldsymbol{\epsilon} = \hat{\mathbf{o}} - \mathbf{o}$ is the predicted minus ground-truth pseudo-spherical coordinates, $\sigma_c^2$ and $\mu_c$ are the variance and mean of $\epsilon_c$ per dimension over pixels, and $\lambda_c$ weights the angular and depth components.
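A sketch of this loss in code, following the formulation above; the per-component `weights` are an illustrative hyperparameter, not the paper's exact values.

```python
import torch

def pseudo_spherical_mse(pred, target, weights=(1.0, 1.0, 1.0)):
    """Per-dimension variance + squared-mean penalty on the error in
    pseudo-spherical space. pred, target: (N, 3) tensors with columns
    (azimuth, elevation, log_depth)."""
    err = pred - target                       # epsilon = predicted - ground truth
    mean = err.mean(dim=0)                    # per-dimension mean over pixels
    var = err.var(dim=0, unbiased=False)      # per-dimension variance over pixels
    w = torch.tensor(weights, dtype=err.dtype, device=err.device)
    return (w * (var + mean**2)).sum()
```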
3. Loss Functions: Geometric Invariance and Edge-Guided Depth Localization
UniDepth v2 employs multiple complementary loss functions tailored for robust and sharp metric prediction:
(a) Geometric Invariance Loss
This loss encourages depth feature invariance under geometric perturbations such as scaling and translations. Given two geometric augmentations $g_1$, $g_2$ of the input, the invariance loss is:

$$\mathcal{L}_{\text{inv}} = \tfrac{1}{2}\left( \left\lVert \hat{\mathbf{D}}_1 - \operatorname{sg}\!\left[\hat{\mathbf{D}}_2\right] \right\rVert + \left\lVert \hat{\mathbf{D}}_2 - \operatorname{sg}\!\left[\hat{\mathbf{D}}_1\right] \right\rVert \right),$$

where $\hat{\mathbf{D}}_1$, $\hat{\mathbf{D}}_2$ are the camera-conditioned depth predictions for each augmentation and $\operatorname{sg}[\cdot]$ denotes stop-gradient. The symmetric use of stop-gradient ensures reciprocal consistency while preventing degenerate solutions.
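A compact sketch of the symmetric stop-gradient consistency; the actual loss also re-aligns the two augmented views before comparison, which is omitted here, and the L1 penalty is an assumption.

```python
import torch

def invariance_loss(pred_a, pred_b):
    """Symmetric consistency between camera-conditioned predictions from two
    geometric augmentations of the same image; stop-gradient (detach) on the
    target branch prevents degenerate/collapsed solutions."""
    loss_ab = torch.nn.functional.l1_loss(pred_a, pred_b.detach())
    loss_ba = torch.nn.functional.l1_loss(pred_b, pred_a.detach())
    return 0.5 * (loss_ab + loss_ba)
```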
(b) Edge-Guided Normalized Loss (EG-SSI Loss)
Recognizing that monocular depth often blurs object boundaries, UniDepth v2 introduces an edge-guided loss that focuses learning on depth discontinuities. RGB image regions with maximal gradients (i.e., edges) are selected; predicted and ground-truth depth patches in these regions are locally normalized (subtracting the median, dividing by the median absolute deviation) and compared:

$$\mathcal{L}_{\text{EG-SSI}} = \frac{1}{|\mathcal{P}|} \sum_{p \in \mathcal{P}} \left\lVert \hat{N}\!\left(\hat{\mathbf{D}}_p\right) - \hat{N}\!\left(\mathbf{D}_p\right) \right\rVert,$$

with $\mathcal{P}$ as the set of edge-focused patches and $\hat{N}(\cdot)$ denoting local (median/MAD) normalization. This loss promotes sharp, shift- and scale-invariant transitions in depth.
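The local normalization and comparison can be sketched as below; patch extraction around RGB edges is assumed to happen upstream, and the L1 comparison is illustrative.

```python
import torch

def eg_ssi_loss(pred_patches, gt_patches, eps=1e-6):
    """Shift/scale-invariant comparison of edge-centred depth patches:
    each patch is normalized by its median and median absolute deviation
    before the penalty. pred_patches, gt_patches: (P, K) flattened patches."""
    def normalize(x):
        med = x.median(dim=1, keepdim=True).values
        mad = (x - med).abs().median(dim=1, keepdim=True).values
        return (x - med) / (mad + eps)
    return (normalize(pred_patches) - normalize(gt_patches)).abs().mean()
```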
(c) Uncertainty Estimation
An additional network head predicts per-pixel log-depth uncertainty, optimized with an $\ell_1$ norm that aligns the predicted uncertainty with the actual depth estimation error, yielding confidence maps suitable for downstream selection or fusion.
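A minimal sketch of such an uncertainty objective, assuming a plain L1 alignment between the predicted uncertainty and the detached log-depth error (the exact formulation in the paper may differ):

```python
import torch

def uncertainty_loss(pred_log_depth, gt_log_depth, pred_uncertainty):
    """Align per-pixel predicted uncertainty with the actual log-depth error;
    the error target is detached so the uncertainty head does not pull on the
    depth prediction itself."""
    target_error = (pred_log_depth - gt_log_depth).abs().detach()
    return (pred_uncertainty - target_error).abs().mean()
```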
4. Performance Evaluation and Generalization
Evaluated across ten diverse datasets (KITTI, NYU-Depth V2, ETH3D, SUN-RGBD, DIODE, DDAD, nuScenes, VOID, iBims-1, and HAMMER), UniDepth v2 surpasses prior approaches in zero-shot generalization. Key metrics include:
| Metric | Description |
|---|---|
| $\delta_1$ | Fraction of pixels with $\max(\hat{d}/d,\, d/\hat{d}) < 1.25$ |
| $\mathrm{F}_A$ | Area under the curve of the 3D F1-score |
| $\rho_A$ | Average angular error of the predicted camera rays |
UniDepth v2 demonstrates improved $\delta_1$ accuracy and 3D F1 AUC compared to both its predecessor and models requiring ground-truth camera intrinsics. Fine-tuning on standard benchmarks further augments in-domain accuracy, while uncertainty prediction enables robust application in safety-critical systems.
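For reference, the $\delta_1$ accuracy listed above can be computed as follows (a standard formulation, not code from the UniDepth v2 release):

```python
import torch

def delta1_accuracy(pred_depth, gt_depth, threshold=1.25):
    """Fraction of valid pixels whose depth ratio max(pred/gt, gt/pred)
    falls below the threshold (delta_1 when threshold = 1.25)."""
    valid = gt_depth > 0
    ratio = torch.maximum(pred_depth[valid] / gt_depth[valid],
                          gt_depth[valid] / pred_depth[valid])
    return (ratio < threshold).float().mean()
```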
5. Comparative Context and Related Work
The architectural choices and evaluation protocol of UniDepth v2 distinguish it from concurrent monocular depth approaches:
- Depth Anything V2 replaces annotated real images with synthetic training data and relies on large-scale teacher–student pseudo-label transfer, scaling models from 25M to 1.3B parameters and achieving efficient, accurate depth prediction via affine-invariant and gradient-matching losses (Yang et al., 13 Jun 2024).
- Vanishing Depth augments generalized RGB encoders with a self-supervised depth adapter and positional depth encoding (PDE), yielding dense metric depth features in pretrained models without fine-tuning, demonstrating SOTA results on RGBD segmentation, depth completion, and 6D pose estimation tasks (Koch et al., 25 Mar 2025).
- UDGS-SLAM deploys UniDepth outputs within Gaussian splatting SLAM, using IQR-based statistical filtering for depth consistency and jointly optimizing camera and scene parameters via differentiable rendering losses (Mansour et al., 31 Aug 2024).
- 2T-UNet and Unified Perception investigate stereo and video-depth integration, but neither supports universal, zero-shot, camera-agnostic metric estimation from monocular imagery.
A plausible implication is that the disentangled camera representation and edge-guided learning pipeline in UniDepth v2 contribute to its resilience against domain gap and applicability in real-time and uncertainty-aware deployments.
6. Practical Applications and Impact
UniDepth v2 is positioned as a geometric foundation model for:
- Real-time 3D scene reconstruction in AR/VR and content creation.
- Autonomous vehicles and robotics, benefiting from camera-agnostic metric outputs and robust uncertainty estimation for safety-critical decision making.
- SLAM and mapping, utilizing the separation of camera geometry and depth for precise metric localization.
- Multi-modal sensor fusion, where confidence maps produced by UniDepth v2 facilitate integration with LiDAR, stereo, or other modalities.
The universal generalization and confidence estimation capabilities expand the model’s scope to previously impractical domains, supporting practical deployment in diverse and unknown environments.
7. Future Directions
Recommendations for advancing UniDepth v2 involve:
- Investigating further architecture simplifications for reduced inference latency.
- Expanding multi-task frameworks to jointly predict surface normals, semantic segmentation, or pose.
- Extending the edge-guided loss to model depth occlusion relationships and fine topological constraints.
- Integrating concepts from positional depth encoding and teacher–student scaling for improved robustness.
- Systematic evaluation on new benchmarks (e.g., DA-2K) with sparse, scenario-specific annotations to examine model behavior in adverse and underrepresented conditions (Yang et al., 13 Jun 2024).
An ongoing question is how scalable and interpretable the self-prompted camera representation remains as models or datasets grow, and whether additional disentanglement layers or cross-modal fusion mechanisms may further enhance universal monocular metric depth estimation.
In summary, UniDepth v2 combines architectural efficiency, domain-invariant camera prompting, and edge-localized metric depth supervision with practical uncertainty estimation, setting a new standard for universal monocular metric 3D scene modeling. Its empirical performance and extensible design mark it as a reference point for subsequent depth models and downstream 3D vision pipelines.