- The paper introduces UniDepthV2, a simpler universal monocular metric depth estimation method that predicts 3D scenes from single images across diverse domains.
- The method utilizes a simplified architecture and novel losses, including an edge-guided loss, to improve accuracy and enhance the sharpness of predicted metric depth maps.
- Zero-shot evaluations demonstrate UniDepthV2's superior performance and strong generalization on ten diverse datasets compared to existing baselines.
The paper introduces UniDepthV2 (denoted \ourmodel), an evolved iteration of UniDepth designed for universal monocular metric depth estimation (MMDE). \ourmodel aims to predict metric 3D scenes from single images across diverse domains, addressing the limited generalization of existing MMDE methods caused by domain gaps. At inference, the approach directly predicts metric 3D points from the input image without any additional information, striving for a universal and flexible MMDE solution.
Key elements of \ourmodel include:
- A self-promptable camera module that predicts a dense camera representation to condition depth features.
- A pseudo-spherical output representation (θ, ϕ, z), where θ is the azimuth angle, ϕ is the elevation angle, and z is the depth, which disentangles the camera and depth representations (a backprojection sketch follows this list).
- A geometric invariance loss promoting invariance of camera-prompted depth features.
- An edge-guided loss to enhance the localization and sharpness of edges in metric depth outputs.
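To make the pseudo-spherical output representation concrete, the following minimal sketch converts a (θ, ϕ, z) prediction into metric 3D points. The angular convention assumed here, θ = arctan(x/z) and ϕ = arctan(y/z), and the helper name `pseudo_spherical_to_points` are illustrative assumptions, not the paper's exact parameterization.

```python
import torch

def pseudo_spherical_to_points(theta: torch.Tensor, phi: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Convert (azimuth, elevation, depth) maps into camera-frame 3D points.

    Sketch only: assumes theta = arctan(x / z) and phi = arctan(y / z) relative
    to the optical axis; theta and phi are (H, W) angle maps in radians and z is
    an (H, W) metric depth map.
    """
    x = z * torch.tan(theta)  # lateral offset recovered from the azimuth angle
    y = z * torch.tan(phi)    # vertical offset recovered from the elevation angle
    return torch.stack((x, y, z), dim=-1)  # (H, W, 3) metric 3D points
```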
The method leverages a streamlined architectural design and incorporates an uncertainty-level output for downstream tasks requiring confidence. Zero-shot evaluations on ten depth datasets demonstrate the model's performance and generalization capabilities.
\ourmodel's design incorporates a camera module that outputs a dense camera representation, serving as a prompt to the depth module. A pseudo-spherical representation of the output space disentangles camera and depth dimensions. The pinhole-based camera representation is positionally encoded via a sine encoding, improving computational efficiency compared to the original UniDepth.
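As an illustration of how a dense pinhole-based camera representation could be sinusoidally embedded, the sketch below Fourier-encodes per-pixel ray angles at several frequencies. The helper name `sine_embed`, the number of frequencies, and the geometric frequency spacing are assumptions, not the paper's exact configuration.

```python
import torch

def sine_embed(camera_rep: torch.Tensor, num_freqs: int = 16) -> torch.Tensor:
    """Sinusoidal (Fourier) embedding of a dense camera representation.

    camera_rep: (B, C, H, W) per-pixel representation, e.g. ray angles.
    Returns a (B, C * 2 * num_freqs, H, W) embedding.
    """
    freqs = 2.0 ** torch.arange(num_freqs, dtype=camera_rep.dtype, device=camera_rep.device)
    angles = camera_rep.unsqueeze(2) * freqs.view(1, 1, -1, 1, 1)   # (B, C, F, H, W)
    emb = torch.cat((torch.sin(angles), torch.cos(angles)), dim=2)  # (B, C, 2F, H, W)
    return emb.flatten(1, 2)                                        # (B, C*2F, H, W)
```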
A geometric invariance loss enhances depth estimation robustness. This loss ensures that camera-conditioned depth outputs from two views of the same image exhibit reciprocal consistency. The model samples two geometric augmentations, creating different views for each training image to simulate different apparent cameras for the original scene. An uncertainty output and respective loss are included, with pixel-level uncertainties supervised by differences between depth predictions and ground-truth values, enabling applications requiring confidence-aware perception inputs.
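The two losses described above can be roughly sketched as below. Both functions are hypothetical: the geometric un-warping that aligns the two augmented views to a common grid is omitted, and treating one branch as a stop-gradient target, as well as the exact error definition used to supervise the uncertainty, are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def geometric_invariance_loss(feats_view1: torch.Tensor, feats_view2: torch.Tensor) -> torch.Tensor:
    """Consistency between camera-prompted depth features from two augmented views.

    Assumes both (B, C, H, W) feature maps are already resampled to a shared
    reference grid; one branch is used as a stop-gradient target here.
    """
    return F.l1_loss(feats_view1, feats_view2.detach())

def uncertainty_loss(pred_depth: torch.Tensor, pred_uncertainty: torch.Tensor,
                     gt_depth: torch.Tensor, valid: torch.Tensor) -> torch.Tensor:
    """Supervise the pixel-level uncertainty with the observed depth error.

    The target is the absolute depth error on valid pixels; the exact error
    definition and loss form are illustrative choices.
    """
    err = (pred_depth - gt_depth).abs().detach()
    return F.l1_loss(pred_uncertainty[valid], err[valid])
```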
Key contributions highlighted in the extended journal version include:
- A revisited architectural design of the camera-conditioned monocular metric depth estimator network, making \ourmodel simpler and more efficient. This involves simplifying connections between the Camera Module and the Depth Module, using sinusoidal embedding of pinhole-based dense camera representations, including multi-resolution features and convolutional layers in the depth decoder, and applying the geometric invariance loss solely on output-space features.
- A novel edge-guided scale-shift-invariant loss, computed from predicted and ground-truth depth maps around geometric edges, which encourages \ourmodel to better preserve the local structure of the depth map and enhances the sharpness of depth outputs. The edge-guided scale-shift-invariant loss $\mathcal{L}_{\mathrm{EG-SSI}}$ is formulated as follows (a code sketch follows this list):
$\mathcal{L}_{\mathrm{EG-SSI}}(\mathbf{D}, \mathbf{D}^*, \Omega) = \sum_{\omega \in \Omega} \left\| \mathcal{N}_\omega(\mathbf{D}_{\omega}) - \mathcal{N}_\omega(\mathbf{D}^*_{\omega}) \right\|_1$
- Where:
- $\mathbf{D}$ is the predicted inverse depth.
- $\mathbf{D}^*$ is the ground-truth inverse depth.
- $\Omega$ is the set of extracted RGB patches.
- $\mathbf{D}_{\omega}$ denotes the depth values within patch $\omega$.
- $\mathcal{N}_\omega(\cdot)$ denotes standardization over patch $\omega$, i.e., subtracting the median and dividing by the mean absolute deviation (MAD).
- An improved practical training strategy that presents the network with a greater diversity of input image shapes and resolutions within each mini-batch, leading to increased robustness to specific input distributions during inference.
- An additional, uncertainty-level output that allows for reliable quantification of confidence during inference.
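Following the formulation above, a minimal sketch of the edge-guided scale-shift-invariant loss is given below. The extraction of patches around geometric edges is assumed to happen elsewhere, and `mad_standardize` / `edge_guided_ssi_loss` are hypothetical helper names, not the paper's code.

```python
import torch

def mad_standardize(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standardize a patch: subtract its median and divide by its mean absolute deviation."""
    med = x.median()
    mad = (x - med).abs().mean()
    return (x - med) / (mad + eps)

def edge_guided_ssi_loss(pred_inv_depth: torch.Tensor, gt_inv_depth: torch.Tensor, patches) -> torch.Tensor:
    """Edge-guided scale-shift-invariant loss over a set of patches (sketch).

    pred_inv_depth, gt_inv_depth: (H, W) inverse depth maps; patches is an
    iterable of (top, left, height, width) windows assumed to have been
    extracted around geometric edges.
    """
    loss = pred_inv_depth.new_zeros(())
    for top, left, h, w in patches:
        p = mad_standardize(pred_inv_depth[top:top + h, left:left + w])
        g = mad_standardize(gt_inv_depth[top:top + h, left:left + w])
        loss = loss + (p - g).abs().sum()  # L1 norm between standardized patches
    return loss
```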
The overall architecture of \ourmodel comprises an Encoder Backbone, a Camera Module, and a Depth Module. The encoder is ViT-based and produces features at four different scales. The Camera Module initializes its camera parameters as class tokens, processes them via self-attention layers, and embeds the resulting dense camera representation via a sine encoding. The Depth Module receives the encoder feature maps, conditions them on the camera prompt, and processes them with an FPN-style decoder.
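A schematic of this data flow is sketched below; `encoder`, `camera_module`, and `depth_decoder` are stand-in components whose interfaces are assumptions rather than the paper's actual modules.

```python
import torch.nn as nn

class UniDepthV2Skeleton(nn.Module):
    """Schematic of the described pipeline (sketch, not the paper's implementation).

    Assumed interfaces: the encoder returns four feature scales, the camera
    module turns class tokens into a dense, sine-embedded camera prompt, and
    the FPN-style decoder predicts depth plus uncertainty conditioned on it.
    """

    def __init__(self, encoder, camera_module, depth_decoder):
        super().__init__()
        self.encoder = encoder              # ViT backbone -> 4 multi-scale feature maps
        self.camera_module = camera_module  # class tokens + self-attention -> dense camera prompt
        self.depth_decoder = depth_decoder  # FPN-style decoder conditioned on the camera prompt

    def forward(self, image):
        feats = self.encoder(image)                              # list of 4 feature maps
        camera_rep, camera_prompt = self.camera_module(feats)    # dense camera rep + sine embedding
        depth, uncertainty = self.depth_decoder(feats, camera_prompt)
        return depth, uncertainty, camera_rep
```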
The optimization process uses a re-formulation of the Mean Squared Error (MSE) loss in the final 3D output space $(\theta, \phi, z_{\log})$, defined as:
$\mathcal{L}_{\lambda\mathrm{MSE}}(\bm{\varepsilon}) = \|\mathbb{V}[\bm{\varepsilon}]\|_1 + \bm{\lambda}^T(\mathbb{E}[\bm{\varepsilon}]\odot\mathbb{E}[\bm{\varepsilon}])$
- Where:
- $\bm{\varepsilon} = \hat{\mathbf{o}} - \mathbf{o}^* \in \mathbb{R}^3$ is the difference between the predicted and ground-truth 3D outputs
- $\hat{\mathbf{o}} = (\hat{\theta}, \hat{\phi}, \hat{z}_{\log})$ is the predicted 3D output
- $\mathbf{o}^* = (\theta^*, \phi^*, z^*_{\log})$ is the ground-truth 3D value
- $\bm{\lambda} = (\lambda_\theta, \lambda_\phi, \lambda_z) \in \mathbb{R}^3$ is a vector of weights for each dimension of the output
- $\mathbb{V}[\bm{\varepsilon}]$ and $\mathbb{E}[\bm{\varepsilon}]$ are the vectors of empirical variances and means for each of the three output dimensions over all pixels
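A minimal sketch of this loss, assuming the predictions and ground truth for valid pixels are flattened into (N, 3) tensors over the $(\theta, \phi, z_{\log})$ dimensions; `lambda_mse_loss` is a hypothetical helper name.

```python
import torch

def lambda_mse_loss(pred: torch.Tensor, target: torch.Tensor, lam: torch.Tensor) -> torch.Tensor:
    """Variance plus weighted squared-bias re-formulation of the MSE (sketch).

    pred, target: (N, 3) per-pixel outputs over (theta, phi, log-depth);
    lam: (3,) per-dimension weights.
    """
    eps = pred - target                      # per-pixel error, (N, 3)
    var = eps.var(dim=0, unbiased=False)     # V[eps]: empirical variance per output dimension
    mean = eps.mean(dim=0)                   # E[eps]: empirical mean per output dimension
    return var.sum() + (lam * mean * mean).sum()  # ||V[eps]||_1 + lam^T (E[eps] ⊙ E[eps])
```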
The overall loss function is defined as:
$\mathcal{L} = \mathcal{L}_{\lambda\mathrm{MSE}} + \alpha \mathcal{L}_{\mathrm{con}} + \beta \mathcal{L}_{\mathrm{EG-SSI}} + \gamma \mathcal{L}_{\mathrm{L1}}$, with $(\alpha, \beta, \gamma) = (0.1, 1.0, 0.1)$.
The training data includes a combination of 24 publicly available datasets, totaling 16M images. The generalizability of the models is evaluated on 8 datasets not seen during training, grouped into indoor and outdoor settings. Evaluation metrics include $\delta_1^{\mathrm{SSI}}$, $F_A$, and $\rho_A$, with re-evaluation using a fair and consistent pipeline without test-time augmentations.
In zero-shot validation, \ourmodel outperforms existing baselines, including methods that require ground-truth camera parameters at inference time. It shows significant improvement in 3D estimation, reflected in the $F_A$ metric. Ablation studies confirm the importance of each new component introduced in \ourmodel. The architectural modifications, including the removal of the spherical harmonics-based encoding and the integration of multi-resolution feature fusion, contribute to improved efficiency and performance. The introduction of $\mathcal{L}_{\mathrm{EG-SSI}}$ yields improvements in $\delta_1$ and $F_A$, demonstrating its impact on metric accuracy and 3D estimation. Furthermore, experiments validate that the predicted confidence of \ourmodel negatively correlates with the error, showing its reliability.