- The paper introduces UniDepthV2, a simpler universal monocular metric depth estimation method that predicts 3D scenes from single images across diverse domains.
- The method utilizes a simplified architecture and novel losses, including an edge-guided loss, to improve accuracy and enhance the sharpness of predicted metric depth maps.
- Zero-shot evaluations demonstrate UniDepthV2's superior performance and strong generalization on ten diverse datasets compared to existing baselines.
The paper introduces UniDepthV2 (denoted \ourmodel), an evolved iteration of UniDepth designed for universal monocular metric depth estimation (MMDE). \ourmodel aims to predict metric 3D scenes from single images across diverse domains, addressing the limited generalization of existing MMDE methods caused by domain gaps. At inference, the approach directly predicts metric 3D points from the input image without any additional information, striving for a universal and flexible MMDE solution.
Key elements of \ourmodel include:
- A self-promptable camera module that predicts a dense camera representation to condition depth features.
- A pseudo-spherical output representation (θ, ϕ, z), where θ is the azimuth angle, ϕ is the elevation angle, and z is the depth, which disentangles the camera and depth representations (a backprojection sketch follows this list).
- A geometric invariance loss promoting invariance of camera-prompted depth features.
- An edge-guided loss to enhance the localization and sharpness of edges in metric depth outputs.
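To make the pseudo-spherical output representation concrete, the following minimal sketch converts a (θ, ϕ, z) prediction into metric 3D points. The angular convention assumed here, θ = arctan(x/z) and ϕ = arctan(y/z), and the helper name `pseudo_spherical_to_points` are illustrative assumptions, not the paper's exact parameterization.

```python
import torch

def pseudo_spherical_to_points(theta: torch.Tensor, phi: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    """Convert (azimuth, elevation, depth) maps into camera-frame 3D points.

    Sketch only: assumes theta = arctan(x / z) and phi = arctan(y / z) relative
    to the optical axis; theta and phi are (H, W) angle maps in radians and z is
    an (H, W) metric depth map.
    """
    x = z * torch.tan(theta)  # lateral offset recovered from the azimuth angle
    y = z * torch.tan(phi)    # vertical offset recovered from the elevation angle
    return torch.stack((x, y, z), dim=-1)  # (H, W, 3) metric 3D points
```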
The method leverages a streamlined architectural design and incorporates an uncertainty-level output for downstream tasks requiring confidence. Zero-shot evaluations on ten depth datasets demonstrate the model's performance and generalization capabilities.
\ourmodel's design incorporates a camera module that outputs a dense camera representation, serving as a prompt to the depth module. A pseudo-spherical representation of the output space disentangles camera and depth dimensions. The pinhole-based camera representation is positionally encoded via a sine encoding, improving computational efficiency compared to the original UniDepth.
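As an illustration of how a dense pinhole-based camera representation could be sinusoidally embedded, the sketch below Fourier-encodes per-pixel ray angles at several frequencies. The helper name `sine_embed`, the number of frequencies, and the geometric frequency spacing are assumptions, not the paper's exact configuration.

```python
import torch

def sine_embed(camera_rep: torch.Tensor, num_freqs: int = 16) -> torch.Tensor:
    """Sinusoidal (Fourier) embedding of a dense camera representation.

    camera_rep: (B, C, H, W) per-pixel representation, e.g. ray angles.
    Returns a (B, C * 2 * num_freqs, H, W) embedding.
    """
    freqs = 2.0 ** torch.arange(num_freqs, dtype=camera_rep.dtype, device=camera_rep.device)
    angles = camera_rep.unsqueeze(2) * freqs.view(1, 1, -1, 1, 1)   # (B, C, F, H, W)
    emb = torch.cat((torch.sin(angles), torch.cos(angles)), dim=2)  # (B, C, 2F, H, W)
    return emb.flatten(1, 2)                                        # (B, C*2F, H, W)
```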
A geometric invariance loss enhances depth estimation robustness. This loss ensures that camera-conditioned depth outputs from two views of the same image exhibit reciprocal consistency. The model samples two geometric augmentations, creating different views for each training image to simulate different apparent cameras for the original scene. An uncertainty output and respective loss are included, with pixel-level uncertainties supervised by differences between depth predictions and ground-truth values, enabling applications requiring confidence-aware perception inputs.
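The two losses described above can be roughly sketched as below. Both functions are hypothetical: the geometric un-warping that aligns the two augmented views to a common grid is omitted, and treating one branch as a stop-gradient target, as well as the exact error definition used to supervise the uncertainty, are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def geometric_invariance_loss(feats_view1: torch.Tensor, feats_view2: torch.Tensor) -> torch.Tensor:
    """Consistency between camera-prompted depth features from two augmented views.

    Assumes both (B, C, H, W) feature maps are already resampled to a shared
    reference grid; one branch is used as a stop-gradient target here.
    """
    return F.l1_loss(feats_view1, feats_view2.detach())

def uncertainty_loss(pred_depth: torch.Tensor, pred_uncertainty: torch.Tensor,
                     gt_depth: torch.Tensor, valid: torch.Tensor) -> torch.Tensor:
    """Supervise the pixel-level uncertainty with the observed depth error.

    The target is the absolute depth error on valid pixels; the exact error
    definition and loss form are illustrative choices.
    """
    err = (pred_depth - gt_depth).abs().detach()
    return F.l1_loss(pred_uncertainty[valid], err[valid])
```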
Key contributions highlighted in the extended journal version include:
- A revisited architectural design of the camera-conditioned monocular metric depth estimator network, making \ourmodel simpler and more efficient. This involves simplifying connections between the Camera Module and the Depth Module, using sinusoidal embedding of pinhole-based dense camera representations, including multi-resolution features and convolutional layers in the depth decoder, and applying the geometric invariance loss solely on output-space features.
- A novel edge-guided scale-shift-invariant loss, computed from predicted and ground-truth depth maps around geometric edges, which encourages \ourmodel to better preserve the local structure of the depth map and enhances the sharpness of depth outputs. The edge-guided scale-shift-invariant loss $\mathcal{L}_{\mathrm{EG-SSI}}$ is formulated as follows (a code sketch follows this list):
$\mathcal{L}_{\mathrm{EG-SSI}}(\mathbf{D}, \mathbf{D}^*, \Omega) = \sum_{\omega \in \Omega} \left\| \mathcal{N}_\omega(\mathbf{D}_{\omega}) - \mathcal{N}_\omega(\mathbf{D}^*_{\omega}) \right\|_1$
- Where:
- $\mathbf{D}$ is the predicted inverse depth.
- $\mathbf{D}^*$ is the ground-truth inverse depth.
- $\Omega$ is the set of extracted RGB patches.
- $\mathbf{D}_{\omega}$ denotes the depth values within patch $\omega$.
- $\mathcal{N}_\omega(\cdot)$ denotes standardization over patch $\omega$, i.e., subtracting the median and dividing by the mean absolute deviation (MAD).
- An improved practical training strategy that presents the network with a greater diversity of input image shapes and resolutions within each mini-batch, leading to increased robustness to specific input distributions during inference.
- An additional, uncertainty-level output that allows for reliable quantification of confidence during inference.
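Following the formulation above, a minimal sketch of the edge-guided scale-shift-invariant loss is given below. The extraction of patches around geometric edges is assumed to happen elsewhere, and `mad_standardize` / `edge_guided_ssi_loss` are hypothetical helper names, not the paper's code.

```python
import torch

def mad_standardize(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standardize a patch: subtract its median and divide by its mean absolute deviation."""
    med = x.median()
    mad = (x - med).abs().mean()
    return (x - med) / (mad + eps)

def edge_guided_ssi_loss(pred_inv_depth: torch.Tensor, gt_inv_depth: torch.Tensor, patches) -> torch.Tensor:
    """Edge-guided scale-shift-invariant loss over a set of patches (sketch).

    pred_inv_depth, gt_inv_depth: (H, W) inverse depth maps; patches is an
    iterable of (top, left, height, width) windows assumed to have been
    extracted around geometric edges.
    """
    loss = pred_inv_depth.new_zeros(())
    for top, left, h, w in patches:
        p = mad_standardize(pred_inv_depth[top:top + h, left:left + w])
        g = mad_standardize(gt_inv_depth[top:top + h, left:left + w])
        loss = loss + (p - g).abs().sum()  # L1 norm between standardized patches
    return loss
```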
The overall architecture of \ourmodel comprises an Encoder Backbone, a Camera Module, and a Depth Module. The encoder is ViT-based and produces features at four different scales. The Camera Module initializes its camera parameters as class tokens, processes them via self-attention layers, and embeds the resulting dense camera representation via a sine encoding. The Depth Module receives the encoder feature maps, conditions them on the camera prompt, and processes them with an FPN-style decoder.
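A schematic of this data flow is sketched below; `encoder`, `camera_module`, and `depth_decoder` are stand-in components whose interfaces are assumptions rather than the paper's actual modules.

```python
import torch.nn as nn

class UniDepthV2Skeleton(nn.Module):
    """Schematic of the described pipeline (sketch, not the paper's implementation).

    Assumed interfaces: the encoder returns four feature scales, the camera
    module turns class tokens into a dense, sine-embedded camera prompt, and
    the FPN-style decoder predicts depth plus uncertainty conditioned on it.
    """

    def __init__(self, encoder, camera_module, depth_decoder):
        super().__init__()
        self.encoder = encoder              # ViT backbone -> 4 multi-scale feature maps
        self.camera_module = camera_module  # class tokens + self-attention -> dense camera prompt
        self.depth_decoder = depth_decoder  # FPN-style decoder conditioned on the camera prompt

    def forward(self, image):
        feats = self.encoder(image)                              # list of 4 feature maps
        camera_rep, camera_prompt = self.camera_module(feats)    # dense camera rep + sine embedding
        depth, uncertainty = self.depth_decoder(feats, camera_prompt)
        return depth, uncertainty, camera_rep
```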
The optimization process uses a re-formulation of the Mean Squared Error (MSE) loss in the final 3D output space $(\theta, \phi, z_{\log})$, defined as:
$\mathcal{L}_{\lambda\mathrm{MSE}}(\bm{\varepsilon}) = \|\mathbb{V}[\bm{\varepsilon}]\|_1 + \bm{\lambda}^T(\mathbb{E}[\bm{\varepsilon}]\odot\mathbb{E}[\bm{\varepsilon}])$
- Where:
- $\bm{\varepsilon} = \hat{\mathbf{o}} - \mathbf{o}^* \in \mathbb{R}^3$ is the difference between the predicted and ground-truth 3D outputs
- $\hat{\mathbf{o}} = (\hat{\theta}, \hat{\phi}, \hat{z}_{\log})$ is the predicted 3D output
- $\mathbf{o}^* = (\theta^*, \phi^*, z^*_{\log})$ is the ground-truth 3D value
- $\bm{\lambda} = (\lambda_\theta, \lambda_\phi, \lambda_z) \in \mathbb{R}^3$ is a vector of weights for each dimension of the output
- $\mathbb{V}[\bm{\varepsilon}]$ and $\mathbb{E}[\bm{\varepsilon}]$ are the vectors of empirical variances and means for each of the three output dimensions over all pixels
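A minimal sketch of this loss, assuming the predictions and ground truth for valid pixels are flattened into (N, 3) tensors over the $(\theta, \phi, z_{\log})$ dimensions; `lambda_mse_loss` is a hypothetical helper name.

```python
import torch

def lambda_mse_loss(pred: torch.Tensor, target: torch.Tensor, lam: torch.Tensor) -> torch.Tensor:
    """Variance plus weighted squared-bias re-formulation of the MSE (sketch).

    pred, target: (N, 3) per-pixel outputs over (theta, phi, log-depth);
    lam: (3,) per-dimension weights.
    """
    eps = pred - target                      # per-pixel error, (N, 3)
    var = eps.var(dim=0, unbiased=False)     # V[eps]: empirical variance per output dimension
    mean = eps.mean(dim=0)                   # E[eps]: empirical mean per output dimension
    return var.sum() + (lam * mean * mean).sum()  # ||V[eps]||_1 + lam^T (E[eps] ⊙ E[eps])
```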
The overall loss function is defined as:
$\mathcal{L} = \mathcal{L}_{\lambda\mathrm{MSE}} + \alpha \mathcal{L}_{\mathrm{con}} + \beta \mathcal{L}_{\mathrm{EG-SSI}} + \gamma \mathcal{L}_{\mathrm{L1}}$, with $(\alpha, \beta, \gamma) = (0.1, 1.0, 0.1)$.
The training data includes a combination of 24 publicly available datasets, totaling 16M images. The generalizability of the models is evaluated on 8 datasets not seen during training, grouped into indoor and outdoor settings. Evaluation metrics include $\delta_1^{\mathrm{SSI}}$, $F_A$, and $\rho_A$, with re-evaluation using a fair and consistent pipeline without test-time augmentations.
In zero-shot validation, \ourmodel outperforms existing baselines, including methods that require ground-truth camera parameters at inference time. It shows significant improvement in 3D estimation, reflected in the $F_A$ metric. Ablation studies confirm the importance of each new component introduced in \ourmodel. The architectural modifications, including the removal of the spherical harmonics-based encoding and the integration of multi-resolution feature fusion, contribute to improved efficiency and performance. The introduction of $\mathcal{L}_{\mathrm{EG-SSI}}$ yields improvements in $\delta_1$ and $F_A$, demonstrating its impact on metric accuracy and 3D estimation. Furthermore, experiments validate that the predicted confidence of \ourmodel negatively correlates with the error, showing its reliability.