Metric3Dv2: Monocular 3D Reconstruction Model

Updated 2 June 2026
  • Metric3Dv2 is a geometric foundation model that generates metrically accurate depth maps and surface normal fields from a single RGB image, eliminating ambiguity in monocular vision.
  • It leverages a Canonical Camera Space Transformation to standardize focal lengths and a recurrent GRU-based joint depth-normal optimization to refine predictions iteratively.
  • Empirical results on benchmarks like NYUv2 and KITTI demonstrate its state-of-the-art performance in single-image 3D reconstruction, metrology, and downstream vision tasks.

Metric3Dv2 is a geometric foundation model developed for zero-shot metric depth and surface normal estimation from a single RGB image. It directly outputs metrically accurate depth maps and surface normal fields, overcoming the metric ambiguity intrinsic to monocular vision and generalizing across diverse scenes and cameras. Its architectural innovations, including the Canonical Camera Space Transformation and a joint depth-normal optimization framework, support single-image 3D reconstruction and metrology, and allow it to serve as a plug-and-play module for downstream computer vision systems (Hu et al., 2024).

1. Architectural Overview and Objectives

Metric3Dv2 was designed to produce depth $D$ (in real-world metric units) and surface normals $N$ from a single RGB image, with a focus on zero-shot generalization to previously unseen scenes and camera calibrations. Two critical limitations in prior work are addressed:

  • Depth metric ambiguity: Most monocular models output only affine-invariant depth—scale and shift factors unknown—which restricts real-world metric use.
  • Normal estimation data scarcity: Limited availability of outdoor surface normal labels restricts generalization of state-of-the-art (SoTA) methods.

Metric3Dv2 resolves these through two core modules:

  1. Canonical Camera Space Transformation Module (CSTM): Explicitly removes the scale/translation ambiguity arising from variable focal lengths by mapping all inputs into a unified canonical camera space.
  2. Joint Depth–Normal Optimization Module: Integrates rich metric depth knowledge into the normal prediction pathway through iterative, learnable GRU-based refinement.

The overall network employs either ConvNeXt-Large or ViT (DINOv2) as backbone encoders, with parallel decoder heads for low-resolution predictions of canonical-space depth $\hat D_c^0$ and unnormalized normals $\hat N_u^0$. These are refined via an iterative ConvGRU module $\mathcal F$, yielding final outputs after upsampling and normalization. During training, all images and, when needed, annotations are transformed by the CSTM into the canonical space; at inference, outputs are de-canonicalized to recover real-world metric geometry (Hu et al., 2024).

2. Canonical Camera Space Transformation (CSTM)

The principal challenge in monocular metric depth estimation is resolving the ambiguity introduced by unknown or variable camera intrinsic parameters. Under the pinhole camera model, the metric distance $d$ to an object is related to its real size, its image size, and the focal length by:

$d = \hat S \frac{\hat f}{\hat S'}$

where $\hat S$ is the real object size, $\hat f$ is the focal length (mm), and $\hat S'$ is the object's image size (mm). In pixels, with $f = \hat f / \delta$ and $S' = \hat S' / \delta$ (pixel size $\delta$), this gives:

$d = \hat S \frac{f}{S'}$

Uncertainty or variability in $f$ (per image or dataset) leads to “metric ambiguity.”

CSTM standardizes all training and evaluation data to a fixed “canonical” focal length $f^c$ (e.g., $1000$ px), aligning all inputs to consistent intrinsics. Two equivalent strategies are used:

  • CSTM_label: Scale ground-truth depth by $\omega_d = f^c / f$ (no change to image).
  • CSTM_image: Rescale the input image by $\omega_r = f^c / f$, adjusting intrinsics and depths accordingly.

At inference, predicted depths are scaled back by the inverse ratio. This removes the focal-length ambiguity and enables learning of true metric depth even with thousands of camera models in the training data. Ablation studies confirm that without CSTM the model fails to learn metric depth on mixed data (Hu et al., 2024).
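The canonicalization round trip can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes the scaling ratio is the canonical focal length divided by the image's focal length, and the function names and the 1000 px constant are hypothetical choices made for the example.

```python
# Assumed canonical focal length in pixels (illustrative value, not confirmed here).
CANONICAL_FOCAL = 1000.0


def cstm_label(depth_gt, focal):
    """CSTM_label: scale ground-truth depth by f_c / f; the image is untouched."""
    ratio = CANONICAL_FOCAL / focal
    return [[d * ratio for d in row] for row in depth_gt]


def cstm_image(height, width, focal):
    """CSTM_image: resize the image by f_c / f.

    Returns the new (height, width, focal). After the resize the effective
    focal length equals the canonical one, because focal length in pixels
    scales with image size.
    """
    ratio = CANONICAL_FOCAL / focal
    return round(height * ratio), round(width * ratio), focal * ratio


def decanonicalize(depth_canonical, focal):
    """Undo CSTM_label at inference: multiply predictions by f / f_c."""
    ratio = focal / CANONICAL_FOCAL
    return [[d * ratio for d in row] for row in depth_canonical]


if __name__ == "__main__":
    gt = [[2.0, 3.0]]              # metric depth in metres, camera with f = 500 px
    canon = cstm_label(gt, 500.0)  # depth as seen by the canonical camera
    print(canon, decanonicalize(canon, 500.0))
```

A camera with half the canonical focal length has its depth labels doubled, and de-canonicalizing the prediction with the same focal length restores the original metric values.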

3. Joint Depth–Normal Optimization

Supervised outdoor normal labels are scarce, while depth data is abundant (more than 16 million images). To address insufficient generalization in normal prediction, Metric3Dv2 couples the learning of depth and normals via joint optimization. The architecture (Fig. 7 in Hu et al., 2024) implements a recurrent ConvGRU:

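For $t = 0, \dots, T-1$, with initial predictions $\hat D_c^0$, $\hat N_u^0$ and hidden state $H^0$,

$\Delta \hat D_c^{t+1},\ \Delta \hat N_u^{t+1},\ H^{t+1} = \mathcal F(\hat D_c^t, \hat N_u^t, H^t)$

$\hat D_c^{t+1} = \hat D_c^t + \Delta \hat D_c^{t+1}$

$\hat N_u^{t+1} = \hat N_u^t + \Delta \hat N_u^{t+1}$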

This iterative process allows the model to distill structure from metric depth into surface normal predictions, robustly handling large-scale, mixed-annotation datasets.
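The control flow of such a recurrent refinement can be sketched as follows, assuming each step predicts additive corrections to the current depth and unnormalized normal estimates. The learned ConvGRU is replaced here by a hypothetical toy update (`make_toy_update`) that simply halves the remaining error, so the sketch shows only the loop structure and the final normalization, not the real network.

```python
import math


def refine(depth0, normal0, hidden0, update_fn, steps):
    """Apply `steps` residual updates: x_next = x + delta, carrying a hidden state."""
    depth, normal, hidden = list(depth0), list(normal0), hidden0
    for _ in range(steps):
        d_delta, n_delta, hidden = update_fn(depth, normal, hidden)
        depth = [d + dd for d, dd in zip(depth, d_delta)]
        normal = [n + dn for n, dn in zip(normal, n_delta)]
    return depth, normal, hidden


def unit(vec):
    """Normalize the unnormalized normal to unit length after the last step."""
    length = math.sqrt(sum(v * v for v in vec))
    return [v / length for v in vec]


def make_toy_update(depth_target, normal_target):
    """Hypothetical stand-in for the learned update: halves the remaining error."""
    def update(depth, normal, hidden):
        d_delta = [0.5 * (t - d) for t, d in zip(depth_target, depth)]
        n_delta = [0.5 * (t - n) for t, n in zip(normal_target, normal)]
        return d_delta, n_delta, hidden + 1  # toy hidden state just counts steps
    return update


if __name__ == "__main__":
    update = make_toy_update([8.0], [0.0, 0.0, 2.0])
    depth, normal, hidden = refine([0.0], [0.0, 0.0, 0.0], 0, update, steps=3)
    print(depth, unit(normal), hidden)
```

In the toy run the depth estimate moves 0 → 4 → 6 → 7 toward its target of 8, which illustrates why extra steps give diminishing corrections.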

The multi-task loss comprises:

$L = w_d L_d + w_n L_n + w_{d\text{-}n} L_{d\text{-}n}$

with scalar weights $w_d$, $w_n$, and $w_{d\text{-}n}$. $L_d$ provides depth supervision (scale-invariant log loss, virtual normal loss, pair-wise planar consistency, and RPNL), $L_n$ is an aleatoric-uncertainty-aware angular loss for normals (applied where ground truth exists), and $L_{d\text{-}n}$ enforces self-supervised consistency between depth-derived and predicted normals (Hu et al., 2024).
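To make the depth-normal consistency idea concrete, the sketch below derives a normal from a depth map by back-projecting neighbouring pixels and taking a cross product, then compares it with a predicted normal. The cosine-based penalty, the forward-difference stencil, and the camera-facing sign convention are assumptions made for illustration; the source does not specify the exact form used in the paper.

```python
import math


def backproject(u, v, depth, focal, cx, cy):
    """Pinhole back-projection of pixel (u, v) at the given depth to a 3D point."""
    return ((u - cx) * depth / focal, (v - cy) * depth / focal, depth)


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def unit(vec):
    length = math.sqrt(sum(c * c for c in vec))
    return tuple(c / length for c in vec)


def normal_from_depth(depth, u, v, focal, cx, cy):
    """Normal at (u, v) from forward differences of back-projected neighbours."""
    p = backproject(u, v, depth[v][u], focal, cx, cy)
    right = backproject(u + 1, v, depth[v][u + 1], focal, cx, cy)
    down = backproject(u, v + 1, depth[v + 1][u], focal, cx, cy)
    du = tuple(b - a for a, b in zip(p, right))
    dv = tuple(b - a for a, b in zip(p, down))
    # Ordered so that a fronto-parallel plane yields (0, 0, -1), facing the camera.
    return unit(cross(dv, du))


def consistency_loss(pred_normal, depth_normal):
    """1 - cosine similarity: 0 when the normals agree, 2 when they are opposite."""
    cos = sum(a * b for a, b in zip(unit(pred_normal), unit(depth_normal)))
    return 1.0 - cos


if __name__ == "__main__":
    flat = [[2.0, 2.0], [2.0, 2.0]]  # fronto-parallel wall 2 m away
    n = normal_from_depth(flat, 0, 0, focal=100.0, cx=0.5, cy=0.5)
    print(n, consistency_loss((0.0, 0.0, -1.0), n))
```

Because the depth-derived normal needs no normal annotation, a term of this kind can be applied to depth-only datasets, which is what makes it self-supervised.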

4. Training Regimen and Evaluation Protocols

Training uses over 16M RGB images from 18 datasets spanning diverse scene types and thousands of cameras. Normal labels are sourced from subsets including ScanNet, Matterport3D, Taskonomy, Replica, and Hypersim. Batch composition is mixed to balance dataset influence, with a batch size of 192 distributed over 48 A100 GPUs. The AdamW optimizer is used with poly learning-rate decay for 800k iterations. Augmentations include horizontal flip and random crop.

Zero-shot evaluation is performed on benchmarks such as NYUv2, KITTI, iBims-1, DIODE, ETH3D, NuScenes, and ScanNet—each with previously unseen images and intrinsics. Metrics include:

  • Depth: AbsRel ($\frac{1}{N}\sum_i |d_i - d_i^*| / d_i^*$), RMSE, RMSE_log, and $\delta_1$ (fraction with $\max(d_i / d_i^*,\, d_i^* / d_i) < 1.25$).
  • Normals: Median angular error, fraction under angular thresholds (11.25°, 22.5°, 30°).
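The metrics listed above follow standard definitions and can be computed as in this sketch. It operates on flat lists of valid pixels; the function names, the dictionary keys, and the rule of skipping pixels with non-positive ground truth are illustrative assumptions, and predictions are assumed to be positive so that the logarithm is defined.

```python
import math
import statistics


def depth_metrics(pred, gt):
    """AbsRel, RMSE, RMSE_log and delta_1 over paired depths in metres."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]  # skip invalid ground truth
    n = len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n)
    delta1 = sum(max(p / g, g / p) < 1.25 for p, g in pairs) / n
    return {"abs_rel": abs_rel, "rmse": rmse,
            "rmse_log": rmse_log, "delta1": delta1}


def angular_error_deg(a, b):
    """Angle in degrees between two 3D vectors (not necessarily unit length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def normal_metrics(pred, gt, thresholds=(11.25, 22.5, 30.0)):
    """Median angular error and the fraction of pixels under each threshold."""
    errors = [angular_error_deg(p, g) for p, g in zip(pred, gt)]
    result = {"median": statistics.median(errors)}
    for t in thresholds:
        result[f"within_{t}"] = sum(e < t for e in errors) / len(errors)
    return result


if __name__ == "__main__":
    print(depth_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 2.0]))
    print(normal_metrics([(0, 0, 1), (0, 1, 0), (0, 0, 1)],
                         [(0, 0, 1), (0, 0, 1), (0, 0, 1)]))
```

For metric (unaligned) evaluation the predictions are compared to ground truth directly, with no per-image scale or shift fitted beforehand.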

State-of-the-art zero-shot results are reported for AbsRel and $\delta_1$ on NYUv2 and KITTI, along with superior normal metrics on iBims-1 and ScanNet. Metric3Dv2 outperforms affine-invariant depth models (MiDaS, LeReS, DPT) in unaligned zero-shot evaluation (Hu et al., 2024).

5. Foundation Model Integration in Downstream Systems

Metric3Dv2 has been used as a frozen depth estimator in Bird’s Eye View (BEV) perception architectures, such as Lift-Splat-Shoot (LSS) and Simple-BEV, augmenting image-based pipelines with metric scene geometry (Hayes et al., 14 Jan 2025). In LSS, Metric3Dv2-generated depth maps are quantized into 41-bin depth distributions at coarse patch granularity and replace the conventional learned depth head. In Simple-BEV, per-pixel depths are back-projected using each camera’s intrinsics to produce PseudoLiDAR point clouds, which are fed into occupancy grids in the same way as real LiDAR.
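Both integration routes reduce to simple geometric operations on a metric depth map, sketched below. The 4 m to 45 m range with uniform bins and the one-hot encoding are assumptions chosen so that 41 bins result; the source states only the bin count. The helper names are likewise hypothetical.

```python
def depth_to_bins(depth, d_min=4.0, d_max=45.0, num_bins=41):
    """One-hot depth distribution over uniform bins; out-of-range depths clamp."""
    width = (d_max - d_min) / num_bins
    idx = int((depth - d_min) / width)
    idx = max(0, min(num_bins - 1, idx))
    dist = [0.0] * num_bins
    dist[idx] = 1.0
    return dist


def backproject(u, v, depth, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) to a camera-frame 3D point."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def pseudo_lidar(depth_map, fx, fy, cx, cy):
    """Turn a metric depth map into a point list, skipping non-positive depths."""
    points = []
    for v, row in enumerate(depth_map):
        for u, d in enumerate(row):
            if d > 0:
                points.append(backproject(u, v, d, fx, fy, cx, cy))
    return points


if __name__ == "__main__":
    print(depth_to_bins(10.5).index(1.0))  # index of the occupied bin
    print(pseudo_lidar([[2.0, 0.0]], fx=1.0, fy=1.0, cx=0.0, cy=0.0))
```

In a multi-camera rig the resulting points would still need to be transformed by each camera's extrinsics into a common vehicle frame before being binned into a BEV grid.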

Empirical results show substantial improvement: integrating Metric3Dv2 with DINOv2 in LSS yields a 7.5-point IoU gain over baseline EfficientNet heads (40.5 vs 33.0 IoU) on full data, and models trained on only half the data achieve higher IoU than the baseline trained on the full set. In Simple-BEV, PseudoLiDAR from Metric3Dv2 improves absolute IoU over camera-only inputs. Depth quality, rather than backbone feature size, is the dominant improvement factor. Ablations reveal diminishing returns with increased depth-map resolution, indicating that Metric3Dv2 outputs remain informative even at reduced spatial scales (Hayes et al., 14 Jan 2025).

6. Ablation Studies and Module Analyses

Comprehensive ablation studies demonstrate that:

  • Absence of the CSTM results in failures to learn metric depth on mixed real-world data.
  • CamConvs (intrinsics as input channels) are inferior to CSTM in metric generalization.
  • Both CSTM_label and CSTM_image yield stable training, with only a minor difference in absolute error.
  • Adding RPNL to the loss function achieves a further AbsRel reduction over baseline depth losses.
  • Joint depth-normal training: removing the depth branch causes normal estimation to collapse (flat predictions), while removing the normal branch only modestly degrades depth accuracy. Not using mixed depth datasets, or omitting the consistency loss or GRU-based fusion, significantly harms normal estimation.
  • Simpler intermediate (unnormalized 3D vector) normal representations outperform more structured encodings.
  • Optimal number of recurrent GRU steps depends on model scale (T=4 for ViT-S, T=8 for ViT-L/G).

These results substantiate the necessity of both canonicalization and joint iterative optimization for robust metric prediction and generalization (Hu et al., 2024).

7. Applications, Limitations, and Future Directions

Metric3Dv2 enables several applied scenarios:

  • Single-image metric 3D reconstruction: Reconstructions from NYUv2 scenes show lower Chamfer-$\ell_1$ distance and higher F-score than affine or multi-view stereo baselines.
  • Monocular SLAM: Integration into Droid-SLAM (on KITTI and ETH3D) reduces scale drift and translational drift, supporting metrically accurate mapping.
  • In-the-wild metrology: With known EXIF focal length or intrinsics, objects in arbitrary images can be measured with low error.
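The metrology use case can be sketched as a distance between two back-projected pixels. This is an illustrative example under stated assumptions: the focal length in pixels is derived from an EXIF focal length in millimetres and a known sensor width (which EXIF does not always provide), lens distortion is ignored, and all function names are hypothetical.

```python
import math


def focal_px_from_exif(focal_mm, sensor_width_mm, image_width_px):
    """Convert a focal length in millimetres to pixels via the sensor width."""
    return focal_mm / sensor_width_mm * image_width_px


def backproject(u, v, depth_m, focal_px, cx, cy):
    """Pinhole back-projection of pixel (u, v) to a 3D point in metres."""
    return ((u - cx) * depth_m / focal_px,
            (v - cy) * depth_m / focal_px,
            depth_m)


def measure_m(pixel_a, depth_a, pixel_b, depth_b, focal_px, cx, cy):
    """Euclidean distance in metres between two pixels with predicted depths."""
    pa = backproject(pixel_a[0], pixel_a[1], depth_a, focal_px, cx, cy)
    pb = backproject(pixel_b[0], pixel_b[1], depth_b, focal_px, cx, cy)
    return math.dist(pa, pb)


if __name__ == "__main__":
    f = focal_px_from_exif(focal_mm=4.0, sensor_width_mm=8.0, image_width_px=4000)
    # Two endpoints 500 px apart, both predicted at 4 m depth.
    print(measure_m((0, 0), 4.0, (500, 0), 4.0, f, cx=0.0, cy=0.0))
```

When both endpoints lie at the same depth this reduces to the pinhole relation from Section 2 solved for the real size: depth times pixel extent divided by focal length.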

Limitations include the requirement for a known per-image focal length (critical for CSTM), the absence of explicit handling for fisheye or extreme wide-angle distortion, and reliance on depth annotations for large-scale generalization. Open directions are blind focal estimation, end-to-end SLAM integration, improved lighting and reflectance modeling for normals, and scaling training to larger image collections (Hu et al., 2024).

In summary, Metric3Dv2 establishes a foundation for monocular, zero-shot metric depth and surface normal estimation, facilitated by an explicit camera-space canonicalization and joint optimization architecture. Its utility spans diverse computer vision and robotics applications, and it sets new performance benchmarks across a broad range of evaluation protocols (Hu et al., 2024, Hayes et al., 14 Jan 2025).
