Metric3Dv2: Monocular 3D Reconstruction Model

Updated 2 June 2026

Metric3Dv2 is a geometric foundation model that generates metrically accurate depth maps and surface normal fields from a single RGB image, eliminating ambiguity in monocular vision.
It leverages a Canonical Camera Space Transformation to standardize focal lengths and a recurrent GRU-based joint depth-normal optimization to refine predictions iteratively.
Empirical results on benchmarks like NYUv2 and KITTI demonstrate its state-of-the-art performance in single-image 3D reconstruction, metrology, and downstream vision tasks.

Metric3Dv2 is a geometric foundation model developed for zero-shot metric depth and surface normal estimation from a single RGB image. It directly outputs metrically accurate depth maps and surface normal fields, overcoming the metric ambiguity intrinsic to monocular vision and delivering strong generalization across diverse scenes and cameras. Metric3Dv2 introduces architectural innovations—including the Canonical Camera Space Transformation and a joint depth-normal optimization framework—enabling robust performance for single-image 3D reconstruction, metrology, and as a plug-and-play module for downstream computer vision systems (Hu et al., 2024).

1. Architectural Overview and Objectives

Metric3Dv2 was designed to produce depth $D$ (in real-world metric units) and surface normals $N$ from a single RGB image, with a focus on zero-shot generalization to previously unseen scenes and camera calibrations. Two critical limitations in prior work are addressed:

Depth metric ambiguity: Most monocular models output only affine-invariant depth—scale and shift factors unknown—which restricts real-world metric use.
Normal estimation data scarcity: Limited availability of outdoor surface normal labels restricts generalization of state-of-the-art (SoTA) methods.

Metric3Dv2 resolves these through two core modules:

Canonical Camera Space Transformation Module (CSTM): Explicitly removes the scale/translation ambiguity arising from variable focal lengths by mapping all inputs into a unified canonical camera space.
Joint Depth–Normal Optimization Module: Integrates rich metric depth knowledge into the normal prediction pathway through iterative, learnable GRU-based refinement.

The overall network employs either ConvNeXt-Large or ViT (DINOv2) as the backbone encoder, with parallel decoder heads that produce low-resolution predictions of canonical-space depth $\hat D_c^0$ and unnormalized normals $\hat N_u^0$. These are refined by an iterative ConvGRU module $\mathcal F$, and the final outputs are obtained after upsampling and normalization. During training, all images and, when needed, annotations are transformed by the CSTM into the canonical space; at inference, outputs are de-canonicalized to recover real-world metric geometry (Hu et al., 2024).

2. Canonical Camera Space Transformation (CSTM)

The principal challenge in monocular metric depth estimation is resolving the ambiguity introduced by unknown or variable camera intrinsic parameters. Under the pinhole camera model, the metric distance $d$ to an object is related to its real and image size and the focal length by:

$d = \hat S \frac{\hat f}{\hat S'}$

where $\hat S$ is the real object size, $\hat f$ is the focal length (mm), and $\hat S'$ is the object's image size (mm). In pixels, with $f = \hat f / \delta$ and $S' = \hat S' / \delta$ (pixel size $\delta$), this gives:

$d = \hat S \frac{f}{S'}$

Uncertainty or variability in $f$ (per image or dataset) leads to “metric ambiguity.”
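To make the ambiguity concrete, the sketch below evaluates the pinhole relation for one object that occupies the same number of pixels in two cameras with different focal lengths. The function name and all numbers are illustrative assumptions, not values from the paper.

```python
def pinhole_depth(real_size_m: float, focal_px: float, image_size_px: float) -> float:
    """Depth from the pinhole relation d = S * f / S', with f and S' in pixels."""
    return real_size_m * focal_px / image_size_px


# A 1.5 m tall object that spans 100 px in both images (illustrative numbers).
d_short = pinhole_depth(1.5, 500.0, 100.0)   # shorter focal length
d_long = pinhole_depth(1.5, 1000.0, 100.0)   # focal length doubled

# Identical appearance, different metric depth: the image alone cannot
# distinguish the two cases unless the focal length is known.
print(d_short, d_long)
```

Doubling the focal length doubles the depth that explains the same image footprint, which is why a network trained on mixed cameras cannot learn a consistent metric scale from appearance alone.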

CSTM standardizes all training and evaluation data to a fixed “canonical” focal length $f_c$ (e.g., $f_c = 1000$ px), aligning all inputs to consistent intrinsics. Two equivalent strategies are used:

CSTM_label: Scale ground-truth depth by $f_c / f$ (no change to image).
CSTM_image: Rescale the input image by $f_c / f$, adjusting intrinsics and depths accordingly.

At inference, predicted depths are inversely scaled back. This eliminates the focal-length ambiguity and enables learning of true metric depth even with thousands of camera models in the training data, a capability confirmed by ablation studies (CSTM is strictly required for metrics to converge) (Hu et al., 2024).
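The label-side canonicalization and its inverse can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the function names are hypothetical, the canonical focal length of 1000 px is used only as an example value, and the real pipeline also handles cropping, resizing, and principal-point shifts that are omitted here.

```python
import numpy as np

F_CANONICAL = 1000.0  # assumed canonical focal length in pixels (example value)


def cstm_label(depth_gt: np.ndarray, focal_px: float, f_c: float = F_CANONICAL) -> np.ndarray:
    """CSTM_label: leave the image untouched, scale ground-truth depth by f_c / f."""
    return depth_gt * (f_c / focal_px)


def cstm_image_scale(focal_px: float, f_c: float = F_CANONICAL) -> float:
    """CSTM_image: resize ratio for the input image; depth values are not rescaled."""
    return f_c / focal_px


def decanonicalize(depth_canonical: np.ndarray, focal_px: float, f_c: float = F_CANONICAL) -> np.ndarray:
    """Inverse of cstm_label, applied to predictions at inference time."""
    return depth_canonical * (focal_px / f_c)


depth = np.array([[2.0, 4.0]])           # metres, seen by a camera with f = 500 px
canonical = cstm_label(depth, 500.0)     # depth as a 1000 px camera would explain it
restored = decanonicalize(canonical, 500.0)
print(canonical, restored)
```

Both strategies use the same ratio because, under the pinhole relation, either scaling depth by $f_c / f$ with the image fixed or resizing the image by $f_c / f$ with depth fixed yields data consistent with a camera of focal length $f_c$.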

3. Joint Depth–Normal Optimization

Supervised outdoor normal labels are scarce, while depth data is abundantly available (roughly 16 million images). To address insufficient generalization in normal prediction, Metric3Dv2 couples the learning of depth and normals via joint optimization. The architecture (Fig. 7 in Hu et al., 2024) implements a recurrent ConvGRU:

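For $t = 0, \dots, T-1$, with initial predictions $\hat D_c^0$, $\hat N_u^0$ and hidden state $H^0$,

$\Delta \hat D_c^{t+1},\ \Delta \hat N_u^{t+1},\ H^{t+1} = \mathcal F(\hat D_c^t, \hat N_u^t, H^t)$

$\hat D_c^{t+1} = \hat D_c^t + \Delta \hat D_c^{t+1}$

$\hat N_u^{t+1} = \hat N_u^t + \Delta \hat N_u^{t+1}$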

This iterative process allows the model to distill structure from metric depth into surface normal predictions, robustly handling large-scale, mixed-annotation datasets.
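The control flow of such a residual refinement loop can be sketched as follows. This is not the learned ConvGRU: `toy_update` is a hypothetical stand-in that simply moves the estimates halfway toward fixed targets, so that the loop structure (predict updates, add them, normalize normals at the end) can be run and inspected.

```python
import numpy as np


def refine(depth0, normal0, hidden0, update_fn, num_steps):
    """Iteratively add predicted residuals to depth and unnormalized normals."""
    depth, normal, hidden = depth0, normal0, hidden0
    for _ in range(num_steps):
        d_delta, n_delta, hidden = update_fn(depth, normal, hidden)
        depth = depth + d_delta
        normal = normal + n_delta
    # Normals stay unnormalized during refinement; unit length is enforced once at the end.
    length = np.linalg.norm(normal, axis=-1, keepdims=True)
    return depth, normal / np.clip(length, 1e-8, None)


def toy_update(depth, normal, hidden):
    """Hypothetical stand-in for the learned update: step halfway toward fixed targets."""
    target_depth = np.full_like(depth, 2.0)
    target_normal = np.zeros_like(normal)
    target_normal[..., 2] = 1.0
    return 0.5 * (target_depth - depth), 0.5 * (target_normal - normal), hidden


depth0 = np.ones((4, 4))
normal0 = np.zeros((4, 4, 3))
normal0[..., 0] = 1.0
depth, normal = refine(depth0, normal0, hidden0=None, update_fn=toy_update, num_steps=4)
print(depth[0, 0], normal[0, 0])
```

In the real model the update function is a network that sees both estimates at every step, which is what lets depth information influence the normal residuals.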

The multi-task loss comprises:

$L = w_d L_d + w_n L_n + w_{d\text{-}n} L_{d\text{-}n}$

with scalar loss weights $w_d$, $w_n$, and $w_{d\text{-}n}$. $L_d$ provides depth supervision (scale-invariant log loss, virtual normal loss, pair-wise planar consistency, and the random proposal normalization loss, RPNL), $L_n$ is an aleatoric-uncertainty-aware angular loss for normals (applied when ground truth exists), and $L_{d\text{-}n}$ enforces self-supervised consistency between depth-derived and predicted normals (Hu et al., 2024).
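The depth-normal consistency idea can be illustrated by deriving normals from a depth map and comparing them with predicted normals. The sketch below uses a common finite-difference construction (cross products of back-projected neighbours); it is an assumption for illustration, and the operator used in the paper may differ. Function names are hypothetical.

```python
import numpy as np


def backproject(depth, fx, fy, cx, cy):
    """Lift every pixel to a 3D point in camera coordinates (z forward)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)


def normals_from_depth(depth, fx, fy, cx, cy):
    """Unit normals from cross products of neighbouring 3D points, shape (h-1, w-1, 3)."""
    pts = backproject(depth, fx, fy, cx, cy)
    du = pts[:-1, 1:] - pts[:-1, :-1]   # step to the right neighbour
    dv = pts[1:, :-1] - pts[:-1, :-1]   # step to the lower neighbour
    n = np.cross(dv, du)                # this order makes normals face the camera (-z)
    return n / np.clip(np.linalg.norm(n, axis=-1, keepdims=True), 1e-12, None)


def angular_error_deg(n_a, n_b):
    """Per-pixel angle in degrees between two fields of unit normals."""
    cos = np.clip(np.sum(n_a * n_b, axis=-1), -1.0, 1.0)
    return np.degrees(np.arccos(cos))


# A fronto-parallel plane 2 m away should give normals pointing straight at the camera.
plane = np.full((5, 5), 2.0)
n_depth = normals_from_depth(plane, fx=500.0, fy=500.0, cx=2.0, cy=2.0)
n_pred = np.zeros_like(n_depth)
n_pred[..., 2] = -1.0
consistency = angular_error_deg(n_depth, n_pred).mean()
print(consistency)
```

A consistency term of this kind needs no normal labels, which is what allows depth-only datasets to supervise the normal branch.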

4. Training Regimen and Evaluation Protocols

Training uses over 16M RGB images from 18 datasets spanning diverse scene types and thousands of cameras. Normal labels are sourced from subsets including ScanNet, Matterport3D, Taskonomy, Replica, and Hypersim. Batch composition is mixed to balance dataset influence, with a batch size of 192 distributed over 48 A100 GPUs. The AdamW optimizer is used with poly learning-rate decay for 800k iterations. Augmentations include horizontal flip and random crop.
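For readers unfamiliar with poly decay, a minimal sketch of the schedule follows. The base learning rate and the exponent are placeholder assumptions chosen for illustration; only the 800k iteration count comes from the text above.

```python
def poly_lr(step: int, total_steps: int, base_lr: float,
            power: float = 0.9, end_lr: float = 0.0) -> float:
    """Polynomial decay from base_lr at step 0 to end_lr at total_steps."""
    frac = min(step, total_steps) / total_steps
    return (base_lr - end_lr) * (1.0 - frac) ** power + end_lr


TOTAL_STEPS = 800_000   # from the training description
BASE_LR = 1e-4          # hypothetical value, for illustration only

schedule = [poly_lr(s, TOTAL_STEPS, BASE_LR) for s in (0, 200_000, 400_000, 800_000)]
print(schedule)
```

The rate falls smoothly and reaches its final value exactly at the last iteration, unlike step schedules that drop abruptly.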

Zero-shot evaluation is performed on benchmarks such as NYUv2, KITTI, iBims-1, DIODE, ETH3D, NuScenes, and ScanNet—each with previously unseen images and intrinsics. Metrics include:

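Depth: AbsRel ($\frac{1}{n}\sum_i |d_i - d_i^*| / d_i^*$, with prediction $d_i$ and ground truth $d_i^*$), RMSE, RMSE_log, and $\delta_1$ (fraction of pixels with $\max(d_i / d_i^*, d_i^* / d_i) < 1.25$).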
Normals: Median angular error, fraction under angular thresholds (11.25°, 22.5°, 30°).
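These metrics are standard and can be computed as in the sketch below. Function names and the convention of masking out non-positive ground-truth depths are assumptions for illustration; benchmark-specific crops and depth caps are omitted.

```python
import numpy as np


def depth_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """AbsRel, RMSE, RMSE_log and delta_1 over pixels with valid ground truth."""
    mask = gt > 0
    p, g = pred[mask], gt[mask]
    ratio = np.maximum(p / g, g / p)
    return {
        "abs_rel": float(np.mean(np.abs(p - g) / g)),
        "rmse": float(np.sqrt(np.mean((p - g) ** 2))),
        "rmse_log": float(np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2))),
        "delta_1": float(np.mean(ratio < 1.25)),
    }


def normal_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Median angular error and fractions under the usual thresholds (unit normals)."""
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    err = np.degrees(np.arccos(cos))
    out = {"median_deg": float(np.median(err))}
    for thr in (11.25, 22.5, 30.0):
        out[f"within_{thr}"] = float(np.mean(err < thr))
    return out


gt_depth = np.array([1.0, 2.0, 4.0, 0.0])      # last pixel has no ground truth
pred_depth = np.array([1.1, 2.0, 6.0, 3.0])
d = depth_metrics(pred_depth, gt_depth)

gt_n = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]])
pred_n = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
n = normal_metrics(pred_n, gt_n)
print(d, n)
```

In the unaligned zero-shot setting discussed here, predictions are compared to ground truth directly, without fitting a per-image scale or shift first.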

State-of-the-art zero-shot results are reported for AbsRel and $\delta_1$ on NYUv2 and KITTI, along with superior normal metrics on iBims-1 and ScanNet. Metric3Dv2 outperforms affine-invariant depth models (MiDaS, LeReS, DPT) in unaligned zero-shot evaluation (Hu et al., 2024).

5. Foundation Model Integration in Downstream Systems

Metric3Dv2 has been used as a frozen depth estimator in Bird’s Eye View (BEV) perception architectures, such as Lift-Splat-Shoot (LSS) and Simple-BEV, augmenting image-based pipelines with metric scene geometry (Hayes et al., 14 Jan 2025). In LSS, Metric3Dv2-generated depth maps are quantized into 41-bin depth distributions at coarse patch granularity and replace the conventional learned depth head. In Simple-BEV, per-pixel depths are backprojected using each camera’s intrinsics to produce PseudoLiDAR point clouds, which are fed into occupancy grids in the same way as real LiDAR.
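Both integration routes reduce to simple array operations, sketched below. The 4 m to 45 m range with uniform bins is an assumption chosen so that 41 bins are one metre wide; the function names are hypothetical and the patch-level downsampling step is omitted.

```python
import numpy as np


def depth_to_bins(depth: np.ndarray, d_min: float = 4.0, d_max: float = 45.0,
                  num_bins: int = 41) -> np.ndarray:
    """One-hot depth distribution per pixel, shape (..., num_bins)."""
    edges = np.linspace(d_min, d_max, num_bins + 1)
    idx = np.clip(np.digitize(depth, edges) - 1, 0, num_bins - 1)  # clamp out-of-range depths
    return np.eye(num_bins, dtype=np.float32)[idx]


def backproject_to_points(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Pseudo-LiDAR: lift an (h, w) metric depth map to (h*w, 3) camera-frame points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T          # rays with unit z component
    return rays * depth.reshape(-1, 1)


depth_map = np.array([[4.5, 10.2], [44.9, 100.0]])
bins = depth_to_bins(depth_map)

K = np.array([[100.0, 0.0, 1.0], [0.0, 100.0, 1.0], [0.0, 0.0, 1.0]])
points = backproject_to_points(np.full((3, 3), 2.0), K)
print(bins.argmax(axis=-1), points[4], points[5])
```

Because the depth is metric, the resulting points can be placed in the same ego-frame grid as real LiDAR returns after applying each camera's extrinsics.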

Empirical results show substantial improvement: integrating Metric3Dv2 with DINOv2 in LSS yields a 7.5-point IoU gain over baseline EfficientNet heads (40.5 vs 33.0 IoU) on full data, and models trained on only half the data achieve higher IoU than the baseline trained on the full set. In Simple-BEV, PseudoLiDAR from Metric3Dv2 gives an absolute IoU gain over camera-only inputs. Depth quality, rather than backbone feature size, is the dominant improvement factor. Ablations reveal diminishing returns with increased depth-map resolution, indicating that Metric3Dv2 outputs remain informative even at reduced spatial scales (Hayes et al., 14 Jan 2025).

6. Ablation Studies and Module Analyses

Comprehensive ablation studies demonstrate that:

Absence of the CSTM results in failures to learn metric depth on mixed real-world data.
CamConvs (intrinsics as input channels) are inferior to CSTM in metric generalization.
Both CSTM_label and CSTM_image yield stable training, with only a minor absolute error difference between them.
Incorporating RPNL into the loss function achieves a further AbsRel reduction over baseline depth losses.
Joint depth-normal training: removing the depth branch causes normal estimation to collapse (flat predictions), while removing the normal branch only modestly degrades depth accuracy. Not using mixed depth datasets or omitting consistency/GRU-based fusion significantly harms normal estimation.
Simpler intermediate (unnormalized 3D vector) normal representations outperform more structured encodings.
Optimal number of recurrent GRU steps depends on model scale (T=4 for ViT-S, T=8 for ViT-L/G).

These results substantiate the necessity of both canonicalization and joint iterative optimization for robust metric prediction and generalization (Hu et al., 2024).

7. Applications, Limitations, and Future Directions

Metric3Dv2 enables several applied scenarios:

Single-image metric 3D reconstruction: Reconstructions from NYUv2 scenes show lower Chamfer-$\ell_1$ distance and higher F-score than affine or multi-view stereo baselines.
Monocular SLAM: Integration into Droid-SLAM (on KITTI and ETH3D) reduces scale drift and translational drift, supporting metrically accurate mapping.
In-the-wild metrology: With known EXIF focal length or intrinsics, objects in arbitrary images can be measured with low error.
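The metrology use case amounts to back-projecting two pixels with the predicted metric depth and measuring the distance between the resulting 3D points. The sketch below assumes an ideal pinhole camera; the function names are hypothetical, and the EXIF conversion uses the standard relation between focal length in millimetres, sensor width, and image width.

```python
import numpy as np


def focal_mm_to_px(focal_mm: float, sensor_width_mm: float, image_width_px: int) -> float:
    """Convert an EXIF focal length in mm to pixels for a given sensor and image width."""
    return focal_mm * image_width_px / sensor_width_mm


def pixel_to_point(u, v, depth, fx, fy, cx, cy):
    """Back-project one pixel with metric depth to a camera-frame 3D point."""
    return np.array([(u - cx) / fx * depth, (v - cy) / fy * depth, depth])


def measure_distance(p1, p2, depth_map, fx, fy, cx, cy) -> float:
    """Metric distance between the 3D points behind two pixels given as (u, v)."""
    a = pixel_to_point(p1[0], p1[1], depth_map[p1[1], p1[0]], fx, fy, cx, cy)
    b = pixel_to_point(p2[0], p2[1], depth_map[p2[1], p2[0]], fx, fy, cx, cy)
    return float(np.linalg.norm(a - b))


f_px = focal_mm_to_px(50.0, 36.0, 3600)      # 50 mm lens on a 36 mm wide sensor
depth_map = np.full((10, 600), 2.0)          # flat surface 2 m away (toy prediction)
width_m = measure_distance((0, 0), (500, 0), depth_map, fx=1000.0, fy=1000.0, cx=0.0, cy=0.0)
print(f_px, width_m)
```

Any error in the assumed focal length scales the measurement proportionally, which is why accurate intrinsics matter for this application.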

Limitations include the requirement for a known per-image focal length (critical for CSTM), the absence of explicit handling for fisheye or extreme wide-angle distortion, and reliance on depth annotations for large-scale generalization. Open directions are blind focal estimation, end-to-end SLAM integration, improved lighting and reflectance modeling for normals, and a further expansion of the training set (Hu et al., 2024).

In summary, Metric3Dv2 establishes a foundation for monocular, zero-shot metric depth and surface normal estimation, facilitated by an explicit camera-space canonicalization and joint optimization architecture. Its utility spans diverse computer vision and robotics applications, and it sets new performance benchmarks across a broad range of evaluation protocols (Hu et al., 2024, Hayes et al., 14 Jan 2025).


References

Hu et al. (2024). Metric3Dv2: A Versatile Monocular Geometric Foundation Model for Zero-shot Metric Depth and Surface Normal Estimation.

Hayes et al. (14 Jan 2025). Revisiting Birds Eye View Perception Models with Frozen Foundation Models: DINOv2 and Metric3Dv2.
