Metric3Dv2: Monocular 3D Reconstruction Model
- Metric3Dv2 is a geometric foundation model that generates metrically accurate depth maps and surface normal fields from a single RGB image, resolving the metric ambiguity inherent in monocular vision.
- It leverages a Canonical Camera Space Transformation to standardize focal lengths and a recurrent GRU-based joint depth-normal optimization to refine predictions iteratively.
- Empirical results on benchmarks like NYUv2 and KITTI demonstrate its state-of-the-art performance in single-image 3D reconstruction, metrology, and downstream vision tasks.
Metric3Dv2 is a geometric foundation model developed for zero-shot metric depth and surface normal estimation from a single RGB image. It directly outputs metrically accurate depth maps and surface normal fields, overcoming the metric ambiguity intrinsic to monocular vision and generalizing strongly across diverse scenes and cameras. Its architectural innovations, including the Canonical Camera Space Transformation and a joint depth-normal optimization framework, enable robust performance in single-image 3D reconstruction and metrology, and allow it to serve as a plug-and-play module for downstream computer vision systems (Hu et al., 2024).
1. Architectural Overview and Objectives
Metric3Dv2 was designed to produce depth (in real-world metric units) and surface normals from a single RGB image, with a focus on zero-shot generalization to previously unseen scenes and camera calibrations. Two critical limitations in prior work are addressed:
- Depth metric ambiguity: Most monocular models output only affine-invariant depth—scale and shift factors unknown—which restricts real-world metric use.
- Normal estimation data scarcity: Limited availability of outdoor surface normal labels restricts generalization of state-of-the-art (SoTA) methods.
Metric3Dv2 resolves these through two core modules:
- Canonical Camera Space Transformation Module (CSTM): Explicitly removes the scale/translation ambiguity arising from variable focal lengths by mapping all inputs into a unified canonical camera space.
- Joint Depth–Normal Optimization Module: Integrates rich metric depth knowledge into the normal prediction pathway through iterative, learnable GRU-based refinement.
The overall network employs either ConvNeXt-Large or ViT (DINOv2) as backbone encoders, with parallel decoder heads for low-resolution predictions of canonical-space depth and unnormalized normals. These are refined via an iterative ConvGRU module, yielding final outputs after upsampling and normalization. At training, all images and, when needed, annotations are transformed by the CSTM into the canonical space; at inference, outputs are de-canonicalized to recover real-world metric geometry (Hu et al., 2024).
2. Canonical Camera Space Transformation (CSTM)
The principal challenge in monocular metric depth estimation is resolving the ambiguity introduced by unknown or variable camera intrinsic parameters. Under the pinhole camera model, the metric distance to an object is related to its real and image size and the focal length by:
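$$d = \hat{S} \cdot \frac{\hat{f}}{\hat{S}'}$$
where $\hat{S}$ is the real object size, $\hat{f}$ is the focal length (mm), and $\hat{S}'$ is the object's image size (mm). In pixels, with $f = \hat{f}/\delta$ and $S' = \hat{S}'/\delta$ (pixel size $\delta$), this gives:
$$d = \hat{S} \cdot \frac{f}{S'}$$
Uncertainty or variability in $f$ (per image or dataset) leads to “metric ambiguity.”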
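As a small numeric illustration of this ambiguity, the sketch below evaluates the pinhole relation (distance = real size × focal length / image size, with focal length and image size both in pixels) for two cameras. The function name and all numbers are illustrative assumptions, not values from the paper.

```python
def metric_distance(real_size_m: float, focal_px: float, image_size_px: float) -> float:
    """Pinhole relation: distance = real size * focal length / image size."""
    if image_size_px <= 0:
        raise ValueError("image size must be positive")
    return real_size_m * focal_px / image_size_px


# The same 1.8 m object spanning 200 px is consistent with very different
# distances depending on the (possibly unknown) focal length.
near = metric_distance(1.8, focal_px=500.0, image_size_px=200.0)
far = metric_distance(1.8, focal_px=1500.0, image_size_px=200.0)
print(near, far)
```

Identical image evidence maps to distances that differ by the ratio of the two focal lengths, which is why a network trained on mixed cameras cannot learn metric depth without accounting for the focal length.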
CSTM standardizes all training and evaluation data to a fixed “canonical” focal length $f_c$ (in pixels), aligning all inputs to consistent intrinsics. Writing $f$ for the source camera's focal length in pixels, two equivalent strategies are used:
- CSTM_label: Scale ground-truth depth by $\omega_d = f_c / f$ (no change to image).
- CSTM_image: Rescale the input image by $\omega_r = f_c / f$, adjusting intrinsics and depth maps accordingly.
At inference, predicted depths are inversely scaled back. This eliminates the focal-length ambiguity and enables learning of true metric depth even with thousands of camera models in the training data, a capability confirmed by ablation studies (CSTM is strictly required for metrics to converge) (Hu et al., 2024).
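The two canonicalization strategies and the inverse step can be sketched as follows. This is a minimal illustration that assumes the ratio between canonical and source focal length is the scaling factor in both variants. The `Intrinsics` class, the function names, and the canonical focal length of 1000 px used in the usage lines are assumptions for the example, not the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    fx: float  # focal length in pixels (a single focal length is assumed)
    cx: float  # principal point, x
    cy: float  # principal point, y


def cstm_label(depth_m, intr, f_canonical):
    """Label variant: keep the image, scale ground-truth depth by f_c / f."""
    ratio = f_canonical / intr.fx
    return [[d * ratio for d in row] for row in depth_m]


def cstm_image_params(height, width, intr, f_canonical):
    """Image variant: resize ratio, resized shape, and rescaled intrinsics.

    The image (and its depth map) would be resized to the returned shape;
    the resizing itself is left to an image library.
    """
    ratio = f_canonical / intr.fx
    new_size = (round(height * ratio), round(width * ratio))
    new_intr = Intrinsics(intr.fx * ratio, intr.cx * ratio, intr.cy * ratio)
    return ratio, new_size, new_intr


def decanonicalize(pred_canonical_depth, intr, f_canonical):
    """Inverse of the label variant: scale predictions by f / f_c."""
    ratio = intr.fx / f_canonical
    return [[d * ratio for d in row] for row in pred_canonical_depth]


# Usage with a hypothetical 500 px camera and f_c = 1000 px.
camera = Intrinsics(fx=500.0, cx=320.0, cy=240.0)
canonical_depth = cstm_label([[2.0, 4.0]], camera, f_canonical=1000.0)
recovered = decanonicalize(canonical_depth, camera, f_canonical=1000.0)
```

After the image variant, the rescaled focal length equals the canonical one, so every training sample appears to come from the same camera.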
3. Joint Depth–Normal Optimization
Supervised outdoor normal labels are scarce, while depth data is abundantly available (over 16 million images). To address insufficient generalization in normal prediction, Metric3Dv2 couples the learning of depth and normals via joint optimization. The architecture (Fig. 7 in Hu et al., 2024) implements a recurrent ConvGRU:
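For $t = 0, \dots, T-1$, with initial predictions $(\hat{D}_c^{0}, \hat{N}_u^{0})$ and hidden state $H^{0}$,
$$H^{t+1} = \mathrm{ConvGRU}\big(\hat{D}_c^{t}, \hat{N}_u^{t}, H^{t}\big)$$
$$\hat{D}_c^{t+1} = \hat{D}_c^{t} + \Delta\hat{D}_c^{t+1}$$
$$\hat{N}_u^{t+1} = \hat{N}_u^{t} + \Delta\hat{N}_u^{t+1}$$
where $\hat{D}_c$ is the canonical-space depth, $\hat{N}_u$ is the unnormalized normal map, and the residuals $\Delta\hat{D}_c^{t+1}$ and $\Delta\hat{N}_u^{t+1}$ are predicted from the updated hidden state $H^{t+1}$.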
This iterative process allows the model to distill structure from metric depth into surface normal predictions, robustly handling large-scale, mixed-annotation datasets.
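The control flow of such a recurrent refinement can be sketched with a toy, single-pixel GRU. This is only a structural illustration: the real module is convolutional and learned, whereas here the hidden state is a scalar, the gates use fixed hand-picked weights, and the residual heads (`gd`, `gn`) are plain multipliers. All names and values are assumptions.

```python
import math


def _sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def gru_cell(x, h, p):
    """One GRU update for a single pixel with a scalar hidden state."""
    def proj(weights):
        return sum(w * xi for w, xi in zip(weights, x))

    z = _sigmoid(proj(p["wz"]) + p["uz"] * h)               # update gate
    r = _sigmoid(proj(p["wr"]) + p["ur"] * h)               # reset gate
    candidate = math.tanh(proj(p["wh"]) + p["uh"] * r * h)  # candidate state
    return (1.0 - z) * h + z * candidate


def refine(depth, normal, h, p, steps):
    """Refine depth and an unnormalized normal for `steps` iterations."""
    for _ in range(steps):
        h = gru_cell([depth, *normal], h, p)
        depth = depth + p["gd"] * h                            # residual depth update
        normal = [n + g * h for n, g in zip(normal, p["gn"])]  # residual normal update
    length = math.sqrt(sum(n * n for n in normal)) or 1.0
    return depth, [n / length for n in normal], h              # normalize at the end


# Hypothetical fixed weights; a trained model would learn these.
params = {
    "wz": [0.1, 0.0, 0.0, 0.1], "uz": 0.5,
    "wr": [0.1, 0.0, 0.0, 0.1], "ur": 0.5,
    "wh": [0.2, 0.1, 0.1, 0.1], "uh": 0.5,
    "gd": 0.05, "gn": [0.01, 0.01, 0.01],
}
depth, normal, hidden = refine(2.0, [0.1, 0.2, 0.9], 0.0, params, steps=4)
```

Because depth and normals share one hidden state, each step's normal update is conditioned on the current depth estimate, which is the coupling the joint optimization relies on.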
The multi-task loss comprises:
$$L = w_d L_d + w_n L_n + w_{dn} L_{dn}$$
with scalar weights $w_d$, $w_n$, and $w_{dn}$. $L_d$ provides depth supervision (scale-invariant log loss, virtual normal loss, pair-wise planar consistency, and RPNL), $L_n$ is an aleatoric-uncertainty-aware angular loss for normals (applied only when ground truth exists), and $L_{dn}$ enforces self-supervised consistency between depth-derived and predicted normals (Hu et al., 2024).
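To make the consistency term concrete, the sketch below derives normals from a depth map by back-projection and finite differences, compares them with predicted normals via one minus cosine similarity, and combines the three terms in a weighted sum. The finite-difference scheme, the cosine form of the penalty, the function names, and the default weights are all assumptions made for illustration. The paper's exact formulation may differ.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(v):
    norm = math.sqrt(_dot(v, v))
    return tuple(x / norm for x in v) if norm > 0 else tuple(v)


def normals_from_depth(depth, fx, fy, cx, cy):
    """Finite-difference normals on the (H-1) x (W-1) grid, facing the camera."""
    pts = [[((u - cx) * z / fx, (v - cy) * z / fy, z) for u, z in enumerate(row)]
           for v, row in enumerate(depth)]
    normals = []
    for v in range(len(pts) - 1):
        row = []
        for u in range(len(pts[v]) - 1):
            p = pts[v][u]
            n = _unit(_cross(_sub(pts[v][u + 1], p), _sub(pts[v + 1][u], p)))
            if _dot(n, p) > 0:  # flip so the normal points toward the camera
                n = tuple(-x for x in n)
            row.append(n)
        normals.append(row)
    return normals


def consistency_loss(pred_normals, depth_normals):
    """Mean (1 - cosine similarity) between two normal maps of equal shape."""
    terms = [1.0 - _dot(_unit(a), _unit(b))
             for row_p, row_d in zip(pred_normals, depth_normals)
             for a, b in zip(row_p, row_d)]
    return sum(terms) / len(terms)


def total_loss(l_depth, l_normal, l_consistency, w_d=1.0, w_n=1.0, w_dn=1.0):
    """Weighted sum; pass l_normal=None when no normal ground truth exists."""
    loss = w_d * l_depth + w_dn * l_consistency
    if l_normal is not None:
        loss += w_n * l_normal
    return loss
```

The consistency term needs no normal labels, so it can supervise the normal branch on depth-only datasets, which is how abundant depth data compensates for scarce normal annotations.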
4. Training Regimen and Evaluation Protocols
Training utilizes over 16M RGB images from 18 datasets spanning diverse scene types and thousands of cameras. Normal labels are sourced from subsets including ScanNet, Matterport3D, Taskonomy, Replica, and Hypersim. Batch composition is mixed to balance dataset influence, with a batch size of 192 distributed over 48 A100 GPUs. The AdamW optimizer is used with polynomial learning-rate decay for 800k iterations. Augmentations include horizontal flip and random crop.
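A polynomial ("poly") decay schedule of the kind mentioned above is commonly written as a power of the remaining training fraction. The sketch below shows that common form. The iteration count and the batch/GPU split come from the text, while the base learning rate and the decay power are hypothetical placeholders, since the actual settings are not given here.

```python
def poly_lr(step, total_steps, base_lr, power=0.9, end_lr=0.0):
    """Polynomial decay from base_lr at step 0 to end_lr at total_steps."""
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    decay = (1.0 - step / total_steps) ** power
    return (base_lr - end_lr) * decay + end_lr


TOTAL_STEPS = 800_000      # from the text
PER_GPU_BATCH = 192 // 48  # batch size 192 spread over 48 GPUs, from the text
BASE_LR = 1e-4             # hypothetical placeholder, not the paper's value

schedule = [poly_lr(s, TOTAL_STEPS, BASE_LR) for s in (0, 400_000, 800_000)]
```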
Zero-shot evaluation is performed on benchmarks such as NYUv2, KITTI, iBims-1, DIODE, ETH3D, NuScenes, and ScanNet—each with previously unseen images and intrinsics. Metrics include:
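- Depth: AbsRel ($\frac{1}{N}\sum_i |\hat{d}_i - d_i^*| / d_i^*$), RMSE, RMSE_log, and $\delta_1$ (fraction of pixels with $\max(\hat{d}/d^*, d^*/\hat{d}) < 1.25$).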
- Normals: Median angular error, fraction under angular thresholds (11.25°, 22.5°, 30°).
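These metrics follow standard definitions and can be computed as in the sketch below, which operates on flat lists of per-pixel values. The function names and the rule of skipping non-positive depths are assumptions of this example. Benchmark-specific details such as valid-depth ranges or crops are not modeled.

```python
import math
from statistics import median


def depth_metrics(pred, gt, delta_threshold=1.25):
    """AbsRel, RMSE, RMSE_log and delta_1 over pixels with positive depths."""
    pairs = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not pairs:
        raise ValueError("no valid pixels")
    n = len(pairs)
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n)
    delta1 = sum(max(p / g, g / p) < delta_threshold for p, g in pairs) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "rmse_log": rmse_log, "delta1": delta1}


def normal_metrics(pred, gt, thresholds_deg=(11.25, 22.5, 30.0)):
    """Median angular error (degrees) and fraction of pixels under each threshold.

    Vectors need not be unit length but must be non-zero.
    """
    errors = []
    for a, b in zip(pred, gt):
        norm_a = math.sqrt(sum(x * x for x in a))
        norm_b = math.sqrt(sum(x * x for x in b))
        cos = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
        errors.append(math.degrees(math.acos(max(-1.0, min(1.0, cos)))))
    within = {t: sum(e < t for e in errors) / len(errors) for t in thresholds_deg}
    return {"median_deg": median(errors), "within": within}
```

In unaligned zero-shot evaluation, predictions are passed to these functions directly, without fitting a per-image scale or shift to the ground truth first.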
State-of-the-art zero-shot depth results are reported on NYUv2 and KITTI, along with superior normal metrics on iBims-1 and ScanNet. Metric3Dv2 outperforms affine-invariant depth models (MiDaS, LeReS, DPT) in unaligned zero-shot evaluation (Hu et al., 2024).
5. Foundation Model Integration in Downstream Systems
Metric3Dv2 has been used as a frozen depth estimator in Bird’s Eye View (BEV) perception architectures, such as Lift-Splat-Shoot (LSS) and Simple-BEV, augmenting image-based pipelines with metric scene geometry (Hayes et al., 14 Jan 2025). In LSS, Metric3Dv2-generated depth maps are quantized into 41-bin depth distributions at coarse patch granularity and replace the conventional learned depth head. In Simple-BEV, per-pixel depths are backprojected using each camera’s intrinsics to produce PseudoLiDAR point clouds, which are fed into occupancy grids analogous to real LiDAR.
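Both integrations reduce to simple geometry on the predicted depth map, sketched below. The 41-bin count is from the text. The 4 m to 45 m range with uniform bins, the median pooling per patch, the one-hot encoding, and all function names are assumptions made for this example rather than details confirmed by the cited work.

```python
from statistics import median


def patch_depths(depth, patch):
    """Median depth of each non-overlapping patch x patch block."""
    h, w = len(depth), len(depth[0])
    return [[median(depth[y][x]
                    for y in range(y0, min(y0 + patch, h))
                    for x in range(x0, min(x0 + patch, w)))
             for x0 in range(0, w, patch)]
            for y0 in range(0, h, patch)]


def depth_to_one_hot(depth_m, d_min=4.0, d_max=45.0, num_bins=41):
    """One-hot distribution over uniform depth bins; out-of-range depths are clipped."""
    step = (d_max - d_min) / num_bins
    idx = int((depth_m - d_min) // step)
    idx = max(0, min(num_bins - 1, idx))
    return [1.0 if i == idx else 0.0 for i in range(num_bins)]


def pseudo_lidar(depth, fx, fy, cx, cy):
    """Back-project every pixel with positive depth to a camera-frame 3D point."""
    return [((u - cx) * z / fx, (v - cy) * z / fy, z)
            for v, row in enumerate(depth)
            for u, z in enumerate(row)
            if z > 0]


# Usage on a tiny hypothetical 2 x 4 depth map (metres).
coarse = patch_depths([[5.0, 5.0, 9.0, 9.0], [5.0, 5.0, 9.0, 9.0]], patch=2)
distributions = [[depth_to_one_hot(d) for d in row] for row in coarse]
points = pseudo_lidar([[10.0, 0.0]], fx=100.0, fy=100.0, cx=0.0, cy=0.0)
```

Transforming the resulting points from each camera frame into the vehicle frame requires the camera extrinsics, which this sketch leaves out.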
Empirical results show substantial improvement: integrating Metric3Dv2 with DINOv2 in LSS yields a 7.5-point IoU gain over baseline EfficientNet heads (40.5 vs 33.0 IoU) on full data, and models trained on only half the data achieve higher IoU than the baseline trained on the full set. In Simple-BEV, PseudoLiDAR from Metric3Dv2 gives an absolute IoU improvement over camera-only inputs. Depth quality, rather than backbone feature size, is the dominant improvement factor. Ablations reveal diminishing returns with increased depth-map resolution, indicating the informative value of Metric3Dv2 outputs even at reduced spatial scales (Hayes et al., 14 Jan 2025).
6. Ablation Studies and Module Analyses
Comprehensive ablation studies demonstrate that:
- Absence of the CSTM results in failures to learn metric depth on mixed real-world data.
- CamConvs (intrinsics as input channels) are inferior to CSTM in metric generalization.
- Both CSTM_label and CSTM_image yield stable training, with only a minor difference in absolute error.
- RPNL incorporated into the loss function achieves a further AbsRel reduction over baseline depth losses.
- Joint depth-normal training: removing the depth branch causes normal estimation to collapse (flat predictions), while removing the normal branch only modestly degrades depth accuracy. Not using mixed depth datasets or omitting consistency/GRU-based fusion significantly harms normal estimation.
- Simpler intermediate (unnormalized 3D vector) normal representations outperform more structured encodings.
- Optimal number of recurrent GRU steps depends on model scale (T=4 for ViT-S, T=8 for ViT-L/G).
These results substantiate the necessity of both canonicalization and joint iterative optimization for robust metric prediction and generalization (Hu et al., 2024).
7. Applications, Limitations, and Future Directions
Metric3Dv2 enables several applied scenarios:
- Single-image metric 3D reconstruction: Reconstructions from NYUv2 scenes show lower Chamfer distance and higher F-score than affine or multi-view stereo baselines.
- Monocular SLAM: Integration into Droid-SLAM (on KITTI and ETH3D) reduces scale drift and translational drift, supporting metrically accurate mapping.
- In-the-wild metrology: With a known EXIF focal length or intrinsics, objects in arbitrary images can be measured with low error.
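The metrology use case can be sketched as follows: convert the EXIF focal length to pixels, back-project two pixels using their predicted metric depths, and take the Euclidean distance between the resulting 3D points. The sketch assumes square pixels, a known sensor width, and a known principal point. The function names and the numbers in the usage line are illustrative.

```python
import math


def focal_px_from_exif(focal_mm, sensor_width_mm, image_width_px):
    """Convert a focal length in millimetres to pixels (square pixels assumed)."""
    return focal_mm * image_width_px / sensor_width_mm


def pixel_to_point(u, v, depth_m, f_px, cx, cy):
    """Back-project one pixel with metric depth into camera coordinates."""
    return ((u - cx) * depth_m / f_px, (v - cy) * depth_m / f_px, depth_m)


def measure(pix_a, pix_b, depth_a, depth_b, f_px, cx, cy):
    """Metric distance between two image points, given their depths."""
    point_a = pixel_to_point(*pix_a, depth_a, f_px, cx, cy)
    point_b = pixel_to_point(*pix_b, depth_b, f_px, cx, cy)
    return math.dist(point_a, point_b)


# Two pixels 500 px apart, both at 2 m depth, seen with a 1000 px focal length.
width_m = measure((0, 0), (500, 0), 2.0, 2.0, f_px=1000.0, cx=0.0, cy=0.0)
```

Any error in the focal length propagates linearly into the lateral components of the measurement, which is why known intrinsics matter for this application.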
Limitations include the requirement for a known per-image focal length (critical for CSTM), the absence of explicit handling for fisheye or extreme wide-angle distortion, and reliance on depth annotations for large-scale generalization. Open directions are blind focal estimation, end-to-end SLAM integration, improved lighting/reflectance modeling for normals, and scaling the training data further (Hu et al., 2024).
In summary, Metric3Dv2 establishes a foundation for monocular, zero-shot metric depth and surface normal estimation, facilitated by an explicit camera-space canonicalization and joint optimization architecture. Its utility spans diverse computer vision and robotics applications, and it sets new performance benchmarks across a broad range of evaluation protocols (Hu et al., 2024, Hayes et al., 14 Jan 2025).