Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image (2307.10984v1)

Published 20 Jul 2023 in cs.CV and cs.AI

Abstract: Reconstructing accurate 3D scenes from images is a long-standing vision task. Due to the ill-posedness of the single-image reconstruction problem, most well-established methods are built upon multi-view geometry. State-of-the-art (SOTA) monocular metric depth estimation methods can only handle a single camera model and are unable to perform mixed-data training due to the metric ambiguity. Meanwhile, SOTA monocular methods trained on large mixed datasets achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metrics. In this work, we show that the key to a zero-shot single-view metric depth model lies in the combination of large-scale data training and resolving the metric ambiguity from various camera models. We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problems and can be effortlessly plugged into existing monocular models. Equipped with our module, monocular models can be stably trained with over 8 million images with thousands of camera models, resulting in zero-shot generalization to in-the-wild images with unseen camera settings. Experiments demonstrate SOTA performance of our method on 7 zero-shot benchmarks. Notably, our method won the championship in the 2nd Monocular Depth Estimation Challenge. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. The potential benefits extend to downstream tasks, which can be significantly improved by simply plugging in our model. For example, our model relieves the scale drift issues of monocular-SLAM (Fig. 1), leading to high-quality metric scale dense mapping. The code is available at https://github.com/YvanYin/Metric3D.

Citations (95)

Summary

  • The paper introduces a canonical camera transformation to resolve metric ambiguity across diverse camera models.
  • It leverages over 8 million images from 11 datasets to achieve robust zero-shot generalization for 3D depth prediction.
  • The novel Random Proposal Normalization Loss enhances local depth accuracy, yielding state-of-the-art performance on seven benchmarks.

Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image

The paper "Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image" addresses the challenge of reconstructing metric 3D scenes from single monocular images, a task traditionally hindered by its ill-posed nature. Existing monocular metric depth estimation methods fail to generalize across diverse camera models and datasets because of metric ambiguity. This work mitigates these limitations through a canonical camera transformation technique, enabling robust zero-shot generalization even in unstructured, in-the-wild scenarios.
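The metric ambiguity can be made concrete with the pinhole projection model: scaling depth and focal length by the same factor leaves pixel coordinates unchanged, so a model that sees only pixels cannot recover metric depth without knowing the camera intrinsics. A minimal illustration (the specific numbers are arbitrary):

```python
def project(X, Y, Z, f):
    """Pinhole projection of a 3D point (X, Y, Z) with focal length f
    in pixels, principal point at the image origin."""
    return (f * X / Z, f * Y / Z)

# The same 3D point at depth 2 m seen with f = 500 px lands on exactly
# the same pixel as that point pushed to depth 4 m seen with f = 1000 px:
u1 = project(1.0, 0.5, 2.0, 500.0)
u2 = project(1.0, 0.5, 4.0, 1000.0)
assert u1 == u2  # indistinguishable from the image alone
```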

Key Contributions

  1. Canonical Camera Transformation: The authors introduce a canonical camera space transformation module that resolves metric ambiguity across various camera models. This method transforms input images or depth labels to a canonical camera space during training, thus allowing the model to harness large-scale datasets with diverse camera intrinsic parameters.
  2. Large-Scale Data Utilization: By training on over 8 million images from 11 datasets, encompassing thousands of camera models, the model achieves unprecedented zero-shot generalization capabilities. This extensive data amalgamation equips the model to handle unseen camera settings in real-world images effectively.
  3. Random Proposal Normalization Loss (RPNL): The authors enhance depth accuracy with a novel loss that amplifies local geometry by normalizing subsets of the image. This approach builds on existing scale-shift invariant losses but focuses on preserving local depth details.
  4. Numerical Results: Experiments demonstrate the model's state-of-the-art performance on seven zero-shot benchmarks. The method notably won the championship in the 2nd Monocular Depth Estimation Challenge, underscoring its robustness and efficacy.
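The canonical camera transformation described above can be sketched in its label-side form: ground-truth depths are rescaled so every (image, depth) pair looks as if it were captured by one canonical camera, and the prediction is rescaled back at inference. This is a simplified sketch, not the authors' implementation; the canonical focal length value is a hypothetical placeholder.

```python
import numpy as np

CANONICAL_FOCAL = 1000.0  # hypothetical canonical focal length (pixels)

def to_canonical_depth(depth, focal):
    """Rescale metric depth so the (image, depth) pair appears to come
    from the canonical camera: depth and focal length trade off linearly
    for a fixed image size."""
    return depth * (CANONICAL_FOCAL / focal)

def to_metric_depth(pred_canonical, focal):
    """Undo the transform at inference: map the network's canonical-space
    prediction back to the real camera's metric scale."""
    return pred_canonical * (focal / CANONICAL_FOCAL)
```

Because the transform is a simple scalar rescaling per camera, it can be plugged in front of any existing monocular depth model without architectural changes, which is what makes mixed-dataset training with thousands of intrinsics tractable.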
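The Random Proposal Normalization Loss can likewise be sketched: sample random crops, normalize predicted and ground-truth depth within each crop using a median/mean-absolute-deviation scheme (as in scale-shift-invariant losses), and penalize their L1 difference. This is a rough sketch under assumptions; the crop-size fractions and number of proposals here are illustrative, not the paper's settings.

```python
import numpy as np

def ssi_normalize(x, mask):
    """Scale-shift-invariant normalization: subtract the median, divide
    by the mean absolute deviation of the valid pixels."""
    vals = x[mask]
    t = np.median(vals)
    s = np.mean(np.abs(vals - t)) + 1e-6
    return (x - t) / s

def rpnl(pred, gt, num_proposals=4, min_frac=0.125, max_frac=0.5, rng=None):
    """Random Proposal Normalization Loss (sketch): normalize pred and gt
    inside random crops and average the L1 differences, so local depth
    structure is penalized independently of global scale and shift."""
    if rng is None:
        rng = np.random.default_rng()
    H, W = gt.shape
    total = 0.0
    for _ in range(num_proposals):
        h = int(H * rng.uniform(min_frac, max_frac))
        w = int(W * rng.uniform(min_frac, max_frac))
        y = rng.integers(0, H - h + 1)
        x = rng.integers(0, W - w + 1)
        p, g = pred[y:y+h, x:x+w], gt[y:y+h, x:x+w]
        mask = g > 0  # pixels with valid ground-truth depth
        if mask.sum() == 0:
            continue
        diff = ssi_normalize(p, mask)[mask] - ssi_normalize(g, mask)[mask]
        total += np.mean(np.abs(diff))
    return total / num_proposals
```

Normalizing per crop rather than once per image is what amplifies local geometry: a global normalization lets large-scale structure dominate, while small crops force the loss to attend to fine depth detail.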

Implications and Future Directions

The implications of the proposed method are substantial, both practically and theoretically. Practically, the model's ability to recover metric depth from internet-sourced images enables applications such as single-image metrology and improves dense SLAM systems by mitigating scale drift. Theoretically, this work advances the longstanding challenge of transferring monocular depth models across varying camera settings without per-camera tuning or calibration.

Future developments in AI could explore further optimization of the transformation processes and integration with other modalities for richer scene understanding. Additionally, extending this approach to dynamic environments and integrating real-time capabilities could push the boundaries of applications in robotics and autonomous vehicles.

In summary, the paper presents a robust solution to mitigating metric ambiguity in monocular 3D prediction, setting a foundation for new research avenues in the field of computer vision and depth estimation from single images.