
MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision (2410.19115v3)

Published 24 Oct 2024 in cs.CV

Abstract: We present MoGe, a powerful model for recovering 3D geometry from monocular open-domain images. Given a single image, our model directly predicts a 3D point map of the captured scene with an affine-invariant representation, which is agnostic to true global scale and shift. This new representation precludes ambiguous supervision in training and facilitates effective geometry learning. Furthermore, we propose a set of novel global and local geometry supervisions that empower the model to learn high-quality geometry. These include a robust, optimal, and efficient point cloud alignment solver for accurate global shape learning, and a multi-scale local geometry loss promoting precise local geometry supervision. We train our model on a large, mixed dataset and demonstrate its strong generalizability and high accuracy. In our comprehensive evaluation on diverse unseen datasets, our model significantly outperforms state-of-the-art methods across all tasks, including monocular estimation of 3D point map, depth map, and camera field of view. Code and models can be found on our project page.

References (75)
  1. Building Rome in a day. Communications of the ACM, 54(10):105–112, 2011.
  2. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2017.
  3. ARKitScenes – A diverse real-world dataset for 3D indoor scene understanding using mobile RGB-D data. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.
  4. Adabins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4009–4018, 2021.
  5. Zoedepth: Zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288, 2023.
  6. MiDaS v3.1 – A model zoo for robust monocular relative depth estimation. arXiv preprint arXiv:2307.14460, 2023.
  7. Deepcalib: A deep learning approach for automatic intrinsic calibration of wide field-of-view cameras. In Proceedings of the 15th ACM SIGGRAPH European Conference on Visual Media Production, pages 1–10, 2018.
  8. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.
  9. A naturalistic open source movie for optical flow evaluation. In European Conf. on Computer Vision (ECCV), pages 611–625. Springer-Verlag, 2012.
  10. Single-image depth perception in the wild. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2016.
  11. Oasis: A large-scale dataset for single image 3d in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  12. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13142–13153, 2023.
  13. Automatic camera calibration from a single manhattan image. In Computer Vision—ECCV 2002: 7th European Conference on Computer Vision Copenhagen, Denmark, May 28–31, 2002 Proceedings, Part IV 7, pages 175–188. Springer, 2002.
  14. Digital Image Media Laboratory (DIML) and Computer Vision Laboratory (CVL). Diml/cvl rgb-d dataset: 2m rgb-d images of natural indoor and outdoor scenes. https://dimlrgbd.github.io/downloads/technical_report.pdf.
  15. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
  16. Google scanned objects: A high-quality dataset of 3d scanned household items, 2022.
  17. Omnidata: A scalable pipeline for making multi-task mid-level vision datasets from 3d scans. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10786–10796, 2021.
  18. Depth map prediction from a single image using a multi-scale deep network. Advances in neural information processing systems, 27, 2014.
  19. Mid-air: A multi-modal dataset for extremely low altitude drone flights. In Conference on Computer Vision and Pattern Recognition Workshop (CVPRW), 2019.
  20. Geowizard: Unleashing the diffusion priors for 3d geometry estimation from a single image. arXiv preprint arXiv:2403.12013, 2024.
  21. A2D2: Audi Autonomous Driving Dataset. 2020.
  22. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3828–3838, 2019.
  23. Depthfm: Fast monocular depth estimation with flow matching. arXiv preprint arXiv:2403.13788, 2024.
  24. 3d packing for self-supervised monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  25. All for one, and one for all: Urbansyn dataset, the third musketeer of synthetic driving scenes, 2023.
  26. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. arXiv preprint arXiv:2404.15506, 2024.
  27. Deepmvs: Learning multi-view stereopsis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  28. Perspective fields for single image camera calibration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17307–17316, 2023.
  29. Repurposing diffusion-based image generators for monocular depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9492–9502, 2024.
  30. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026, 2023.
  31. Comparison of monocular depth estimation methods using geometrically relevant metrics on the ibims-1 dataset. Computer Vision and Image Understanding (CVIU), 191:102877, 2020.
  32. Ctrl-c: Camera calibration transformer with line-classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16228–16237, 2021.
  33. Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3205–3215, 2023.
  34. Megadepth: Learning single-view depth prediction from internet photos. In Computer Vision and Pattern Recognition (CVPR), 2018.
  35. Binsformer: Revisiting adaptive bins for monocular depth estimation. IEEE Transactions on Image Processing, 2024.
  36. Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  37. Jorge J Moré. The levenberg-marquardt algorithm: implementation and theory. In Numerical analysis: proceedings of the biennial Conference held at Dundee, June 28–July 1, 1977, pages 105–116. Springer, 2006.
  38. Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. In ECCV, 2012.
  39. 3d ken burns effect from a single image. ACM Transactions on Graphics, 38(6):184:1–184:15, 2019.
  40. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  41. Unidepth: Universal monocular metric depth estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10106–10116, 2024.
  42. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE transactions on pattern analysis and machine intelligence, 44(3):1623–1637, 2020.
  43. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021.
  44. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In International Conference on Computer Vision (ICCV) 2021, 2021.
  45. High-resolution image synthesis with latent diffusion models, 2021.
  46. The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  47. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016.
  48. Pixelwise view selection for unstructured multi-view stereo. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part III 14, pages 501–518. Springer, 2016.
  49. BAD SLAM: Bundle adjusted direct RGB-D SLAM. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  50. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  51. Sparsity invariant cnns. In International Conference on 3D Vision (3DV), 2017.
  52. DIODE: A Dense Indoor and Outdoor DEpth Dataset. CoRR, abs/1908.00463, 2019.
  53. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020.
  54. Harnessing diffusion models for visual perception with meta prompts. arXiv preprint arXiv:2312.14733, 2023.
  55. Flow-motion and depth network for monocular stereo and beyond. CoRR, abs/1909.05452, 2019.
  56. IRS: A large synthetic indoor robotics stereo dataset for disparity and surface normal estimation. CoRR, abs/1912.09678, 2019.
  57. DUSt3R: Geometric 3D vision made easy. In CVPR, 2024.
  58. Tartanair: A dataset to push the limits of visual slam. 2020.
  59. Camera calibration and 3d reconstruction from single images using parallelepipeds. In IEEE International Conference on Computer Vision, pages 142–148. IEEE, 2001.
  60. Argoverse 2: Next generation datasets for self-driving perception and forecasting. In Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS Datasets and Benchmarks 2021), 2021.
  61. Deepfocal: A method for direct focal length estimation. In 2015 IEEE International Conference on Image Processing (ICIP), pages 1369–1373. IEEE, 2015.
  62. Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in neural information processing systems, 34:12077–12090, 2021.
  63. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017.
  64. Cost volume pyramid based depth inference for multi-view stereo. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4877–4886, 2020.
  65. Depth anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10371–10381, 2024a.
  66. Depth anything v2. arXiv preprint arXiv:2406.09414, 2024b.
  67. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. Computer Vision and Pattern Recognition (CVPR), 2020.
  68. Diversedepth: Affine-invariant depth prediction using diverse data. arXiv preprint arXiv:2002.00569, 2020a.
  69. Learning to recover 3d scene shape from a single image. CoRR, abs/2012.09365, 2020b.
  70. Towards accurate reconstruction of 3d scene shape from a single monocular image. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(5):6480–6494, 2022.
  71. Metric3d: Towards zero-shot metric 3d prediction from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9043–9053, 2023.
  72. Taskonomy: Disentangling task transfer learning. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.
  73. Zhengyou Zhang. A flexible new technique for camera calibration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(11):1330–1334, 2000.
  74. Structured3d: A large photo-realistic dataset for structured 3d modeling. In Proceedings of The European Conference on Computer Vision (ECCV), 2020.
  75. Tame a wild camera: in-the-wild monocular camera calibration. Advances in Neural Information Processing Systems, 36, 2024.

Summary

  • The paper presents a novel direct geometry estimation method using affine-invariant point maps that eliminate focal-distance ambiguities in single-image depth recovery.
  • It employs a robust training strategy with global ROE alignment and multi-scale local supervision, achieving a 35% reduction in estimation errors across open-domain images.
  • The approach broadens practical applications, paving the way for advances in 3D-aware image editing, depth-to-image synthesis, and comprehensive scene understanding.

Monocular Geometry Estimation with MoGe

The paper "MoGe: Unlocking Accurate Monocular Geometry Estimation for Open-Domain Images with Optimal Training Supervision" presents a novel approach to 3D geometry recovery from single images, addressing a critical area in computer vision. The method, MoGe, distinguishes itself by predicting an affine-invariant 3D point map, a representation that is key to overcoming the inherent ambiguities of monocular estimation tasks.

Core Contribution

MoGe introduces a direct geometry estimation method built on affine-invariant point maps. Unlike previous models such as DUSt3R, which use scale-invariant representations designed for multi-view scenarios, MoGe removes the focal-distance ambiguity: an error in the assumed focal length can be traded against an error in distance, so scale-invariant supervision still forces the network to commit to a focal length, whereas allowing an additional global shift eliminates this ambiguity from the training signal. This representation is paired with a set of new training supervisions that strengthen geometry learning.
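Concretely, the prediction is supervised only up to an unknown global scale and shift. A minimal sketch of the alignment objective, in notation of our own choosing (the paper may restrict the shift, e.g., to the camera's z-axis, and solves the problem with the ROE solver described below):

$$\min_{s>0,\;\mathbf{t}}\ \sum_{i} \rho\!\left(s\,\hat{\mathbf{p}}_i + \mathbf{t} - \mathbf{p}_i\right)$$

where $\hat{\mathbf{p}}_i$ are predicted 3D points, $\mathbf{p}_i$ the ground truth, and $\rho$ a robust penalty such as the L1 norm. Because the optimal $(s, \mathbf{t})$ are recomputed per training sample, the loss never penalizes the network for quantities a single image cannot determine.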

Methodology

The model architecture is straightforward: it maps an image directly to a 3D point map, from which depth maps and camera parameters such as the field of view can be derived. The use of affine-invariant point maps keeps the representation free of global scale and shift ambiguities, facilitating robust training.
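To make this derivation concrete, below is a minimal NumPy sketch, our own construction rather than the released code, of how a depth map and field of view can be recovered from an affine-invariant point map by fitting a pinhole projection. The function name and the plain least-squares formulation are assumptions; the paper uses a more careful recovery procedure.

```python
import numpy as np

def recover_focal_and_shift(points):
    """Recover a focal length (in pixels) and a global z-shift from an
    affine-invariant point map, assuming a pinhole camera at the origin
    looking down +z, the principal point at the image center, and the
    point map's x/y axes aligned with the image axes. A least-squares
    sketch, not the paper's exact solver.

    points: (H, W, 3) array of predicted (x, y, z) per pixel.
    """
    H, W, _ = points.shape
    # Pixel coordinates relative to the principal point.
    u, v = np.meshgrid(np.arange(W) - (W - 1) / 2,
                       np.arange(H) - (H - 1) / 2)
    x, y, z = points[..., 0], points[..., 1], points[..., 2]

    # Pinhole projection u = f * x / (z + t) linearizes to
    #   u * t - x * f = -u * z   (and likewise for v with y),
    # which is linear in the unknowns (t, f).
    A = np.stack([np.concatenate([u.ravel(), v.ravel()]),
                  -np.concatenate([x.ravel(), y.ravel()])], axis=1)
    b = -np.concatenate([(u * z).ravel(), (v * z).ravel()])
    (t, f), *_ = np.linalg.lstsq(A, b, rcond=None)

    depth = z + t  # depth map, still defined up to the global scale
    fov_x = 2 * np.degrees(np.arctan(W / (2 * f)))
    return f, t, depth, fov_x
```

Once the shift is resolved this way, the point map, depth map, and intrinsics are mutually consistent by construction.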

Key elements include:

  1. Global and Local Supervision:
    • A robust, optimal, and efficient (ROE) alignment solver computes the point cloud alignment used for global shape learning (see the alignment sketch after this list).
    • A multi-scale local geometry loss addresses local geometric precision by employing independent affine alignments.
  2. Training on Large-Scale Data: The model is trained on a diverse dataset corpus, demonstrating strong generalization abilities across open-domain images.
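As referenced in item 1, here is a simple closed-form stand-in for the alignment step; it is ours, not the paper's ROE solver, which optimizes a robust (truncated) objective that resists outliers in noisy ground-truth geometry. This sketch instead solves the plain L2 version for a global scale and shift.

```python
import numpy as np

def affine_align(pred, gt, weights=None):
    """Closed-form least-squares alignment of predicted points to
    ground truth under an unknown global scale s and shift t:
        min_{s, t}  sum_i w_i * || s * pred_i + t - gt_i ||^2
    A simple stand-in for the paper's robust, optimal, and efficient
    (ROE) solver, which uses a robust objective instead of plain L2.

    pred, gt: (N, 3) arrays; weights: optional (N,) array.
    """
    w = np.ones(len(pred)) if weights is None else np.asarray(weights, float)
    w = w / w.sum()
    mu_p = (w[:, None] * pred).sum(axis=0)  # weighted centroids
    mu_g = (w[:, None] * gt).sum(axis=0)
    dp, dg = pred - mu_p, gt - mu_g
    # Optimal scale from the normal equations, then the optimal shift.
    s = (w * (dp * dg).sum(axis=1)).sum() / (w * (dp * dp).sum(axis=1)).sum()
    t = mu_g - s * mu_p
    return s, t  # the aligned prediction for the loss is s * pred + t
```

The multi-scale local geometry loss applies the same idea independently within local windows at several scales, so fine surface detail is supervised without being drowned out by residual global misalignment.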

Numerical Results

The paper reports that MoGe outperforms existing methods across several benchmarks with substantial error reductions: roughly 35% lower error in monocular geometry estimation tasks and more than 20% lower error in camera field-of-view prediction compared to the best previous approaches.

Implications

MoGe's contributions have far-reaching implications. By providing a reliable method for monocular geometry estimation, it paves the way for advancements in 3D-aware image editing, depth-to-image synthesis, and 3D scene understanding. Furthermore, it serves as a potent foundation model for further research in both video-based and multi-view 3D reconstruction.

Theoretical and Practical Impact

Theoretically, the affine-invariant representation and optimal supervision strategies provide a principled way to remove the scale and shift ambiguities inherent in monocular tasks. Practically, the model's strong zero-shot performance across diverse datasets suggests it can be deployed in a variety of applications, enhancing monocular vision systems without specialized training data or calibration procedures.

Future Directions

Looking ahead, integrating MoGe with other modalities, such as semantic segmentation and object recognition, could offer more comprehensive scene understanding capabilities. Moreover, expanding the model’s application to real-time systems might revolutionize fields requiring instantaneous 3D interpretation from single-view inputs.

In conclusion, MoGe represents a significant step forward in 3D geometry estimation, balancing innovation in training supervision with robust performance metrics. As the code and models are made available to the research community, they will likely spur further investigation and development in this vital area of computer vision.
