UniDepth: Universal Monocular Metric Depth Estimation (2403.18913v1)

Published 27 Mar 2024 in cs.CV

Abstract: Accurate monocular metric depth estimation (MMDE) is crucial to solving downstream tasks in 3D perception and modeling. However, the remarkable accuracy of recent MMDE methods is confined to their training domains. These methods fail to generalize to unseen domains even in the presence of moderate domain gaps, which hinders their practical applicability. We propose a new model, UniDepth, capable of reconstructing metric 3D scenes from solely single images across domains. Departing from the existing MMDE methods, UniDepth directly predicts metric 3D points from the input image at inference time without any additional information, striving for a universal and flexible MMDE solution. In particular, UniDepth implements a self-promptable camera module predicting dense camera representation to condition depth features. Our model exploits a pseudo-spherical output representation, which disentangles camera and depth representations. In addition, we propose a geometric invariance loss that promotes the invariance of camera-prompted depth features. Thorough evaluations on ten datasets in a zero-shot regime consistently demonstrate the superior performance of UniDepth, even when compared with methods directly trained on the testing domains. Code and models are available at: https://github.com/lpiccinelli-eth/unidepth


Summary

  • The paper introduces a calibration-free monocular depth estimation model that achieves robust zero-shot performance on ten diverse datasets.
  • It leverages a self-promptable camera module and a pseudo-spherical output representation to disentangle camera and depth representations.
  • The approach employs a geometric invariance loss to keep camera-prompted depth features consistent, achieving leading performance on the official KITTI depth prediction benchmark.

Universal Monocular Metric Depth Estimation with UniDepth

The paper "UniDepth: Universal Monocular Metric Depth Estimation" proposes an approach to monocular metric depth estimation (MMDE) that aims to overcome existing limits on cross-domain generalization. This overview examines its contributions, methodology, and broader implications.

Existing MMDE methods typically struggle with domain generalization, primarily because they are trained and validated in environments with consistent camera parameters and scene characteristics. Transferring these models to new environments therefore often results in significant performance degradation due to unseen domain characteristics or different camera settings. UniDepth addresses these challenges with a model that delivers robust zero-shot performance across multiple domains without requiring calibration information at inference time.

Methodology and Innovations

UniDepth is built around a framework that aims to predict metric 3D points from single-view images, leveraging innovative architectural elements to achieve universality and adaptability:

  1. Self-Promptable Camera Module: A central component of UniDepth is its camera module, which uses a self-prompting mechanism to predict a dense camera representation from the input image alone. This lets the model adapt dynamically to different camera optics and scene compositions, broadening its applicability across varied data domains without requiring camera parameters; a per-pixel angular map of this kind appears in the first sketch after this list.
  2. Pseudo-Spherical Output Representation: By employing a pseudo-spherical output space parameterized by azimuth angle, elevation angle, and depth, UniDepth decomposes image-based depth estimation into separate camera and depth factors. This representation avoids the intertwined gradients that arise with a traditional Cartesian parameterization, allowing camera and depth parameters to be optimized more independently (the first sketch after this list shows how such an output lifts to metric 3D points).
  3. Geometric Invariance Loss: The paper introduces a geometric invariance loss that keeps camera-conditioned depth features consistent across different views of the same scene. The loss enforces feature consistency under geometric augmentations, pushing depth features to be invariant to the particular camera setup, which is critical for robust depth estimation; a minimal version of such a consistency loss is given in the second sketch after this list.
  4. Universal Zero-Shot Performance: In evaluations across ten diverse datasets, UniDepth outperforms existing MMDE methods even when those methods are trained directly on the target datasets. Notably, UniDepth achieves leading performance on the official KITTI Depth Prediction Benchmark, underscoring its practical relevance on competitive real-world benchmarks.
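
The interplay between the dense camera representation (item 1) and the pseudo-spherical output (item 2) can be illustrated with a minimal NumPy sketch. It assumes a simple pinhole camera with known intrinsics, whereas UniDepth predicts the dense angular representation from the image itself; the function names and the exact angle and depth conventions below are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def rays_from_intrinsics(fx, fy, cx, cy, h, w):
    """Per-pixel azimuth/elevation angles for a pinhole camera.

    Plays the role of a dense angular camera representation; in UniDepth
    this map is predicted from the image itself rather than derived from
    known intrinsics as done here.
    """
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx                       # normalized image coordinates
    y = (v - cy) / fy
    azimuth = np.arctan2(x, 1.0)            # angle in the horizontal plane
    elevation = np.arctan2(y, np.sqrt(x**2 + 1.0))
    return azimuth, elevation

def pseudo_spherical_to_points(azimuth, elevation, depth):
    """Lift a pseudo-spherical prediction (azimuth, elevation, depth)
    to metric 3D points.

    'depth' is treated here as distance along the viewing ray; whether
    the model predicts ray distance or z-depth (log- or linear-scale)
    is an assumption of this sketch.
    """
    x = np.sin(azimuth) * np.cos(elevation)
    y = np.sin(elevation)
    z = np.cos(azimuth) * np.cos(elevation)
    dirs = np.stack([x, y, z], axis=-1)     # unit ray directions, (H, W, 3)
    return dirs * depth[..., None]          # metric 3D points, (H, W, 3)
```

For example, calling rays_from_intrinsics(500.0, 500.0, 320.0, 240.0, 480, 640) and then pseudo_spherical_to_points(azimuth, elevation, depth) maps a (480, 640) depth map to a (480, 640, 3) metric point cloud.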

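The geometric invariance loss (item 3) can likewise be sketched as a feature-consistency term between two geometrically augmented views of the same image, warped back into a common frame. This is a hedged PyTorch illustration: the paper's loss acts on camera-prompted depth features, but the specific warping, distance function, and one-sided stop-gradient below are assumptions of this sketch rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def geometric_invariance_loss(feats_a, feats_b, grid_b_to_a):
    """Consistency between camera-prompted depth features of two
    geometrically augmented views of the same image.

    feats_a, feats_b: (B, C, H, W) depth feature maps from views A and B.
    grid_b_to_a:      (B, H, W, 2) sampling grid in [-1, 1] mapping view B
                      back into view A's frame (the inverse of the
                      geometric augmentation applied to B).
    """
    # Warp view B's features into view A's frame so they are comparable.
    feats_b_in_a = F.grid_sample(feats_b, grid_b_to_a, align_corners=False)
    # Detach one branch so view A is pulled toward a fixed target instead
    # of both branches drifting together.
    return F.mse_loss(feats_a, feats_b_in_a.detach())
```

During training, such a term would be added to the depth and camera losses with a weighting factor, so the camera-prompted features are explicitly pushed toward geometric invariance.
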
Implications and Future Perspectives

The introduction of UniDepth challenges the MMDE field by demonstrating that general-purpose depth estimation models can be trained without domain-specific tuning or reliance on camera intrinsics. Its architecture, particularly the disentangled depth-camera processing, sets a precedent for future research on models resilient to both domain shift and camera variation, a common issue in practical applications such as autonomous driving and robotics.

Because the model is agnostic to camera intrinsics, it can extend 3D perception to heterogeneous devices and scenarios without strict calibration demands, for example crowd-sourced image collections and in-the-wild video processing.

Looking ahead, further exploration of architectural mechanisms that decouple feature dependencies in deep models may yield even more capable solutions for dynamic environments. While UniDepth offers a robust foundation, complementary work on data augmentation and domain adaptation may further improve cross-domain performance.

Conclusion

UniDepth marks a significant step toward universally applicable monocular depth estimation, breaking away from the conventional dependence on homogeneous training environments. Its approach provides a blueprint for future models aiming for deployment across a wide spectrum of real-world conditions. The drive to generalize MMDE, and to integrate techniques such as UniDepth's pseudo-spherical representation, dense camera prompting, and geometric invariance loss, highlights important trends and directions in 3D perception research.