
SM4Depth: Seamless Monocular Metric Depth Estimation across Multiple Cameras and Scenes by One Model (2403.08556v2)

Published 13 Mar 2024 in cs.CV and cs.AI

Abstract: In the last year, universal monocular metric depth estimation (universal MMDE) has gained considerable attention, serving as a foundation model for various multimedia tasks such as video and image editing. Nonetheless, current approaches struggle to maintain consistent accuracy across diverse scenes without scene-specific parameters and pre-training, hindering the practicality of MMDE. Furthermore, these methods rely on extensive datasets comprising millions, if not tens of millions, of samples for training, leading to significant time and hardware expenses. This paper presents SM4Depth, a model that seamlessly works for both indoor and outdoor scenes, without requiring extensive training data or GPU clusters. First, to obtain consistent depth across diverse scenes, we propose a novel metric scale modeling scheme, i.e., variation-based unnormalized depth bins. It reduces the ambiguity of conventional metric bins and enables better adaptation to the large depth gaps across scenes during training. Second, we propose a "divide and conquer" solution to reduce reliance on massive training data: instead of estimating directly from the vast solution space, the metric bins are estimated from multiple solution sub-spaces to reduce complexity. Additionally, we introduce an uncut depth dataset, BUPT Depth, to evaluate depth accuracy and consistency across various indoor and outdoor scenes. Trained on a consumer-grade GPU using just 150K RGB-D pairs, SM4Depth achieves outstanding performance on most never-before-seen datasets, especially maintaining consistent accuracy across indoors and outdoors. The code can be found at https://github.com/mRobotit/SM4Depth.
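To make the two ideas in the abstract concrete, below is a minimal sketch of the bin-based metric depth paradigm the paper builds on: the network predicts bin widths over a depth range plus per-pixel probabilities over those bins, and depth is recovered as the probability-weighted sum of bin centers; a "divide and conquer" step first selects one of several depth sub-ranges instead of regressing over one vast range. All function names, shapes, and the example sub-ranges here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

# Hypothetical sketch (not the authors' code): recover metric depth from
# predicted bin widths and per-pixel bin probabilities, after picking a
# depth sub-range ("divide and conquer") from a small set of candidates.

def bins_to_depth(bin_widths, probs, d_min, d_max):
    """bin_widths: (N,) non-negative widths (normalized internally).
    probs: (H, W, N) per-pixel softmax scores over the N bins.
    Returns an (H, W) metric depth map in metres."""
    w = bin_widths / bin_widths.sum()                         # normalize widths
    edges = d_min + (d_max - d_min) * np.cumsum(np.concatenate([[0.0], w]))
    centers = 0.5 * (edges[:-1] + edges[1:])                  # (N,) bin centers
    return probs @ centers                                    # expectation over bins

# Assumed example sub-ranges, e.g. near indoor vs. mid/far outdoor scales.
SUB_SPACES = [(0.1, 10.0), (1.0, 80.0), (1.0, 250.0)]

def estimate_depth(scene_logits, bin_widths, probs):
    """Pick the most likely depth sub-space, then decode bins within it."""
    d_min, d_max = SUB_SPACES[int(np.argmax(scene_logits))]
    return bins_to_depth(bin_widths, probs, d_min, d_max)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H, W, N = 4, 6, 64
    probs = rng.random((H, W, N))
    probs /= probs.sum(-1, keepdims=True)                     # per-pixel softmax-like scores
    widths = rng.random(N)
    depth = estimate_depth(np.array([0.1, 2.0, 0.3]), widths, probs)
    print(depth.shape, float(depth.min()), float(depth.max()))
```

Decoding depth as an expectation over adaptive bins (rather than direct regression) is the common thread behind AdaBins-style methods; the paper's contribution, per the abstract, is making the bins unnormalized and variation-based and estimating them within selected sub-spaces.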

