Scalable Vision-Based 3D Object Detection and Monocular Depth Estimation for Autonomous Driving

Published 4 Mar 2024 in cs.CV and cs.RO | (2403.02037v1)

Abstract: This dissertation is a multifaceted contribution to the advancement of vision-based 3D perception technologies. In the first segment, the thesis introduces structural enhancements to both monocular and stereo 3D object detection algorithms. By integrating ground-referenced geometric priors into monocular detection models, this research achieves unparalleled accuracy in benchmark evaluations for monocular 3D detection. Concurrently, the work refines stereo 3D detection paradigms by incorporating insights and inferential structures gleaned from monocular networks, thereby augmenting the operational efficiency of stereo detection systems. The second segment is devoted to data-driven strategies and their real-world applications in 3D vision detection. A novel training regimen is introduced that amalgamates datasets annotated with either 2D or 3D labels. This approach not only augments the detection models through the utilization of a substantially expanded dataset but also facilitates economical model deployment in real-world scenarios where only 2D annotations are readily available. Lastly, the dissertation presents an innovative pipeline tailored for unsupervised depth estimation in autonomous driving contexts. Extensive empirical analyses affirm the robustness and efficacy of this newly proposed pipeline. Collectively, these contributions lay a robust foundation for the widespread adoption of vision-based 3D perception technologies in autonomous driving applications.
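As a rough illustration of what a ground-referenced geometric prior can look like, the sketch below computes the depth that a flat road would have at each image row for a level pinhole camera, using z = f_y * h / (v - c_y). This is an assumption made for illustration only, not the dissertation's actual formulation: the function name `ground_depth_per_row`, the parameters `fy`, `cy`, and `cam_height`, and the `max_depth` clamp are all hypothetical, and a real system would also have to handle camera pitch and non-flat roads.

```python
def ground_depth_per_row(height, fy, cy, cam_height, max_depth=200.0):
    """Depth (in metres) of a flat ground plane at each image row.

    Assumes a level pinhole camera mounted cam_height metres above the
    ground, with vertical focal length fy and principal-point row cy
    (both in pixels). All names here are illustrative, not from the paper.
    """
    depths = []
    for v in range(height):
        dv = v - cy  # pixel offset below the horizon row
        if dv > 0:
            # A ground point at depth z projects to v = fy * cam_height / z + cy.
            depths.append(min(fy * cam_height / dv, max_depth))
        else:
            # Rows at or above the horizon never intersect the ground plane.
            depths.append(max_depth)
    return depths


if __name__ == "__main__":
    # Toy values: 4-row image, fy = 100 px, horizon at row 1, camera 1.5 m high.
    print(ground_depth_per_row(4, fy=100.0, cy=1.0, cam_height=1.5))
```

Under these assumptions, the per-row depths could be broadcast across image columns and supplied to a monocular detector as an extra input channel or used as a depth anchor for objects resting on the road.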



Authors (1)
