Deep Learning for Image and Point Cloud Fusion in Autonomous Driving: A Review
Research on sensor fusion using deep learning in the domain of autonomous driving has seen a notable surge, driven by its potential to improve both environmental perception and system robustness. The reviewed paper, "Deep Learning for Image and Point Cloud Fusion in Autonomous Driving: A Review," offers a comprehensive examination of methodologies that combine camera and LiDAR data for vehicular applications. Its coverage is organized by task: depth completion, 3D object detection, semantic segmentation, tracking, and online cross-sensor calibration.
Depth Completion
The review explores depth completion techniques that densify sparse LiDAR point clouds by leveraging high-resolution images. It distinguishes between levels of fusion, particularly signal-level and feature-level approaches. Models such as Sparse2Dense+ and CSPN++ emerge as leading strategies, trained under either supervised or self-supervised schemes. Robust performance has been demonstrated on established benchmarks such as KITTI, with recent innovations learning convolutional kernel configurations dynamically to improve computational efficiency without sacrificing precision.
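To make the signal-level variant concrete, the sketch below shows the simplest form of early fusion: stacking an RGB image with a sparse depth map (LiDAR returns already projected into the camera frame) plus a validity mask into a single network input. Function and variable names are illustrative, not from the paper.

```python
import numpy as np

def build_early_fusion_input(rgb, sparse_depth):
    """Signal-level (early) fusion: stack RGB and sparse depth into one tensor.

    rgb:          (H, W, 3) uint8 camera image
    sparse_depth: (H, W) float32 depth map, 0 where no LiDAR return projects
    returns:      (H, W, 5) float32 network input (RGB + depth + validity mask)
    """
    rgb_norm = rgb.astype(np.float32) / 255.0
    depth = sparse_depth[..., None].astype(np.float32)
    valid = (depth > 0).astype(np.float32)  # tells the net where depth is observed
    return np.concatenate([rgb_norm, depth, valid], axis=-1)

# Example with a KITTI-sized frame
rgb = np.zeros((352, 1216, 3), dtype=np.uint8)
sd = np.zeros((352, 1216), dtype=np.float32)
x = build_early_fusion_input(rgb, sd)  # shape (352, 1216, 5)
```

Feature-level approaches differ only in where the two streams meet: each modality is first encoded separately and the fusion happens between intermediate feature maps rather than raw signals.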
3D Object Detection
In 3D object detection, the paper categorizes methodologies into sequential and one-step models, focusing predominantly on the former. Frustum-based approaches, including F-PointNet and IPOD, are highlighted for effectively limiting the 3D search space by first generating 2D proposals. The integration of image semantics into LiDAR data, as performed by PointPainting, offers a promising fusion direction that mitigates the resolution disparity between the two modalities. Multi-view and voxel-based methods also receive attention, with models such as MV3D and MVX-Net achieving strong performance by efficiently leveraging bird's eye view (BEV) mappings of point clouds.
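The core geometric step behind frustum-based detectors is simple to state: keep only the points whose image projection falls inside a 2D proposal box. A minimal sketch of that cropping step follows, assuming a pinhole intrinsic matrix and points already transformed into the camera frame (the interface is hypothetical, not F-PointNet's actual API).

```python
import numpy as np

def frustum_points(points_cam, K, box2d):
    """Keep LiDAR points whose image projection lies inside a 2D proposal.

    points_cam: (N, 3) points in the camera frame (x right, y down, z forward)
    K:          (3, 3) pinhole intrinsics
    box2d:      (u_min, v_min, u_max, v_max) detection box in pixels
    """
    z = points_cam[:, 2]
    in_front = z > 0                                # discard points behind the camera
    uvw = points_cam @ K.T                          # project all points at once
    u = uvw[:, 0] / np.clip(uvw[:, 2], 1e-6, None)
    v = uvw[:, 1] / np.clip(uvw[:, 2], 1e-6, None)
    u_min, v_min, u_max, v_max = box2d
    inside = (u >= u_min) & (u <= u_max) & (v >= v_min) & (v <= v_max)
    return points_cam[in_front & inside]            # the frustum crop fed to a 3D network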
Semantic Segmentation and Tracking
For semantic segmentation, the paper contrasts 2D and 3D approaches, noting that fusion-based segmentation networks such as MVPNet exploit the geometric fidelity of LiDAR to improve per-point classification accuracy. For instance segmentation, efforts such as 3D-SIS explore voxel-wise approaches that delineate individual object instances.
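A common recipe for lifting 2D predictions onto points, in the spirit of the camera-to-LiDAR fusion these segmentation networks rely on, is to project each point into the image and append the per-pixel class scores to its coordinates. The sketch below assumes a per-pixel score map from any 2D segmentation network; it is an illustration of the general idea, not MVPNet's architecture.

```python
import numpy as np

def paint_points(points_cam, K, semantic_scores):
    """Append per-pixel semantic scores to each LiDAR point that projects into the image.

    points_cam:      (N, 3) points in the camera frame
    K:               (3, 3) pinhole intrinsics
    semantic_scores: (H, W, C) per-class scores from a 2D segmentation net
    returns:         (M, 3 + C) painted points (only those landing inside the image)
    """
    H, W, C = semantic_scores.shape
    uvw = points_cam @ K.T
    z = np.clip(uvw[:, 2], 1e-6, None)
    u = np.round(uvw[:, 0] / z).astype(int)
    v = np.round(uvw[:, 1] / z).astype(int)
    ok = (points_cam[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    return np.concatenate([points_cam[ok], semantic_scores[v[ok], u[ok]]], axis=1)
```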
Tracking in autonomous systems is approached through Detection-Based Tracking (DBT) and Detection-Free Tracking (DFT) frameworks. In this context, the tracking-by-detection paradigm associates sequential detections using strategies such as min-cost flow, with models like mmMOT improving association through robust multi-modal adjacency learning.
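As a simplified stand-in for the min-cost-flow association the paper discusses, the sketch below matches tracks to detections frame by frame with a Hungarian assignment over centroid distances; learned methods such as mmMOT replace this hand-crafted cost with a learned multi-modal affinity. The gating threshold and interface are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(tracks, detections, max_dist=2.0):
    """Match existing tracks to new detections by 3D centroid distance.

    tracks:     (T, 3) last known object centroids
    detections: (D, 3) centroids detected in the current frame
    returns:    list of (track_idx, det_idx) pairs within the gating distance
    """
    if len(tracks) == 0 or len(detections) == 0:
        return []
    cost = np.linalg.norm(tracks[:, None, :] - detections[None, :, :], axis=-1)  # (T, D)
    rows, cols = linear_sum_assignment(cost)        # globally optimal 1-to-1 matching
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] <= max_dist]
```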
Online Cross-Sensor Calibration
The paper concludes with insights into online calibration challenges, vital for maintaining sensor alignment during vehicle operation. Classical approaches are compared against deep learning strategies such as CalibNet, which optimizes calibration using both geometric and photometric metrics in a self-supervised manner. Integrating calibration transparently into the perception stack remains an ongoing research imperative.
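To illustrate the kind of geometric-consistency signal such self-supervised methods optimize (not CalibNet's actual loss, which is network-predicted), the sketch below scores a candidate extrinsic by projecting LiDAR points through it and comparing their depths against a reference depth map; a miscalibrated transform lands points on the wrong pixels and inflates the residual.

```python
import numpy as np

def calibration_depth_error(points_lidar, T, K, depth_map):
    """Geometric consistency residual for a candidate LiDAR-to-camera extrinsic.

    points_lidar: (N, 3) raw LiDAR points
    T:            (4, 4) candidate extrinsic (LiDAR -> camera)
    K:            (3, 3) camera intrinsics
    depth_map:    (H, W) reference depth image (e.g. from a depth-prediction net)
    """
    H, W = depth_map.shape
    pts_h = np.concatenate([points_lidar, np.ones((len(points_lidar), 1))], axis=1)
    cam = (pts_h @ T.T)[:, :3]                      # points in the camera frame
    z = cam[:, 2]
    uvw = cam @ K.T
    u = np.round(uvw[:, 0] / np.clip(z, 1e-6, None)).astype(int)
    v = np.round(uvw[:, 1] / np.clip(z, 1e-6, None)).astype(int)
    ok = (z > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    # Mean absolute depth residual over points that land inside the image
    return np.mean(np.abs(z[ok] - depth_map[v[ok], u[ok]]))
```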
Implications and Future Directions
The implications of this research extend into the practical domain, where autonomous driving systems aim to achieve superior reliability and safety. By addressing each task's challenges and surveying innovative fusion methods, the paper establishes a foundation for enhancing system robustness. Future directions suggested by the authors include advancing sensor-agnostic frameworks, embracing unsupervised learning paradigms, and incorporating temporal context to improve prediction accuracy and responsiveness.
In summary, the paper not only surveys the state-of-the-art in multi-modal fusion for autonomous driving but also stimulates discourse on optimizing and integrating these technologies into real-world applications. The evolution towards holistic and computationally viable fusion methods remains central to closing the gap between current academic results and application-level demands in dynamic driving environments.