- The paper introduces a novel cascade bi-directional fusion module (CB-Fusion) that combines image and LiDAR features at multiple scales for enhanced 3D detection.
- The architecture features a two-stream Region Proposal Network (RPN) and a Multi-Modal Consistency loss to align sensor predictions and improve proposal quality.
- Empirical results on KITTI, SUN-RGBD, and JRDB show EPNet++ significantly outperforms existing methods, especially in sparse point cloud scenarios.
EPNet++: Cascade Bi-directional Fusion for Multi-Modal 3D Object Detection
The paper "EPNet++: Cascade Bi-directional Fusion for Multi-Modal 3D Object Detection" introduces a novel approach for enhancing 3D object detection capabilities by leveraging the complementary information provided by LiDAR point clouds and camera images. The proposed framework, EPNet++, implements a Cascade Bi-directional Fusion module (CB-Fusion) and a Multi-Modal Consistency loss (MC loss) to bolster 3D object detection robustness, especially in sparse point cloud scenarios.
Technical Overview
EPNet++ comprises two primary components: a two-stream Region Proposal Network (RPN) and a refinement network. The architecture integrates LiDAR and camera data at both the feature level and the loss level. The CB-Fusion module enables bidirectional interaction between image and point cloud features, propagating enriched features so that the resulting 3D detections are more accurate and more consistent across modalities. This two-way fusion starts by enriching image features with point cloud features, then reciprocally enhances point cloud features with the semantic richness of the image data.
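To make the bidirectional idea concrete, here is a minimal PyTorch sketch of such a fusion block. The tensor shapes, the global pooling used for the point-to-image direction, the gating scheme, and the assumption that projected image coordinates `uv` are provided are all simplifications for illustration, not the paper's exact CB-Fusion implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BiDirectionalFusion(nn.Module):
    """Illustrative bi-directional fusion block (simplified CB-Fusion-style idea).

    Point features first enrich the image feature map; the enriched image
    features are then sampled back to enhance the point features.
    """

    def __init__(self, point_dim, image_dim):
        super().__init__()
        # Maps point features into the image feature space (points -> image).
        self.point_to_image = nn.Linear(point_dim, image_dim)
        # Maps sampled image features into the point feature space (image -> points).
        self.image_to_point = nn.Linear(image_dim, point_dim)
        # Learned gate weighting how much image semantics flow into each point.
        self.gate = nn.Sequential(nn.Linear(point_dim + image_dim, 1), nn.Sigmoid())

    def forward(self, point_feats, image_feats, uv):
        """
        point_feats: (B, N, Cp)   per-point features from the geometric stream
        image_feats: (B, Ci, H, W) feature map from the image stream
        uv:          (B, N, 2)    normalized image coordinates in [-1, 1] of each
                                  projected LiDAR point (projection assumed given)
        """
        # --- Points -> Image: inject point context into the image feature map.
        # A true per-pixel scatter is replaced here by a pooled global context.
        proj = self.point_to_image(point_feats)                     # (B, N, Ci)
        image_enriched = image_feats + proj.mean(dim=1)[:, :, None, None]

        # --- Image -> Points: bilinearly sample enriched image features.
        grid = uv.unsqueeze(2)                                      # (B, N, 1, 2)
        sampled = F.grid_sample(image_enriched, grid, align_corners=True)
        sampled = sampled.squeeze(-1).transpose(1, 2)               # (B, N, Ci)

        # Gated residual fusion into the point stream.
        w = self.gate(torch.cat([point_feats, sampled], dim=-1))    # (B, N, 1)
        fused_points = point_feats + w * self.image_to_point(sampled)
        return fused_points, image_enriched


# Toy usage with random tensors.
if __name__ == "__main__":
    fusion = BiDirectionalFusion(point_dim=64, image_dim=128)
    pts = torch.randn(2, 1024, 64)
    img = torch.randn(2, 128, 48, 156)
    uv = torch.rand(2, 1024, 2) * 2 - 1
    fused_pts, enriched_img = fusion(pts, img, uv)
    print(fused_pts.shape, enriched_img.shape)
```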
The two-stream RPN consists of an image stream and a geometric stream, bridged through CB-Fusion and LI-Fusion modules. The image stream employs convolutional layers to extract image features at multiple scales, while the geometric stream uses PointNet++ with Set Abstraction layers to process the LiDAR points hierarchically. Fusion is applied at each of these scales so that the two streams exchange information throughout the backbone rather than only at the end.
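The following sketch outlines that two-stream, multi-scale structure. The layer widths, the number of scales, the per-point MLPs standing in for PointNet++ Set Abstraction layers, and the placeholder concat-based fusion are illustrative assumptions, not the paper's actual backbone.

```python
import torch
import torch.nn as nn


class SimpleFuse(nn.Module):
    """Placeholder fusion: pools the image map and mixes it into point features."""

    def __init__(self, point_dim, image_dim):
        super().__init__()
        self.mlp = nn.Linear(point_dim + image_dim, point_dim)

    def forward(self, point_feats, image_feats):
        # point_feats: (B, N, Cp), image_feats: (B, Ci, H, W)
        ctx = image_feats.mean(dim=(2, 3))                       # global image context
        ctx = ctx.unsqueeze(1).expand(-1, point_feats.size(1), -1)
        return self.mlp(torch.cat([point_feats, ctx], dim=-1))


class TwoStreamRPN(nn.Module):
    """Two parallel backbones fused at every scale (schematic only)."""

    def __init__(self, scales=(64, 128, 256)):
        super().__init__()
        # Image stream: strided conv blocks producing feature maps at several scales.
        self.image_blocks = nn.ModuleList()
        in_c = 3
        for c in scales:
            self.image_blocks.append(
                nn.Sequential(nn.Conv2d(in_c, c, 3, stride=2, padding=1), nn.ReLU())
            )
            in_c = c
        # Geometric stream: per-point MLPs standing in for PointNet++ SA layers.
        self.point_blocks = nn.ModuleList()
        in_p = 3
        for c in scales:
            self.point_blocks.append(nn.Sequential(nn.Linear(in_p, c), nn.ReLU()))
            in_p = c
        # One fusion module per scale, cascading image semantics into the points.
        self.fusions = nn.ModuleList(SimpleFuse(c, c) for c in scales)

    def forward(self, image, points):
        img_feat, pt_feat = image, points
        for img_block, pt_block, fuse in zip(self.image_blocks, self.point_blocks, self.fusions):
            img_feat = img_block(img_feat)      # image features at this scale
            pt_feat = pt_block(pt_feat)         # point features at this scale
            pt_feat = fuse(pt_feat, img_feat)   # inject image semantics into points
        return pt_feat, img_feat


if __name__ == "__main__":
    rpn = TwoStreamRPN()
    pts, img = rpn(torch.randn(2, 3, 96, 312), torch.randn(2, 2048, 3))
    print(pts.shape, img.shape)
```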
Significantly, the MC loss aligns the confidence estimates from both sensor modalities (LiDAR and image), ensuring consistency in predictions and facilitating high-quality proposal selection during the RPN stage.
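A simplified sketch of such a consistency term is shown below. It supervises the per-point foreground confidences of both streams with the ground-truth labels and adds a penalty when the two modalities disagree; the specific formulation (cross-entropy plus an MSE consistency term with a hypothetical `weight` parameter) is an assumption for illustration and may differ from the paper's exact MC loss.

```python
import torch
import torch.nn.functional as F


def multimodal_consistency_loss(point_conf, image_conf, fg_mask, weight=1.0):
    """Simplified consistency term between per-point confidences of the two streams.

    point_conf, image_conf: (B, N) foreground probabilities in [0, 1] predicted from
        the geometric stream and from image features sampled at each point.
    fg_mask: (B, N) binary ground-truth foreground labels.
    """
    # Standard supervision for each stream.
    cls_loss = F.binary_cross_entropy(point_conf, fg_mask) + \
               F.binary_cross_entropy(image_conf, fg_mask)
    # Consistency: pull the two confidence estimates toward each other.
    consistency = F.mse_loss(point_conf, image_conf)
    return cls_loss + weight * consistency


if __name__ == "__main__":
    p = torch.rand(2, 1024)
    i = torch.rand(2, 1024)
    gt = (torch.rand(2, 1024) > 0.7).float()
    print(multimodal_consistency_loss(p, i, gt))
```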
Empirical Results
The approach delivered measurable improvements on benchmark datasets, including KITTI, SUN-RGBD, and JRDB. In particular, EPNet++ outperformed existing state-of-the-art methods in sparse point cloud settings, highlighting its potential to reduce dependence on high-resolution LiDAR sensors.
On KITTI, EPNet++ surpassed competing methods in the Car and Pedestrian categories across difficulty levels, reinforcing its efficacy. On the JRDB dataset, it outperformed other methods by a significant margin, affirming its robustness in diverse environments. On SUN-RGBD, adding the CB-Fusion and LI-Fusion modules to VoteNet and ImVoteNet improved their performance, demonstrating the versatility of the proposed modules across different architectures.
Implications and Future Directions
Deploying EPNet++ can reduce the cost and improve the operational efficiency of autonomous systems that rely on LiDAR sensors. Beyond lowering sensor expenses, the approach improves detection accuracy and reliability, encouraging further exploration of bi-directional multi-modal fusion strategies.
Looking ahead, exploring streamlined network architectures and deeper image feature extractors remains a promising path. Evaluating the method on even sparser LiDAR data could yield further practical insights and guide refinements for varied real-world scenarios. Extending the approach to transformer-based architectures would also test its scalability and adaptability within modern AI paradigms.
The paper deepens our understanding of sensor fusion within AI frameworks, contributing to both the theory and practice of 3D object detection. It opens pathways for fusion strategies that exploit the complementarity of modalities in more sophisticated ways.