Deep Continuous Fusion for Multi-Sensor 3D Object Detection (2012.10992v1)

Published 20 Dec 2020 in cs.CV

Abstract: In this paper, we propose a novel 3D object detector that can exploit both LIDAR as well as cameras to perform very accurate localization. Towards this goal, we design an end-to-end learnable architecture that exploits continuous convolutions to fuse image and LIDAR feature maps at different levels of resolution. Our proposed continuous fusion layer encodes both discrete-state image features as well as continuous geometric information. This enables us to design a novel, reliable and efficient end-to-end learnable 3D object detector based on multiple sensors. Our experimental evaluation on both KITTI as well as a large scale 3D object detection benchmark shows significant improvements over the state of the art.

Authors (4)
  1. Ming Liang (40 papers)
  2. Bin Yang (320 papers)
  3. Shenlong Wang (70 papers)
  4. Raquel Urtasun (161 papers)
Citations (800)

Summary

Deep Continuous Fusion for Multi-Sensor 3D Object Detection: An Expert Overview

The paper "Deep Continuous Fusion for Multi-Sensor 3D Object Detection" presents an innovative approach to 3D object detection that leverages both LIDAR and camera inputs. The central contribution of this work is an end-to-end learnable architecture incorporating continuous convolutions to seamlessly fuse image and LIDAR feature maps across multiple resolution levels. This architecture enables accurate localization and detection of objects, crucial for applications such as autonomous driving.

Methodology

The proposed framework, termed continuous fusion, addresses two core challenges in multi-sensor fusion: the sparsity of LIDAR data and the discrete nature of image features. The heart of this approach is the continuous fusion layer, which produces a dense fused representation by using a multi-layer perceptron (MLP) to interpolate image features onto bird's eye view (BEV) feature maps derived from LIDAR inputs.
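In schematic form (the notation below is ours, not taken verbatim from the paper), the fused feature at a target BEV cell $i$ can be written as a sum over its $k$ nearest LIDAR points:

$$h_i = \sum_{j \in \mathrm{KNN}(i)} \mathrm{MLP}\big(\,[\, f_j,\ x_j - x_i \,]\,\big),$$

where $f_j$ is the image feature retrieved by projecting LIDAR point $j$ into the camera view, and $x_j - x_i$ is the 3D offset between that point and the target BEV cell.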

Continuous Fusion Layer

The continuous fusion layer operationalizes a continuous convolution mechanism, extending traditional grid-based convolutions to unstructured data domains inherent in multi-sensor setups. By pooling features from the k-nearest neighbors in BEV space and leveraging geometric offsets for precise fusion, this layer bridges the disparity between sparse LIDAR points and dense image representations.
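The following PyTorch-style sketch illustrates this idea under simplifying assumptions; the class name, tensor shapes, and the upstream KNN/projection steps (which would gather image features and compute offsets) are hypothetical, not the authors' released implementation.

```python
# Minimal sketch of a continuous-fusion-style layer. Illustrative only:
# names, shapes, and the upstream KNN/projection helpers are assumptions,
# not the authors' released code.
import torch
import torch.nn as nn


class ContinuousFusionLayer(nn.Module):
    def __init__(self, image_channels: int, bev_channels: int, k: int = 3, hidden: int = 64):
        super().__init__()
        self.k = k
        # The MLP consumes, per neighbor, the retrieved image feature
        # concatenated with the 3D geometric offset to the target BEV cell.
        self.mlp = nn.Sequential(
            nn.Linear(image_channels + 3, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, bev_channels),
        )

    def forward(self, neighbor_image_feats: torch.Tensor, neighbor_offsets: torch.Tensor) -> torch.Tensor:
        # neighbor_image_feats: (N, k, C_img) image features gathered at the
        #   pixels where the k nearest LIDAR points project (via calibration).
        # neighbor_offsets: (N, k, 3) 3D offsets from each neighbor point to
        #   the target BEV cell center.
        per_neighbor = self.mlp(torch.cat([neighbor_image_feats, neighbor_offsets], dim=-1))
        # Pool over the k neighbors to obtain one fused feature per BEV cell;
        # this output is then added to the corresponding LIDAR BEV feature map.
        return per_neighbor.sum(dim=1)  # (N, C_bev)
```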

Network Architecture

The overall architecture includes two parallel streams: one for image features based on ResNet-18 and another for BEV features extracted from voxelized LIDAR data. Fusion occurs at multiple scales within four intermediate layers, ensuring comprehensive multi-resolution integration. Detection headers then produce the final 3D bounding boxes, integrating both classification and regression losses into a unified objective function.
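A schematic skeleton of this two-stream design is sketched below; the module names, stage counts, and the fusion callable's signature are assumptions for illustration, with only the overall structure (image stream, BEV stream, per-scale fusion, detection header) following the description above.

```python
# Schematic two-stream skeleton; a sketch, not the authors' implementation.
import torch.nn as nn


class TwoStreamDetector(nn.Module):
    def __init__(self, image_backbone, bev_stages, fusion_layers, header):
        super().__init__()
        self.image_backbone = image_backbone                # e.g. a ResNet-18 feature extractor
        self.bev_stages = nn.ModuleList(bev_stages)         # conv stages over voxelized LIDAR BEV
        self.fusion_layers = nn.ModuleList(fusion_layers)   # one fusion module per scale
        self.header = header                                # classification + box regression

    def forward(self, image, bev, calib):
        image_feats = self.image_backbone(image)            # multi-scale image features
        x = bev
        for stage, fuse in zip(self.bev_stages, self.fusion_layers):
            x = stage(x)
            # Project image features into BEV at this resolution and add them
            # to the LIDAR stream (fusion at an intermediate layer).
            x = x + fuse(image_feats, x, calib)
        # Dense detection header produces per-cell class scores and 3D boxes.
        return self.header(x)
```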

Experimental Evaluation

The approach was evaluated on the KITTI dataset and a large-scale proprietary dataset named TOR4D. On KITTI, the proposed method showed significant improvements over prior methods in both BEV and 3D object detection. Specifically, it achieved state-of-the-art performance with AP scores of 82.54% (easy), 66.22% (moderate), and 64.04% (hard) in 3D object detection. In BEV detection, similarly strong results underscored the efficacy of the fusion approach.

Performance metrics from TOR4D further validated the method’s scalability and robustness, particularly in long-range detection scenarios. The continuous fusion model surpassed alternatives in detecting vehicles, pedestrians, and bicyclists, achieving higher precision across various distance ranges, a critical feature for autonomous driving systems.

Implications and Future Work

The practical implications of this research are significant, particularly for autonomous vehicles, where the ability to accurately perceive and localize objects improves navigation safety and efficiency. The continuous fusion mechanism offers a robust alternative to existing fusion strategies, combining the geometric precision of LIDAR with the detailed context provided by camera images. The successful integration of high-resolution images, without additional ground-truth image annotations, highlights the scalability of the method.

From a theoretical perspective, this work opens new possibilities in the domain of sensor fusion by demonstrating the power of continuous convolutions. Future research could explore further optimizations to extend the method’s applicability, such as handling dynamic environments more effectively or integrating additional sensor modalities. Expansion to other domains like robotics, where sensor fusion is pivotal, is another promising direction.

In conclusion, "Deep Continuous Fusion for Multi-Sensor 3D Object Detection" presents a sophisticated, scalable solution to 3D object detection. By leveraging continuous convolutions within an end-to-end learnable framework, the approach achieves significant advancements over existing methods, with implications for enhancing autonomous system capabilities across various applications.