- The paper introduces a unified deep learning framework that jointly handles 3D detection, tracking, and motion forecasting by processing sensor data in real time.
- It employs a bird’s eye view representation and 3D convolutions on spatio-temporal data to enhance detection accuracy and mitigate occlusion challenges.
- Experimental results demonstrate that the model outperforms state-of-the-art detectors such as SSD in mean average precision while running at roughly 33 FPS on urban driving data.
Fast and Furious: Real Time End-to-End 3D Detection, Tracking and Motion Forecasting with a Single Convolutional Net
The paper presents a comprehensive deep learning framework designed to address the key tasks of 3D detection, tracking, and motion forecasting in autonomous driving scenarios. The proposed model, referred to as Fast and Furious (FaF), is a single fully convolutional network that processes 3D sensor data in real time, running at roughly 33 frames per second.
Joint 3D Detection, Tracking, and Motion Forecasting
The authors critique the traditional cascade approach in autonomous vehicle systems, in which each component of the task pipeline is learned independently; errors made by upstream modules cannot be corrected downstream and can compound into catastrophic failures. To counter these limitations, the paper introduces an end-to-end model that learns the tasks jointly from spatio-temporal 3D sensor data. This holistic approach mitigates occlusion and data sparsity, two common challenges in autonomous driving.
Architecture and Methodology
The FaF model represents its input in a bird's eye view (BEV), which preserves the metric structure of the 3D sensor data. A 4D input tensor is built from several consecutive point-cloud sweeps, and 3D convolutions are applied jointly over space and time. Exploiting this temporal context improves detection accuracy and, because object motion is observed directly across frames, supports both tracking and forecasting of vehicle trajectories.
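To make the input construction concrete, the sketch below voxelizes a few LiDAR sweeps into binary BEV occupancy grids, stacks them along a temporal axis, and applies a single 3D convolution over time and the two ground-plane axes. All grid ranges, resolutions, and channel counts here are illustrative placeholders chosen to keep the example light, not the paper's configuration, and PyTorch stands in for whatever framework the authors used.

```python
import torch
import torch.nn as nn

def voxelize_bev(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
                 z_range=(-2.0, 3.0), resolution=0.5):
    """Quantize one LiDAR sweep of shape (N, 3) into a binary bird's eye
    view occupancy grid of shape (height_bins, x_cells, y_cells).
    Ranges and resolution are illustrative, not the paper's values."""
    nx = int((x_range[1] - x_range[0]) / resolution)
    ny = int((y_range[1] - y_range[0]) / resolution)
    nz = int((z_range[1] - z_range[0]) / resolution)
    grid = torch.zeros(nz, nx, ny)
    ix = ((points[:, 0] - x_range[0]) / resolution).long()
    iy = ((points[:, 1] - y_range[0]) / resolution).long()
    iz = ((points[:, 2] - z_range[0]) / resolution).long()
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny) & (iz >= 0) & (iz < nz)
    grid[iz[keep], ix[keep], iy[keep]] = 1.0
    return grid

# Stack T consecutive sweeps into a 4D tensor (time, height, x, y).
sweeps = [torch.rand(5000, 3) * torch.tensor([70.0, 80.0, 5.0])
          + torch.tensor([0.0, -40.0, -2.0]) for _ in range(5)]
frames = torch.stack([voxelize_bev(p) for p in sweeps])      # (T, Z, X, Y)

# Treat height bins as channels so Conv3d convolves over (time, x, y).
x = frames.permute(1, 0, 2, 3).unsqueeze(0)                   # (1, Z, T, X, Y)
conv3d = nn.Conv3d(in_channels=x.shape[1], out_channels=32,
                   kernel_size=3, padding=(0, 1, 1))
features = conv3d(x)   # temporal extent shrinks, spatial extent is preserved
print(features.shape)  # torch.Size([1, 32, 3, 140, 160])
```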
Notably, the model supports both early and late fusion strategies for aggregating the temporal dimension, trading computational cost against how much temporal structure the features capture: early fusion collapses the frames at the first layer, while late fusion merges them gradually over several layers. Detection uses predefined (anchor-style) boxes at each feature-map location, and training minimizes a loss that combines a classification term over these boxes with a regression term against the ground-truth box parameters for the current and forecast frames.
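As a rough illustration of how such a joint objective can be assembled, the snippet below combines a binary cross-entropy classification term over all predefined boxes with a smooth L1 regression term over the positive anchors, summed across the current and forecast timesteps. The tensor shapes, box parametrization, and loss weighting are assumptions made for the sketch; the paper's actual formulation includes details such as hard negative mining that are omitted here.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, box_preds, cls_targets, box_targets, pos_mask,
                   reg_weight=1.0):
    """Illustrative joint detection/forecasting loss.

    cls_logits:  (A,)        objectness score per predefined box (anchor)
    box_preds:   (A, T, 6)   per-anchor box offsets for T timesteps,
                             e.g. (x, y, w, l, sin(theta), cos(theta))
    cls_targets: (A,) float  1.0 for anchors matched to ground truth, else 0.0
    box_targets: (A, T, 6)   regression targets for matched anchors
    pos_mask:    (A,) bool   which anchors are positive
    """
    # Classification over all anchors.
    cls_loss = F.binary_cross_entropy_with_logits(cls_logits, cls_targets)
    # Regression only over anchors matched to a ground-truth object,
    # covering both the current frame and the forecast frames.
    if pos_mask.any():
        reg_loss = F.smooth_l1_loss(box_preds[pos_mask], box_targets[pos_mask])
    else:
        reg_loss = box_preds.sum() * 0.0  # keep the graph valid with no positives
    return cls_loss + reg_weight * reg_loss
```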
Experimental Results
The paper's experimental evaluation, based on a large dataset collected from autonomous vehicles driving in several North American cities, demonstrates the advantage of FaF over state-of-the-art detectors such as SSD and MobileNet. FaF achieves higher mean average precision (mAP) than these baselines across multiple Intersection over Union (IoU) thresholds, performs consistently across varying numbers of LiDAR points per object, and remains robust when detecting distant objects.
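For reference, mAP at a given IoU threshold counts a detection as correct only if its overlap with a ground-truth box meets that threshold. The helper below computes pairwise IoU for axis-aligned BEV boxes given as (x_center, y_center, width, length); it is a simplified stand-in, since a full evaluation of oriented vehicles would use rotated-box IoU.

```python
import torch

def bev_iou(boxes_a, boxes_b):
    """Pairwise axis-aligned IoU between BEV boxes (x_center, y_center, w, l)."""
    def to_corners(b):
        x, y, w, l = b.unbind(-1)
        return torch.stack([x - w / 2, y - l / 2, x + w / 2, y + l / 2], dim=-1)

    a = to_corners(boxes_a)[:, None, :]            # (Na, 1, 4)
    b = to_corners(boxes_b)[None, :, :]            # (1, Nb, 4)
    lt = torch.maximum(a[..., :2], b[..., :2])     # intersection lower-left
    rb = torch.minimum(a[..., 2:], b[..., 2:])     # intersection upper-right
    wh = (rb - lt).clamp(min=0)
    inter = wh[..., 0] * wh[..., 1]
    area_a = boxes_a[:, 2] * boxes_a[:, 3]
    area_b = boxes_b[:, 2] * boxes_b[:, 3]
    union = area_a[:, None] + area_b[None, :] - inter
    return inter / union.clamp(min=1e-6)

# Example: one predicted box vs. one ground-truth box.
pred = torch.tensor([[10.0, 2.0, 2.0, 5.0]])
gt = torch.tensor([[10.5, 2.0, 2.0, 5.0]])
print(bev_iou(pred, gt))  # ~0.6, a match at IoU 0.5 but not at 0.7
```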
Implications and Future Prospects
The proposed method sets a new benchmark for real-time autonomous vehicle perception systems. By effectively integrating detection, tracking, and motion forecasting, it holds significant potential for advancing the safety and reliability of autonomous driving technologies. Additionally, the demonstrated efficiency in harnessing temporal data could be leveraged for further applications in dynamic environments beyond vehicular contexts.
Looking ahead, the paper suggests potential improvements through the integration of techniques such as RoI align for enhanced feature representation and expanding the system's applicability to other object categories, including pedestrians. Furthermore, exploring extended prediction horizons could provide more comprehensive future trajectory estimations.
In conclusion, the FaF network represents a notable stride in the field of autonomous driving research, offering a practical and theoretically sound framework for multi-task neural networks in real-time applications. Its impactful results invite further exploration into integrated perception models and emphasize the growing importance of efficient, end-to-end learning systems in complex, real-world environments.