- The paper introduces a novel method for 3D multi-object tracking from monocular cameras by combining deep learning-based detection and distance estimation with Poisson multi-Bernoulli mixture (PMBM) filtering for tracking.
- The technical approach uses a convolutional neural network for 2D object detection and per-object distance estimation from single images, and employs the PMBM filter to handle the complex data association needed for robust 3D trajectory estimation.
- Experimental results on the KITTI dataset show the algorithm achieves competitive 2D tracking performance with low identity switches and near real-time processing, demonstrating the effectiveness of the combined detection and filtering framework.
Overview of Mono-Camera 3D Multi-Object Tracking Using Deep Learning Detections and PMBM Filtering
The paper "Mono-Camera 3D Multi-Object Tracking Using Deep Learning Detections and PMBM Filtering" introduces a novel approach to multi-object tracking (MOT) that processes data from monocular cameras, which are prevalent in the automotive industry and particularly in autonomous vehicle systems. The core challenge addressed is the inherent limitation of monocular cameras: they cannot directly capture 3D spatial information. The authors propose a solution that leverages deep neural networks for object detection and Poisson multi-Bernoulli mixture (PMBM) filtering for tracking, ultimately enabling 3D object trajectory estimation from 2D images.
Technical Approach
The algorithm consists of two principal components:
- Object Detection:
- A convolutional neural network (CNN) processes each image frame and outputs detections comprising both a bounding box and a distance estimate for each detected object. The architecture builds on the DRN-C-26 model with its final classification layers removed, since the network's task here is detection and distance regression rather than image classification.
- The distance estimates are learned from annotated lidar data, which provides ground-truth ranges during training. Training combines a cross-entropy loss for classification with smooth L1 losses for regression, and Soft-NMS is applied to prune overlapping bounding-box proposals.
- Object Tracking:
- The PMBM filter receives the detections and recursively estimates the multi-object distribution over the tracking sequence. It handles the complexities of data association, such as assigning detections to new or existing trajectories, while dealing with false alarms and missed detections.
- The filter is grounded in the random finite set (RFS) framework: a Poisson point process (PPP) models objects that have never been detected (including newly appearing ones), while Bernoulli components represent objects that have been detected at least once.
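The Soft-NMS step mentioned in the detection component can be sketched as follows. This is a generic illustration of Gaussian Soft-NMS (score decay instead of hard suppression), not the paper's exact implementation; the `sigma` and `score_thresh` values are assumptions for illustration:

```python
import math

def iou(a, b):
    # Boxes as (x1, y1, x2, y2); returns intersection-over-union.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping proposals
    instead of discarding them outright (sigma is a tuning assumption)."""
    dets = list(zip(boxes, scores))
    kept = []
    while dets:
        # Pick the remaining proposal with the highest score.
        i = max(range(len(dets)), key=lambda k: dets[k][1])
        box, score = dets.pop(i)
        kept.append((box, score))
        # Decay remaining scores by their overlap with the pick.
        dets = [(b, s * math.exp(-iou(box, b) ** 2 / sigma))
                for b, s in dets]
        dets = [(b, s) for b, s in dets if s > score_thresh]
    return kept
```

Unlike hard NMS, heavily overlapping detections survive with reduced scores, which helps in the crowded, partially occluded scenes the paper highlights.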
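How the PMBM filter handles a missed detection can be illustrated with the standard Bernoulli and PPP update rules from the RFS literature: an undetected object becomes less likely to exist, in proportion to the detection probability. A minimal sketch (function names and the `p_detect` value are illustrative, not taken from the paper):

```python
def bernoulli_miss_update(r, p_detect):
    """Posterior existence probability of a Bernoulli component
    (a previously detected object) that received no associated
    detection this frame: r' = r(1 - pD) / (1 - r * pD)."""
    return r * (1.0 - p_detect) / (1.0 - r * p_detect)

def ppp_miss_thinning(weight, p_detect):
    """Thinning of the PPP intensity for never-detected objects:
    remaining-undetected mass is scaled by (1 - pD)."""
    return weight * (1.0 - p_detect)
```

For example, with a detection probability of 0.9, a track with existence probability 0.9 that goes unmatched for one frame drops to about 0.47; repeated misses drive it toward zero, at which point the component can be pruned.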
Experimental Results
The algorithm's performance was evaluated on the KITTI tracking dataset, a standard benchmark for automotive object detection and tracking. The paper reports competitive results, particularly in complex scenarios with overlapping objects, and near real-time performance at about 20 frames per second. On 2D MOT metrics the method is comparable to leading approaches, and it stands out for its low number of identity switches, showcasing robustness in data association.
Implications and Future Work
From a practical standpoint, this research advances the capability of monocular camera systems to perform MOT in three-dimensional spaces, enhancing situational awareness for autonomous vehicles. The theoretical merit lies in the synthesis of deep learning detections with statistical multi-object tracking models like PMBM, encapsulating advancements in both fields.
Future developments could focus on further refining distance estimation, possibly integrating sensor fusion techniques if additional sensor data becomes available (e.g., radar or stereo cameras). Also, the extension of this work to account for more complex dynamic environments, such as urban driving scenarios with more diverse object classes like pedestrians and cyclists, could be fruitful. Improving model generalizability across different operational conditions, such as varying lighting and weather, remains another technical challenge worth addressing.
In sum, this paper details a significant step toward efficient and effective 3D MOT using monocular cameras, striking a balance between detection accuracy and tracking reliability within computational constraints.