An LSTM Approach to Temporal 3D Object Detection in LiDAR Point Clouds (2007.12392v1)

Published 24 Jul 2020 in cs.CV, cs.LG, and eess.IV

Abstract: Detecting objects in 3D LiDAR data is a core technology for autonomous driving and other robotics applications. Although LiDAR data is acquired over time, most of the 3D object detection algorithms propose object bounding boxes independently for each frame and neglect the useful information available in the temporal domain. To address this problem, in this paper we propose a sparse LSTM-based multi-frame 3d object detection algorithm. We use a U-Net style 3D sparse convolution network to extract features for each frame's LiDAR point-cloud. These features are fed to the LSTM module together with the hidden and memory features from last frame to predict the 3d objects in the current frame as well as hidden and memory features that are passed to the next frame. Experiments on the Waymo Open Dataset show that our algorithm outperforms the traditional frame by frame approach by 7.5% mAP@0.7 and other multi-frame approaches by 1.2% while using less memory and computation per frame. To the best of our knowledge, this is the first work to use an LSTM for 3D object detection in sparse point clouds.

Citations (75)

Summary

The paper presents an LSTM-based approach that leverages temporal LiDAR data to enhance 3D object detection accuracy.
It integrates LSTM with a U-Net style 3D sparse convolutional backbone to efficiently process and align multi-frame point clouds.
The method delivers a 7.5% mAP improvement over single-frame models, highlighting its potential for advanced autonomous driving applications.

An LSTM Approach to Temporal 3D Object Detection in LiDAR Point Clouds

The paper addresses 3D object detection in LiDAR point clouds for autonomous driving. Its primary contribution is a Long Short-Term Memory (LSTM) framework that treats successive LiDAR frames as a temporal sequence, improving detection over conventional frame-by-frame processing.

Contextual Background

3D object detection from LiDAR data is crucial for numerous applications, particularly in autonomous vehicles. Traditional methods typically operate on a single-frame basis, which limits their ability to harness temporal information intrinsic to continuous data streams. The advent of publicly available datasets with multi-frame LiDAR sequences, such as the Waymo Open Dataset, invites exploration into multi-frame object detection to potentially increase accuracy and efficiency.

Methodology and Technical Insights

The authors propose a multi-frame detection approach utilizing LSTM networks integrated with a U-Net style 3D sparse convolutional backbone. This architecture allows for extraction of spatial features from each frame of LiDAR data. The LSTM is employed to encapsulate temporal sequences by retaining hidden and memory states across frames, allowing the model to leverage past information in predicting current frame objects.
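The recurrence described above can be illustrated with a minimal sketch. It is not the authors' implementation: in the paper the gates operate on sparse 3D feature maps, whereas here each frame is reduced to a single scalar feature and the gate weights (the hypothetical WEIGHTS dictionary) are plain numbers, purely to show how hidden and memory state pass from one frame to the next.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, w):
    """One standard LSTM update on a scalar feature (toy stand-in for a voxel feature)."""
    # Every gate sees the current frame's feature and the previous hidden state.
    i = sigmoid(w["wi"] * x + w["ui"] * h_prev + w["bi"])    # input gate
    f = sigmoid(w["wf"] * x + w["uf"] * h_prev + w["bf"])    # forget gate
    o = sigmoid(w["wo"] * x + w["uo"] * h_prev + w["bo"])    # output gate
    g = math.tanh(w["wg"] * x + w["ug"] * h_prev + w["bg"])  # candidate memory
    c = f * c_prev + i * g        # new memory state
    h = o * math.tanh(c)          # new hidden state
    return h, c

def run_sequence(frame_features, w):
    """Process frames in order, carrying (hidden, memory) across frames."""
    h, c = 0.0, 0.0               # no history before the first frame
    outputs = []
    for x in frame_features:
        h, c = lstm_step(x, h, c, w)
        outputs.append(h)         # a detection head would read this state
    return outputs, (h, c)

# Hypothetical weights chosen only for illustration: 0.5 for every
# input/recurrent weight and 0.0 for every bias.
WEIGHTS = {k: 0.5 for k in ("wi", "ui", "wf", "uf", "wo", "uo", "wg", "ug")}
WEIGHTS.update({k: 0.0 for k in ("bi", "bf", "bo", "bg")})

outputs, final_state = run_sequence([1.0, 1.0, 1.0], WEIGHTS)
```

Feeding the same feature three times yields a different output at each step, because the carried state changes what the cell produces; that is the sense in which the model uses past frames when predicting the current one.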

A distinguishing feature of this method is that the LSTM itself is built from sparse convolutional operations, so it can work directly on sparse point clouds. Current and past frames are voxelized jointly, which places their features on a common grid and keeps memory use efficient. The input features, hidden states, and memory states are restricted to regions where objects were confidently detected in earlier frames, thereby maintaining a manageable memory footprint.
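A rough sketch of the two ideas in this paragraph, shared voxelization and keeping only high-scoring state, might look like the following. All names (voxel_index, joint_voxelize, keep_top_k), the voxel size, and the score values are hypothetical, and the sketch assumes the previous frame's state positions are already expressed in the current frame's coordinates; the paper's actual procedure may differ.

```python
import math

def voxel_index(point, voxel_size):
    """Integer grid cell containing a 3D point."""
    return tuple(math.floor(coord / voxel_size) for coord in point)

def joint_voxelize(points, prev_state, voxel_size):
    """Put current points and carried-over state on one shared sparse grid.

    points: list of (x, y, z) for the current frame.
    prev_state: list of ((x, y, z), hidden_feature) from the previous frame.
    Returns {cell: {"count": points in cell, "hidden": [carried features]}}.
    Only occupied cells are stored, which is what makes the grid sparse.
    """
    grid = {}
    for p in points:
        cell = grid.setdefault(voxel_index(p, voxel_size),
                               {"count": 0, "hidden": []})
        cell["count"] += 1
    for pos, hidden in prev_state:
        cell = grid.setdefault(voxel_index(pos, voxel_size),
                               {"count": 0, "hidden": []})
        cell["hidden"].append(hidden)
    return grid

def keep_top_k(state, scores, k):
    """Keep the k state entries with the highest detection scores."""
    ranked = sorted(zip(scores, range(len(state))), reverse=True)
    keep = sorted(idx for _, idx in ranked[:k])   # preserve original order
    return [state[i] for i in keep]

points = [(0.1, 0.1, 0.1), (0.2, 0.1, 0.1), (1.5, 0.0, 0.0)]
prev_state = [((0.15, 0.1, 0.1), 0.7), ((3.2, 0.0, 0.0), 0.2)]

grid = joint_voxelize(points, prev_state, voxel_size=1.0)
pruned = keep_top_k(prev_state, scores=[0.9, 0.1], k=1)
```

Here the first carried feature lands in the same cell as two current points, so an LSTM operating on the grid would see input and history together at that location, while the low-scoring second entry is dropped before being passed on.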

Numerical Evaluation and Results

On the Waymo Open Dataset, the proposed model outperforms both single-frame and other multi-frame models. The LSTM-based method improves mean Average Precision (mAP) at an IoU threshold of 0.7 by 7.5% over the traditional frame-by-frame approach and by 1.2% over a strong baseline that concatenates multi-frame input.
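To make the 0.7 IoU threshold concrete, here is a small sketch of how a predicted box is matched against a ground-truth box. It assumes axis-aligned boxes given as (min corner, max corner); the benchmark itself uses oriented 3D boxes, so this is a simplification, and the function names are hypothetical.

```python
def box_volume(box):
    """Volume of an axis-aligned box ((x0, y0, z0), (x1, y1, z1))."""
    (x0, y0, z0), (x1, y1, z1) = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0) * max(0.0, z1 - z0)

def iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    lo = tuple(max(p, q) for p, q in zip(a[0], b[0]))  # overlap min corner
    hi = tuple(min(p, q) for p, q in zip(a[1], b[1]))  # overlap max corner
    inter = box_volume((lo, hi))                       # 0 if boxes are disjoint
    union = box_volume(a) + box_volume(b) - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, threshold=0.7):
    """A prediction counts as a match only if IoU reaches the threshold."""
    return iou_3d(pred, gt) >= threshold

gt = ((0.0, 0.0, 0.0), (4.0, 2.0, 2.0))            # 4 x 2 x 2 box
pred_close = ((0.5, 0.0, 0.0), (4.5, 2.0, 2.0))    # shifted 0.5 along x
pred_far = ((1.0, 0.0, 0.0), (5.0, 2.0, 2.0))      # shifted 1.0 along x
```

With these boxes, a 0.5 shift gives an IoU of 14/18 (about 0.78) and still counts as a match, while a 1.0 shift gives 12/20 = 0.6 and does not, which shows how strict the 0.7 threshold is for localization.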

Implications and Future Directions

This research makes a compelling case for adopting LSTM architectures in processing temporal LiDAR data sequences. By carrying hidden and memory features from frame to frame rather than reprocessing past point clouds, the method uses less memory and computation per frame than other multi-frame approaches, a meaningful step toward practical temporal 3D detection.

Theoretically, this research underscores the benefit of integrating recurrent architectures in 3D spatial domain applications, indicating that temporal dependencies can be captured effectively with a computationally feasible model. From a practical standpoint, these results suggest potential enhancements in the perception systems of autonomous vehicles, improving robustness in dynamic, real-world environments.

Looking forward, the authors suggest enhancements could be achieved by integrating scene flow predictions into the framework to refine memory state transformations further. Additionally, exploration into how LSTMs can universally improve other 3D object detection backbones could expand the applicability and impact of this approach across various neural architectures.

Overall, this paper sets a precedent for subsequent research aiming to leverage temporal dynamics in LiDAR data for improved autonomous navigation and beyond.
