PointPillars: Fast Encoders for Object Detection from Point Clouds (1812.05784v2)

Published 14 Dec 2018 in cs.LG, cs.CV, and stat.ML

Abstract: Object detection in point clouds is an important aspect of many robotics applications such as autonomous driving. In this paper we consider the problem of encoding a point cloud into a format appropriate for a downstream detection pipeline. Recent literature suggests two types of encoders; fixed encoders tend to be fast but sacrifice accuracy, while encoders that are learned from data are more accurate, but slower. In this work we propose PointPillars, a novel encoder which utilizes PointNets to learn a representation of point clouds organized in vertical columns (pillars). While the encoded features can be used with any standard 2D convolutional detection architecture, we further propose a lean downstream network. Extensive experimentation shows that PointPillars outperforms previous encoders with respect to both speed and accuracy by a large margin. Despite only using lidar, our full detection pipeline significantly outperforms the state of the art, even among fusion methods, with respect to both the 3D and bird's eye view KITTI benchmarks. This detection performance is achieved while running at 62 Hz: a 2 - 4 fold runtime improvement. A faster version of our method matches the state of the art at 105 Hz. These benchmarks suggest that PointPillars is an appropriate encoding for object detection in point clouds.

Summary

The paper introduces a novel pillar-based encoder that converts 3D point cloud data into an efficient 2D format for real-time processing.
It reports a runtime of 16.2 ms (62 Hz) and sets new accuracy marks on the KITTI benchmark, with a BEV mAP of 66.19 and a 3D mAP of 59.20.
Its end-to-end learning framework simplifies sensor pipelines by relying solely on lidar data, paving the way for efficient autonomous driving systems.

An Expert Analysis of "PointPillars: Fast Encoders for Object Detection from Point Clouds"

The paper "PointPillars: Fast Encoders for Object Detection from Point Clouds" by Alex H. Lang et al. proposes a novel encoder architecture for object detection from point clouds in autonomous driving applications. This essay gives an overview of the contributions, key findings, and broader implications of the proposed PointPillars method.

Introduction to the Method

PointPillars encodes a point cloud by partitioning it into vertical columns, termed "pillars," and learning a feature representation for each pillar with PointNets. This contrasts with previous work, which relies either on fixed encoders, which are fast but less accurate, or on learned encoders, which are more accurate but slower.

Technical Contributions

Encoder Design

Pillar-Based Encoding: The point cloud is divided into vertical columns, which simplifies the representation into a 2D format that can be processed by standard 2D convolutional neural networks (CNNs). This design eliminates the need for 3D convolutions, significantly enhancing computational efficiency.
End-to-End Learning: The encoder leverages PointNets to learn the features dynamically from the point cloud data, as opposed to using handcrafted features. This enables the system to generalize better to new scenarios without additional tuning.
Efficiency: All key operations within the PointPillars framework can be expressed as 2D convolutions, which map well onto the parallelism of GPUs. The paper's abstract reports a 2 to 4 fold runtime improvement over the previous state of the art, with the full pipeline running at 62 Hz.
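
The encoder steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (pillarize, encode_pillar, to_pseudo_image), the grid parameters, and the sample points are hypothetical, and the hand-set weight matrix stands in for a PointNet whose weights would normally be learned from data.

```python
from collections import defaultdict


def pillarize(points, x_range, y_range, pillar_size):
    """Group (x, y, z, reflectance) points by the x-y grid cell they fall in."""
    x_min, x_max = x_range
    y_min, y_max = y_range
    pillars = defaultdict(list)
    for x, y, z, r in points:
        if not (x_min <= x < x_max and y_min <= y < y_max):
            continue  # drop points outside the detection range
        col = int((x - x_min) // pillar_size)
        row = int((y - y_min) // pillar_size)
        pillars[(row, col)].append((x, y, z, r))
    return pillars


def encode_pillar(pts, weights):
    """Toy PointNet: shared linear layer + ReLU per point, then channel-wise max."""
    per_point = [
        [max(0.0, sum(w * v for w, v in zip(row, p))) for row in weights]
        for p in pts
    ]
    # Max over the points in the pillar, independently for each channel.
    return [max(channel) for channel in zip(*per_point)]


def to_pseudo_image(pillars, weights, height, width):
    """Scatter pillar features back to their grid cells: shape [C][H][W]."""
    channels = len(weights)
    image = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for (row, col), pts in pillars.items():
        for c, value in enumerate(encode_pillar(pts, weights)):
            image[c][row][col] = value
    return image


# Hypothetical sample data: four points, one of them out of range.
points = [
    (0.10, 0.10, 0.5, 0.2),
    (0.15, 0.05, 1.0, 0.4),
    (0.90, 0.30, 0.2, 0.1),
    (5.00, 5.00, 0.0, 0.0),
]
# Hand-set weights (an assumption): channel 0 reads z, channel 1 reads reflectance.
weights = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]

pillars = pillarize(points, x_range=(0.0, 1.0), y_range=(0.0, 1.0), pillar_size=0.5)
image = to_pseudo_image(pillars, weights, height=2, width=2)
print(image)
```

The result is a dense C x H x W pseudo-image in which empty pillars stay at zero, which is why a standard 2D convolutional backbone can consume it directly without any 3D convolutions.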

Experimental Validation

PointPillars is rigorously evaluated using the KITTI benchmark datasets, which require the detection of cars, pedestrians, and cyclists in urban environments. The method significantly outperforms existing approaches in both speed and accuracy:

Speed: PointPillars achieves a runtime of approximately 16.2 milliseconds, which translates to 62 Hz. The network's efficiency is particularly suited for real-time applications in autonomous driving.
Accuracy: On the KITTI datasets, PointPillars sets new standards, achieving a BEV (bird’s eye view) mean average precision (mAP) of 66.19 and a 3D mAP of 59.20. The model also demonstrates superior performance in the Average Orientation Similarity (AOS) metric.
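
The relation between the per-frame runtime and the quoted frame rate is simple arithmetic, shown here as a small sketch. The helper names are hypothetical; the 16.2 ms and 105 Hz figures are the ones quoted in this summary and in the abstract.

```python
def latency_ms_to_hz(latency_ms):
    """Frames per second for a given per-frame latency in milliseconds."""
    return 1000.0 / latency_ms


def hz_to_latency_ms(rate_hz):
    """Per-frame latency in milliseconds for a given frame rate."""
    return 1000.0 / rate_hz


full_rate = latency_ms_to_hz(16.2)     # about 61.7, reported as 62 Hz
fast_latency = hz_to_latency_ms(105)   # about 9.5 ms for the faster variant

print(f"{full_rate:.1f} Hz, {fast_latency:.2f} ms")
```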

Implications and Future Directions

The practical implications of PointPillars are considerable. The method’s speed and accuracy make it suitable for deployment in real-time object detection systems in autonomous vehicles. The encoder’s ability to work solely on lidar point clouds without needing image-based data fusion simplifies the sensor pipeline, reducing system complexity.

On the theoretical front, PointPillars opens avenues for further exploration in end-to-end learning for point cloud data. Given that the architecture can easily adapt to multiple sensor configurations, potential future developments could integrate additional sensor modalities such as radar. Moreover, the design principles of PointPillars could inspire new frameworks in other domains where efficient and accurate 3D object detection is crucial.

Conclusion

PointPillars marks a significant advancement in the field of 3D object detection from lidar point clouds, providing a high-performance, real-time solution suitable for autonomous driving applications. This paper not only presents an effective architecture but also sets a benchmark for future research in efficient point cloud processing. The release of the implementation code further facilitates the replication and extension of this work by the research community.

GitHub

GitHub - nutonomy/second.pytorch: PointPillars for KITTI object detection (1,003 stars)