PIXOR: Real-time 3D Object Detection from Point Clouds (1902.06326v3)

Published 17 Feb 2019 in cs.CV

Abstract: We address the problem of real-time 3D object detection from point clouds in the context of autonomous driving. Computation speed is critical as detection is a necessary component for safety. Existing approaches are, however, expensive in computation due to high dimensionality of point clouds. We utilize the 3D data more efficiently by representing the scene from the Bird's Eye View (BEV), and propose PIXOR, a proposal-free, single-stage detector that outputs oriented 3D object estimates decoded from pixel-wise neural network predictions. The input representation, network architecture, and model optimization are especially designed to balance high accuracy and real-time efficiency. We validate PIXOR on two datasets: the KITTI BEV object detection benchmark, and a large-scale 3D vehicle detection benchmark. In both datasets we show that the proposed detector surpasses other state-of-the-art methods notably in terms of Average Precision (AP), while still runs at >28 FPS.

Authors (3)
  1. Bin Yang (320 papers)
  2. Wenjie Luo (19 papers)
  3. Raquel Urtasun (161 papers)
Citations (1,033)

Summary

  • The paper's main contribution is a single-stage, proposal-free architecture that leverages a bird's eye view representation for efficient 3D detection.
  • It employs a fully convolutional network with multi-task predictions using focal and modified smooth L1 losses to enhance accuracy and recall.
  • PIXOR achieves over 28 FPS and superior Average Precision on KITTI benchmarks, making it highly relevant for real-time autonomous driving applications.

Real-time 3D Object Detection from Point Clouds with PIXOR

The paper "PIXOR: Real-time 3D Object Detection from Point Clouds" by Bin Yang, Wenjie Luo, and Raquel Urtasun addresses the critical problem of real-time 3D object detection in the context of autonomous driving. Because both detection fidelity and computational efficiency are paramount for safety applications, the paper proposes an approach that emphasizes speed without compromising accuracy.

Key Contributions

The core contributions of the paper can be summarized as follows:

  1. Bird's Eye View (BEV) Representation: PIXOR leverages a BEV representation of LIDAR point clouds, exploiting its computational benefits while preserving the metric structure of the scene. This allows the network to exploit priors about object size and shape effectively.
  2. Single-Stage, Proposal-Free Architecture: PIXOR distinguishes itself by being a single-stage, proposal-free object detector. This design simplifies the architecture and improves computational efficiency. Unlike proposal-based methods, PIXOR conducts dense pixel-wise object prediction which inherently ensures a high recall rate.
  3. Network Architecture Design: The authors propose a fully convolutional network featuring a backbone and a header network. The backbone extracts feature representations, while the header handles multi-task predictions including object classification and precise localization in 3D space.
  4. Efficient Inference: Demonstrating that the approach can maintain high frame rates, PIXOR is capable of running over 28 FPS, making it highly suitable for real-time applications in self-driving systems.
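To make the proposal-free design concrete, the sketch below shows how one positive pixel's regression output can be decoded into an oriented BEV box. The six-channel layout (cos θ, sin θ, offsets dx/dy, log-width, log-length) follows the parameterization described in the paper; the grid resolution used here is an assumed illustrative value, not the paper's exact configuration.

```python
import numpy as np

def decode_pixel(pred, px, py, res=0.1):
    """Decode one positive pixel's 6-channel regression output
    (cos t, sin t, dx, dy, log w, log l) into an oriented BEV box.
    `res` is the assumed grid resolution in meters per pixel."""
    cos_t, sin_t, dx, dy, log_w, log_l = pred
    theta = np.arctan2(sin_t, cos_t)      # recover heading from its (cos, sin) encoding
    cx = px * res + dx                    # pixel location plus regressed offset (meters)
    cy = py * res + dy
    w, l = np.exp(log_w), np.exp(log_l)   # sizes are regressed in log space
    return cx, cy, w, l, theta

# Example: a pixel at grid cell (100, 50) predicting a unit-size, axis-aligned box
box = decode_pixel(np.array([1.0, 0.0, 0.2, -0.1, 0.0, 0.0]), 100, 50)
```

In the full pipeline, every pixel classified as an object yields one such box, and overlapping decoded boxes are merged with non-maximum suppression.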

Analytical Insights

Bird’s Eye View (BEV) Representation

The adoption of BEV for point cloud processing arises from the inefficiency of 3D voxel grids, which spend most of their capacity on sparse, empty space. BEV maintains computational efficiency by reducing the data to a 2D plane without significant information loss—critical for real-time performance. The input representation is compact, involving features like occupancy and intensity, which align well with the needs of object detection in autonomous driving.
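The occupancy-plus-intensity encoding can be sketched as follows. This is a minimal illustration, not the paper's exact configuration: the ranges, 0.1 m resolution, and number of height slices are assumed values chosen for readability.

```python
import numpy as np

def points_to_bev(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
                  z_range=(-3.0, 1.0), res=0.1, z_slices=8):
    """Discretize a LIDAR point cloud (N, 4: x, y, z, intensity) into a
    BEV tensor of binary occupancy slices plus one intensity channel.
    All ranges/resolutions here are illustrative assumptions."""
    H = int((y_range[1] - y_range[0]) / res)
    W = int((x_range[1] - x_range[0]) / res)
    bev = np.zeros((z_slices + 1, H, W), dtype=np.float32)

    x, y, z, i = points.T
    keep = ((x >= x_range[0]) & (x < x_range[1]) &
            (y >= y_range[0]) & (y < y_range[1]) &
            (z >= z_range[0]) & (z < z_range[1]))
    x, y, z, i = x[keep], y[keep], z[keep], i[keep]

    col = ((x - x_range[0]) / res).astype(int)
    row = ((y - y_range[0]) / res).astype(int)
    sl = ((z - z_range[0]) / (z_range[1] - z_range[0]) * z_slices).astype(int)
    sl = np.clip(sl, 0, z_slices - 1)

    bev[sl, row, col] = 1.0        # binary occupancy per height slice
    bev[z_slices, row, col] = i    # reflectance/intensity channel
    return bev
```

Because the output is a dense 2D tensor with channels, it can be fed directly to a standard 2D convolutional network, which is what makes the representation so cheap compared to 3D convolutions over voxels.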

Network and Loss Functions

The detection pipeline does not employ the traditional object proposals, which substantially reduces computational overhead. The absence of predefined object anchors simplifies the implementation and avoids hyperparameter tuning. The network design—composed of a backbone leveraging residual blocks and a top-down pathway for feature up-sampling—ensures robustness in representation.

Interestingly, the loss function integrates a modified smooth L1 loss for regression and a focal loss for classification to handle class imbalance. The introduction of a "decoding loss" during fine-tuning further enhances the model’s performance by directly optimizing box coordinates.
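A minimal numpy sketch of the two training losses is shown below. The default focal-loss hyperparameters (alpha, gamma) are common choices from the focal-loss literature, not necessarily the values used in the paper.

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss on per-pixel class probabilities p against labels y.
    The (1 - pt)^gamma factor down-weights easy examples, which matters
    because background pixels dominate the dense BEV grid."""
    pt = np.where(y == 1, p, 1 - p)           # probability of the true class
    a = np.where(y == 1, alpha, 1 - alpha)    # class-balancing weight
    return -(a * (1 - pt) ** gamma * np.log(np.clip(pt, 1e-12, 1.0))).mean()

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1 regression loss on box parameters: quadratic for small
    errors, linear for large ones, so outliers don't dominate gradients."""
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).mean()
```

In training, the focal loss is applied to every pixel of the classification map, while the smooth L1 loss is computed only on pixels inside ground-truth objects, over the regressed box parameters.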

Empirical Results

PIXOR demonstrates superior performance on the KITTI BEV object detection benchmark, achieving the highest Average Precision (AP) among state-of-the-art methods. It beats the closest competitor by a significant margin and is particularly strong in long-range detection scenarios, which is crucial for autonomous vehicles.

The paper's KITTI results table and fine-grained evaluation figure underpin PIXOR's consistent superiority across varied IoU thresholds and operational ranges, reinforcing its efficacy and robustness.

Practical and Theoretical Implications

From an applied perspective, PIXOR’s real-time capability ensures it can be readily integrated into the perception stack of autonomous vehicles. Its efficiency allows for deployment in embedded systems with computational constraints.

Theoretically, the paper paves the way for further simplifications in object detection architectures by eliminating intermediaries like region proposals. It encourages the exploration of efficient representations and end-to-end optimization strategies that could generalize to other domains of 3D object detection.

Future Developments

The results encourage a deeper dive into handling more complex environments and diverse object categories. Future works could explore enhancements in decoding processes, tailored loss functions, and hybrid architectures that might combine the strengths of both proposal-free and proposal-based systems.

In conclusion, the introduction and validation of PIXOR stand as a notable advance in real-time 3D object detection from point clouds. Its balance of computational efficiency and detection accuracy signals significant progress in optimizing perception systems for autonomous driving applications.