CenterFormer: Center-based Transformer for 3D Object Detection (2209.05588v1)

Published 12 Sep 2022 in cs.CV

Abstract: Query-based transformer has shown great potential in constructing long-range attention in many image-domain tasks, but has rarely been considered in LiDAR-based 3D object detection due to the overwhelming size of the point cloud data. In this paper, we propose CenterFormer, a center-based transformer network for 3D object detection. CenterFormer first uses a center heatmap to select center candidates on top of a standard voxel-based point cloud encoder. It then uses the feature of the center candidate as the query embedding in the transformer. To further aggregate features from multiple frames, we design an approach to fuse features through cross-attention. Lastly, regression heads are added to predict the bounding box on the output center feature representation. Our design reduces the convergence difficulty and computational complexity of the transformer structure. The results show significant improvements over the strong baseline of anchor-free object detection networks. CenterFormer achieves state-of-the-art performance for a single model on the Waymo Open Dataset, with 73.7% mAPH on the validation set and 75.6% mAPH on the test set, significantly outperforming all previously published CNN and transformer-based methods. Our code is publicly available at https://github.com/TuSimple/centerformer

CenterFormer: A Center-based Transformer Framework for Enhanced 3D Object Detection in LiDAR Data

The paper "CenterFormer: Center-based Transformer for 3D Object Detection" proposes a novel approach to improve the performance of 3D object detection in LiDAR point clouds by leveraging a transformer-based architecture. The proposed CenterFormer incorporates a center-based transformer network that utilizes a heatmap for center candidate selection on a voxel-based point cloud encoder. This research targets the improvement of both convergence rates and computational efficiency in LiDAR-based object detection tasks.

Summary of Contributions

Key contributions are made in the novel architecture design:

  1. Center-based Transformer Architecture: The paper introduces a center-based transformer architecture, where center candidates derived from a voxel-based point cloud encoder are utilized as the transformer query embeddings. This targets the challenge of efficiently processing large-scale LiDAR data with scattered and sparse points.
  2. Multi-scale Center Proposal Network: A multi-scale approach is used in processing the BEV (bird's eye view) representation of the LiDAR data, allowing for more detailed feature extraction. The authors implement attention mechanisms that efficiently aggregate relevant features, effectively capturing long-range dependencies crucial for accurate object detection.
  3. Cross-attention Layers for Feature Fusion: Multi-frame feature fusion via cross-attention demonstrates the transformer's ability to exploit temporal dependencies across sequential frames, which notably improves the detection of fast-moving objects (see the sketch after this list).
  4. State-of-the-art Performance: Experimental results underscore the model's efficacy, surpassing both traditional CNN-based methods and previous transformer-based designs. CenterFormer achieves state-of-the-art accuracy with 73.7% mAPH on the Waymo Open Dataset validation set and 75.6% mAPH on the test set.
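
As an illustration of the cross-attention fusion in contribution 3, the block below is a hedged sketch assuming a standard multi-head attention layer; the module and argument names are illustrative and not the paper's exact layer.

    import torch
    import torch.nn as nn

    class CrossFrameFusion(nn.Module):
        """Illustrative cross-attention fusion of multi-frame features."""
        def __init__(self, dim=256, num_heads=8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, center_queries, past_feats):
            # center_queries: (B, K, dim) features of current-frame center candidates.
            # past_feats:     (B, N, dim) flattened BEV features from previous frames.
            fused, _ = self.attn(query=center_queries, key=past_feats, value=past_feats)
            return self.norm(center_queries + fused)  # residual connection + norm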

Analysis and Implications

CenterFormer's reliance on a transformer architecture addresses several limitations of both conventional CNN-based and earlier transformer-based methods for 3D object detection. Transformers are adept at capturing global context through attention, making them well suited to the irregular distribution and long-range dependencies in LiDAR data. By focusing on center-based detection, the method also sidesteps complications of anchor-based methods, such as predefined anchor boxes that require extensive hyper-parameter tuning.

The improvements in mAPH (mean average precision weighted by heading) achieved by CenterFormer mark a significant stride toward effectively using multi-scale representations and multi-frame fusion in 3D object detection. Outperforming prior models across key metrics affirms the promise of transformer networks for extracting semantic structure from sparse point cloud data.
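
For context on the metric: mAPH extends mAP by weighting each true positive by the accuracy of its predicted heading. A common way to write this weight (a paraphrase of the Waymo Open Dataset metric, not a formula from this paper) is

    w \;=\; 1 - \frac{\min\big(|\tilde{\theta} - \theta|,\; 2\pi - |\tilde{\theta} - \theta|\big)}{\pi}

where the tilde denotes the predicted heading and θ the ground-truth heading in radians, so a perfectly predicted heading keeps full weight and an error of π contributes nothing.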

Future Directions

Although CenterFormer demonstrates impressive results, there is room for further improvement. Developing more lightweight transformers, particularly for real-time applications, remains a promising research direction. Additionally, extending CenterFormer to other sensor modalities could broaden its applicability in multi-modal perception systems for autonomous vehicles.

The role of transformers in 3D object detection is set to expand given the rapid advances in model optimization and parallel computing environments. It is plausible that transformers will increasingly underpin complex perception systems, improving the robustness and safety of autonomous systems operating in dynamic real-world environments.

Conclusion

The CenterFormer paper substantiates the utility of a transformer architecture tailored for 3D LiDAR object detection. Its methodological innovations and substantial performance gains argue strongly for the continued exploration of transformer models in 3D perception tasks. The work sets a precedent for further architectural refinement aimed at higher efficiency and tighter integration with real-world autonomous applications, and reinforces the trend toward attention mechanisms for parsing complex, high-dimensional LiDAR data.

Authors (5)
  1. Zixiang Zhou (22 papers)
  2. Xiangchen Zhao (2 papers)
  3. Yu Wang (939 papers)
  4. Panqu Wang (14 papers)
  5. Hassan Foroosh (48 papers)
Citations (113)