- The paper presents SphereFormer, using radial window self-attention to mitigate LiDAR data sparsity and enhance long-range feature aggregation.
- It introduces exponential splitting for fine-grained position encoding, significantly improving near-distance representation within spherical windows.
- Dynamic feature selection between local and global contexts yields top mIoU scores of 81.9% and 74.8% on nuScenes and SemanticKITTI datasets.
Detailed Analysis of "Spherical Transformer for LiDAR-based 3D Recognition"
The paper "Spherical Transformer for LiDAR-based 3D Recognition" addresses the challenges posed by the varying sparsity inherent in LiDAR data. The authors propose a novel attention-based architecture, termed SphereFormer, that improves 3D recognition performance by enhancing the aggregation of long-range information, particularly for sparse, distant points in LiDAR point clouds. This approach departs substantially from traditional methods, which fail to account for the non-uniform distribution of LiDAR-collected data.
Key Contributions
The paper's notable contributions center on three innovative components:
- Radial Window Self-Attention: SphereFormer uses spherical coordinates to partition 3D space into radially oriented windows. This design effectively addresses the issue of limited receptive fields in traditional methods, allowing aggregation of information from a broader range, specifically aiding in discerning sparse, distant points.
- Exponential Splitting for Position Encoding: The model introduces exponential splitting to convert relative positions into fine-grained indices for position encoding. Because the splitting interval grows exponentially with distance, near distances receive finer-grained bins, preserving precision when encoding positions within the long, thin radial windows.
- Dynamic Feature Selection: Acknowledging the varying information density across different distances from the LiDAR, the framework dynamically selects between local and global features. This ensures that sparse points, which lack local context, benefit from global context aggregation, thereby enhancing the accuracy of recognition tasks.
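The radial window partition described above can be sketched by converting Cartesian coordinates to spherical angles and binning only the angular components, so each window stretches from the sensor outward and groups near and far points along the same ray. The function name, bin counts, and flat-index scheme below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def radial_window_index(points, num_theta=8, num_phi=16):
    """Assign each 3D point to a radial window defined only by its
    spherical angles (theta, phi), ignoring radius r. All points that
    share an angular bin fall into one long, thin window extending
    from the sensor outward, so self-attention within a window can
    connect dense nearby points with sparse distant ones."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.sqrt(x**2 + y**2 + z**2)
    # Polar angle in [0, pi]; clip guards against rounding outside [-1, 1].
    theta = np.arccos(np.clip(z / np.maximum(r, 1e-8), -1.0, 1.0))
    phi = np.arctan2(y, x)  # azimuth in (-pi, pi]
    ti = np.minimum((theta / np.pi * num_theta).astype(int), num_theta - 1)
    pi_ = np.minimum(((phi + np.pi) / (2 * np.pi) * num_phi).astype(int), num_phi - 1)
    return ti * num_phi + pi_  # flat window id; attention runs within each id
```

Note that a near point and a far point on the same ray receive the same window id, which is precisely what lets radial attention aggregate long-range context for sparse distant points.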
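The exponential splitting idea can be illustrated with bin edges whose widths grow geometrically, so small relative offsets map to narrow bins (fine resolution near the query) while large offsets share wide bins. The parameters `base`, `min_interval`, and `num_bins` are hypothetical choices for illustration, not values from the paper:

```python
import numpy as np

def exp_split_index(rel_r, num_bins=12, base=2.0, min_interval=0.1):
    """Map a relative radial offset to a position-encoding bin index
    using exponentially growing interval widths (hypothetical sketch)."""
    # Interior edges at min_interval * base**k: [0.1, 0.2, 0.4, ...],
    # giving num_bins total bins with the finest resolution near zero.
    edges = min_interval * base ** np.arange(num_bins - 1)
    return np.searchsorted(edges, np.abs(rel_r))  # index in [0, num_bins - 1]
```

Under this scheme two offsets of 0.05 m and 0.15 m land in different bins, while offsets of 60 m and 100 m may share one, matching the intuition that near-distance precision matters most inside a radial window.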
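In the paper, dynamic feature selection is realized inside the attention design itself; as a simplified, hypothetical stand-in, the choice between a local branch and a radial (long-range) branch can be sketched as a per-point sigmoid gate. In a real model the gate logit would be predicted by a small learned layer; here it is just an input:

```python
import numpy as np

def dynamic_select(local_feat, radial_feat, gate_logit):
    """Blend local and radial features per point (hypothetical sketch).
    Sparse distant points, which lack local context, can weight the
    radial branch higher; dense nearby points can favor the local one."""
    g = 1.0 / (1.0 + np.exp(-gate_logit))[:, None]  # sigmoid gate in (0, 1)
    return g * local_feat + (1.0 - g) * radial_feat
```

A strongly positive logit selects (almost) pure local features and a strongly negative one selects the radial branch, so the network can interpolate smoothly between the two contexts point by point.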
Experimental Insights
The paper reports that SphereFormer achieves significant advances over existing benchmarks. On the nuScenes and SemanticKITTI semantic segmentation benchmarks, it ranked 1st with mIoU scores of 81.9% and 74.8%, respectively. These results show that the method not only improves overall performance but excels especially in scenarios involving distant, sparse points. The method also secured 3rd place on the nuScenes object detection benchmark, further demonstrating its versatility and robustness across 3D recognition tasks.
Implications and Future Directions
The implications of SphereFormer are far-reaching, as it potentially sets a new standard in 3D point cloud processing. The spherical attention mechanism could inspire further research into variable-density data interpretation, possibly extending beyond LiDAR to other domains with similar distribution patterns, such as sonar and radar data. Additionally, while the method shows outstanding empirical results, future research might explore computational efficiency and scalability, crucial for real-time applications in autonomous systems and robotics.
The adaptability of SphereFormer as a plugin module presents scope for seamless integration with existing neural architectures, potentially enhancing their performance across a myriad of computer vision tasks beyond 3D recognition.
Conclusion
The paper provides a thorough examination of how current 3D recognition methods fall short on LiDAR data and presents SphereFormer as a powerful alternative that effectively handles the varying sparsity of such datasets. By tailoring attention mechanisms to the spatial dynamics of LiDAR data and proposing innovative ways to encode spatial relationships, it opens new avenues for advancing 3D perception. At the intersection of Transformer models and LiDAR-based applications, SphereFormer bridges gaps in perception capability and sets a precedent for further research in the field.