- The paper introduces the Single-stride Sparse Transformer (SST), a novel architecture that bypasses conventional multi-stride downsampling to preserve fine detail in 3D LiDAR data.
- It replaces traditional convolutions with attention mechanisms tailored to sparse data, achieving a Level 1 pedestrian detection AP of 83.8 on the Waymo Open Dataset.
- The method is both scalable and computationally efficient, making it highly applicable for real-world autonomous driving scenarios.
Overview of "Embracing Single Stride 3D Object Detector with Sparse Transformer"
The paper addresses LiDAR-based 3D object detection for autonomous driving, a field that traditionally inherits its architectures from 2D detection systems. The authors target the inefficiencies and information loss introduced by the downsampling operations prevalent in multi-stride designs. They propose a novel Single-stride Sparse Transformer (SST) that maintains the original resolution of the input data throughout the network, aiming to improve detection of inherently small objects such as pedestrians.
Key Contributions
- Reevaluation of Downsampling Practices: Traditional 3D detectors inherit multi-scale architectures from 2D standards, whose aggressive downsampling discards spatial information. The authors argue this step is unnecessary in 3D: unlike in images, objects in a LiDAR scene are small relative to the scene extent and exhibit little scale variation, so a single-stride architecture can suffice.
- Single-stride Sparse Transformer (SST): SST is introduced as a robust alternative, replacing convolutions with attention mechanisms tailored for sparse data representations. By utilizing regional grouping and Sparse Regional Attention (SRA), SST preserves resolution and mitigates excessive computational demands.
- Performance and Efficiency: SST achieves state-of-the-art results on the Waymo Open Dataset, notably enhancing detection accuracy for small objects, such as pedestrians. The technique exhibits a Level 1 average precision (AP) of 83.8 in pedestrian detection, which outperforms existing models.
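The scale argument above can be made concrete with a back-of-the-envelope calculation. This is an illustrative sketch, not figures from the paper: the voxel size and pedestrian footprint below are typical values for voxel-based Waymo detectors, chosen only to show why small objects vanish under aggressive striding.

```python
# Illustrative sketch: how multi-stride downsampling erases small objects.
# Voxel size and pedestrian width are assumed typical values, not from the paper.
voxel_size_m = 0.32          # bird's-eye-view voxel edge length
pedestrian_width_m = 0.9     # rough footprint of a pedestrian

def cells_covered(object_size_m, voxel_size_m, stride):
    """Number of feature-map cells an object spans at a given stride."""
    return object_size_m / (voxel_size_m * stride)

for stride in (1, 2, 4, 8):
    cells = cells_covered(pedestrian_width_m, voxel_size_m, stride)
    print(f"stride {stride}: pedestrian spans ~{cells:.2f} cells")
```

At stride 1 the pedestrian spans roughly three cells, but at the stride 8 common in multi-stride backbones it collapses into a fraction of a single cell, leaving the detection head almost nothing to localize.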
Methodological Innovations
- Attention Mechanisms Over Convolutional Layers: The shift from convolutional to attention-based networks allows SST to handle large receptive fields efficiently without the computational burden typical of high-resolution convolutions in large-scale 3D detection tasks.
- Regional Grouping and Sparse Regional Attention: These strategies finely balance computational load and receptive field expansion. Local regions within the voxel space are established to apply self-attention efficiently, addressing scale and sparsity challenges.
- Adaptability and Integration with Existing Architectures: SST is designed to integrate seamlessly with existing detector heads and accommodate multi-frame data, enhancing its practical utility in real-world autonomous driving systems.
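The regional grouping and Sparse Regional Attention described above can be sketched as follows. This is a minimal NumPy illustration under simplifying assumptions, not the paper's implementation: it omits the learned query/key/value projections, multi-head structure, positional encodings, and region-shift step, and simply runs plain self-attention among the non-empty voxels that fall into the same fixed-size region.

```python
import numpy as np

def group_regions(coords, region_size):
    """Map each non-empty voxel to a region id via integer division."""
    region_ids = coords // region_size               # (N, 2) region indices
    # Encode the 2D region index as a single grouping key (illustrative only).
    return region_ids[:, 0] * 10_000 + region_ids[:, 1]

def sparse_regional_attention(coords, feats, region_size):
    """Self-attention restricted to voxels sharing a region.

    coords: (N, 2) integer bird's-eye-view coordinates of non-empty voxels.
    feats:  (N, C) voxel features.
    Returns (N, C) updated features; empty voxels are never materialized.
    """
    keys = group_regions(coords, region_size)
    out = np.empty_like(feats)
    for key in np.unique(keys):
        idx = np.where(keys == key)[0]
        x = feats[idx]                               # (n, C) voxels in one region
        # Plain scaled dot-product self-attention with Q = K = V = x
        # (the real model uses learned projections and multiple heads).
        scores = x @ x.T / np.sqrt(x.shape[1])
        scores -= scores.max(axis=1, keepdims=True)  # numerical stability
        attn = np.exp(scores)
        attn /= attn.sum(axis=1, keepdims=True)
        out[idx] = attn @ x
    return out
```

A voxel alone in its region attends only to itself and passes through unchanged, while voxels sharing a region exchange information; the cost scales with the number of non-empty voxels per region rather than with the full dense grid, which is what makes single-stride processing affordable.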
Empirical Findings and Implications
- Empirical Validation: Through comparative experiments, SST demonstrates superior performance, particularly in scenarios requiring fine detail resolution, such as the detection of small and distantly positioned pedestrians in sparse point clouds.
- Scalable and Modular: The architecture's modular nature allows for scalable improvements and adaptations to various 3D object detection scenarios, suggesting potential applications beyond autonomous driving.
- Computational Efficiency: Despite retaining high-resolution data throughout all processing stages, the approach remains computationally feasible on commodity hardware such as an NVIDIA RTX 2080Ti GPU, highlighting its applicability across a range of processing constraints.
Future Directions
- Enhanced Second-Stage Systems: While SST presents considerable advancements in single-stage detection, incorporating more sophisticated second-stage networks could bolster performance.
- Integration with Advanced Transformer Techniques: Incorporating contemporary advancements in transformer architectures could further improve the SST's efficiency and effectiveness.
In summary, this paper significantly contributes to the field of 3D object detection by challenging conventional multi-stride methodologies and leveraging the unique properties of transformer models tailored for sparse data. The proposed SST offers a compelling approach that balances computational practicality with enhanced detection accuracy, particularly suited to detecting small objects in sparse LiDAR point clouds. Future work could explore harmonizing these innovations with emerging transformer techniques and second-stage detection enhancements.