
DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets (2301.06051v2)

Published 15 Jan 2023 in cs.CV

Abstract: Designing an efficient yet deployment-friendly 3D backbone to handle sparse point clouds is a fundamental problem in 3D perception. Compared with the customized sparse convolution, the attention mechanism in Transformers is more appropriate for flexibly modeling long-range relationships and is easier to be deployed in real-world applications. However, due to the sparse characteristics of point clouds, it is non-trivial to apply a standard transformer on sparse points. In this paper, we present Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D perception. In order to efficiently process sparse points in parallel, we propose Dynamic Sparse Window Attention, which partitions a series of local regions in each window according to its sparsity and then computes the features of all regions in a fully parallel manner. To allow the cross-set connection, we design a rotated set partitioning strategy that alternates between two partitioning configurations in consecutive self-attention layers. To support effective downsampling and better encode geometric information, we also propose an attention-style 3D pooling module on sparse points, which is powerful and deployment-friendly without utilizing any customized CUDA operations. Our model achieves state-of-the-art performance with a broad range of 3D perception tasks. More importantly, DSVT can be easily deployed by TensorRT with real-time inference speed (27Hz). Code will be available at \url{https://github.com/Haiyang-W/DSVT}.

Authors (8)
  1. Haiyang Wang (47 papers)
  2. Chen Shi (55 papers)
  3. Shaoshuai Shi (39 papers)
  4. Meng Lei (8 papers)
  5. Sen Wang (164 papers)
  6. Di He (108 papers)
  7. Bernt Schiele (210 papers)
  8. Liwei Wang (239 papers)
Citations (89)

Summary

Dynamic Sparse Voxel Transformer: An Analysis

The paper introduces the Dynamic Sparse Voxel Transformer (DSVT), a Transformer-based backbone designed to make 3D perception on sparse point clouds both efficient and deployment-friendly. The work targets the challenges posed by the sparse, irregular structure of 3D point cloud data, which is common in autonomous driving and robotics.

Key Contributions

  1. Dynamic Sparse Window Attention: The paper proposes a mechanism that handles sparse voxels through a dynamic set partitioning strategy. Within each window, non-empty voxels are grouped into equally sized subsets, so attention over all subsets can be computed fully in parallel without customized operations. A rotated set partitioning strategy alternates between two partitioning configurations across consecutive self-attention layers, enabling cross-set connections within each window (see the first sketch after this list).
  2. 3D Pooling Module: An attention-style 3D pooling operation is introduced to downsample sparse voxels effectively. The module encodes geometric information during downsampling without customized CUDA operations, which eases practical deployment (see the second sketch after this list).
  3. Transformer Backbone: DSVT serves as an efficient backbone for 3D perception tasks and is compatible with well-optimized inference frameworks such as TensorRT. Deployed this way, it runs at real-time speed (27 Hz) while delivering state-of-the-art results across a range of tasks and datasets.
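
To make the set-partitioning idea concrete, here is a minimal PyTorch sketch of dynamic set partitioning with a rotating sort axis. It is an illustration under assumptions, not the paper's released implementation: the window assignment is taken as given, `SET_SIZE` and the per-window Python loop are chosen for clarity (the actual method computes set assignments fully in parallel on the GPU), and padded duplicate tokens are left unmasked.

```python
import torch
import torch.nn as nn

SET_SIZE = 36  # illustrative; the paper fixes a token budget per set


def partition_order(coords, window_ids, axis):
    """Return voxel indices grouped by window and sorted along `axis`.

    Alternating `axis` between consecutive layers yields the rotated-set
    partitioning that lets information flow across sets.
    """
    # Stable two-key sort: sort by the chosen axis first, then stable-sort
    # by window id so each window's voxels stay contiguous but axis-ordered.
    order = torch.argsort(coords[:, axis], stable=True)
    order = order[torch.argsort(window_ids[order], stable=True)]
    return order


def make_sets(order, window_ids, set_size):
    """Chunk each window's sorted voxel indices into equally sized sets,
    padding by repeating the last index. The Python loop is for clarity;
    the real method builds this assignment without any host-side loop."""
    sets = []
    for wid in torch.unique(window_ids):
        idx = order[window_ids[order] == wid]
        pad = (-idx.numel()) % set_size
        if pad:
            idx = torch.cat([idx, idx[-1:].expand(pad)])
        sets.append(idx.view(-1, set_size))
    return torch.cat(sets, dim=0)  # (num_sets, set_size)


class SetAttentionLayer(nn.Module):
    """Standard self-attention run independently over every set at once."""

    def __init__(self, dim=128, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats, sets):
        x = feats[sets]                      # (num_sets, set_size, dim)
        out, _ = self.attn(x, x, x)
        out = self.norm(x + out)
        # Scatter set outputs back to per-voxel features; duplicated
        # padding indices simply overwrite each other harmlessly.
        new_feats = feats.clone()
        new_feats[sets.reshape(-1)] = out.reshape(-1, out.shape[-1])
        return new_feats


# Usage: rotate the partition axis (x, then y) across consecutive layers,
# so voxels that shared a set in one layer mix with different voxels next.
feats = torch.randn(500, 128)              # per-voxel features
coords = torch.randint(0, 30, (500, 3))    # voxel (x, y, z) coordinates
window_ids = torch.randint(0, 12, (500,))  # precomputed window assignment
layer = SetAttentionLayer()
for axis in (0, 1, 0, 1):
    order = partition_order(coords, window_ids, axis)
    sets = make_sets(order, window_ids, SET_SIZE)
    feats = layer(feats, sets)
```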
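For the pooling module, the following sketch shows one plausible attention-style pooling built only from standard PyTorch ops, in keeping with the paper's no-custom-CUDA goal: each region's max-pooled feature serves as a single query attending over that region's voxels. The region grouping and striding logic is omitted, and the max-pooled query is an assumption for illustration; consult the released code for the exact design.

```python
import torch
import torch.nn as nn


class AttentionPool3D(nn.Module):
    """Attention-style pooling sketch: a max-pooled query summarizes each
    region while attention keeps the summary geometry-aware. Uses only
    standard ops, so it exports cleanly (e.g., via ONNX to TensorRT)."""

    def __init__(self, dim=128, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, region_feats, padding_mask=None):
        # region_feats: (num_regions, max_voxels_per_region, dim), zero-padded.
        # padding_mask: optional (num_regions, max_voxels) bools, True = pad.
        query = region_feats.max(dim=1, keepdim=True).values  # (R, 1, dim)
        pooled, _ = self.attn(query, region_feats, region_feats,
                              key_padding_mask=padding_mask)
        return pooled.squeeze(1)                              # (R, dim)


pool = AttentionPool3D()
regions = torch.randn(64, 8, 128)  # 64 pooling regions, up to 8 voxels each
out = pool(regions)                # (64, 128): one token per region
```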

Experimental Insights

The research demonstrates DSVT's performance on large-scale datasets such as Waymo and nuScenes. Notable results include:

  • On Waymo, the single-frame DSVT-V model reaches 72.1 mAPH at the L2 difficulty level, outperforming previous one-stage and two-stage methods.
  • The model maintains superior detection accuracy across multi-frame settings, showcasing its robustness.
  • On the nuScenes dataset, DSVT achieves top performance with 72.7 test NDS and 68.4 mAP, surpassing existing approaches.

Implications and Future Directions

DSVT's deployment efficiency without custom CUDA operations is a significant advancement, suggesting wide applicability in real-world autonomous systems. Its ability to serve as a drop-in replacement for existing 3D backbones further underscores its practical relevance. The attention to both theoretical and practical aspects makes DSVT not only a performant architecture but also a deployable solution for industry applications.

Future work might explore extending DSVT's capabilities to more general-purpose 3D applications beyond outdoor perception, adapting its components to various data distributions. Additionally, investigating the integration with multi-modal systems and further optimizing inference speeds could be potential avenues for research.

In summary, the Dynamic Sparse Voxel Transformer presents a significant step in aligning advanced 3D deep learning techniques with real-world deployment needs, balancing performance and practicality effectively.