ScatterFormer: Efficient Voxel Transformer with Scattered Linear Attention (2401.00912v2)
Abstract: Window-based transformers excel in large-scale point cloud understanding by capturing context-aware representations with affordable, localized attention computation. However, the sparse nature of point clouds leads to a significant variance in the number of voxels per window. Existing methods group the voxels in each window into fixed-length sequences through extensive sorting and padding operations, resulting in a non-negligible computational and memory overhead. In this paper, we introduce ScatterFormer, which, to the best of our knowledge, is the first to directly apply attention to voxels across different windows as a single sequence. The key component of ScatterFormer is a Scattered Linear Attention (SLA) module, which leverages the pre-computation of key-value pairs in linear attention to enable parallel computation on the variable-length voxel sequences divided by windows. Leveraging the hierarchical structure of GPUs and shared memory, we propose a chunk-wise algorithm that reduces the SLA module's latency to less than 1 millisecond on moderate GPUs. Furthermore, we develop a cross-window interaction module that improves the locality and connectivity of voxel features across different windows, eliminating the need for extensive window shifting. Our proposed ScatterFormer achieves 73.8 mAP (L2) on the Waymo Open Dataset and 72.4 NDS on the NuScenes dataset, running at a detection rate of 23 FPS. The code is available at https://github.com/skyhehe123/ScatterFormer.
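The idea of pre-computing key-value pairs per window, so that voxels from windows of different sizes can be processed as one flat sequence without sorting or padding, can be illustrated with a minimal NumPy sketch. This is not the authors' chunk-wise GPU kernel: the `elu(x) + 1` feature map (borrowed from the generic linear-attention formulation), the function names, and the tensor layout are all assumptions made for illustration.

```python
import numpy as np


def feature_map(x):
    # Assumed positive feature map elu(x) + 1; the abstract does not specify one.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def scattered_linear_attention(q, k, v, window_ids, num_windows):
    """Linear attention restricted to windows, computed on one flat sequence.

    q, k: (N, d) queries/keys for all voxels; v: (N, dv) values;
    window_ids: (N,) integer window index of each voxel.
    Hypothetical helper, not the ScatterFormer API.
    """
    phi_q, phi_k = feature_map(q), feature_map(k)
    d, dv = q.shape[1], v.shape[1]

    # Scatter step: accumulate one (d, dv) key-value summary and one (d,)
    # normalizer per window, regardless of how many voxels each window holds.
    kv = np.zeros((num_windows, d, dv))
    z = np.zeros((num_windows, d))
    np.add.at(kv, window_ids, phi_k[:, :, None] * v[:, None, :])
    np.add.at(z, window_ids, phi_k)

    # Gather step: each voxel reads the summary of its own window.
    num = np.einsum("nd,nde->ne", phi_q, kv[window_ids])
    den = np.einsum("nd,nd->n", phi_q, z[window_ids])
    return num / den[:, None]


# Usage: 10 voxels unevenly spread over 3 windows, no padding needed.
rng = np.random.default_rng(0)
window_ids = np.array([0, 0, 0, 0, 0, 1, 2, 2, 2, 1])
q, k = rng.normal(size=(10, 4)), rng.normal(size=(10, 4))
v = rng.normal(size=(10, 6))
out = scattered_linear_attention(q, k, v, window_ids, num_windows=3)
```

Because the per-window summaries have a fixed size, the cost grows linearly with the number of voxels, and windows with very different voxel counts need no special handling.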
Authors: Chenhang He, Ruihuang Li, Guowen Zhang, Lei Zhang