Efficient Point Transformer with Dynamic Token Aggregating for Point Cloud Processing (2405.15827v1)
Abstract: Point cloud processing and analysis have recently made great progress thanks to the development of 3D Transformers. However, existing 3D Transformer methods are usually computationally expensive and inefficient because of their huge, redundant attention maps. They also tend to be slow because they require time-consuming point cloud sampling and grouping. To address these issues, we propose an efficient point TransFormer with Dynamic Token Aggregating (DTA-Former) for point cloud representation and processing. First, we propose an efficient Learnable Token Sparsification (LTS) block, which considers both local and global semantic information to adaptively select key tokens. Second, to aggregate features for the sparsified tokens, we present the first Dynamic Token Aggregating (DTA) block in the 3D Transformer paradigm, which provides our model with strong aggregated features while preventing information loss. Next, a dual-attention Transformer-based Global Feature Enhancement (GFE) block improves the representation capability of the model. Equipped with the LTS, DTA, and GFE blocks, DTA-Former achieves excellent classification results via hierarchical feature learning. Lastly, a novel Iterative Token Reconstruction (ITR) block is introduced for dense prediction, in which the semantic features of tokens and their semantic relationships are gradually optimized during iterative reconstruction. Based on ITR, we propose a new W-net architecture, which is more suitable for Transformer-based feature learning than the common U-net design. Extensive experiments demonstrate the superiority of our method: it achieves SOTA performance while running up to 30$\times$ faster than prior point Transformers on the ModelNet40, ShapeNet, and airborne MultiSpectral LiDAR (MS-LiDAR) datasets.
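The general idea of selecting a few key tokens and then aggregating features from all tokens into them can be illustrated with a toy sketch. This is not the paper's LTS or DTA block: the linear scorer `score_weights`, the hard top-k selection, and the plain scaled dot-product attention are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sparsify_tokens(tokens, score_weights, keep):
    """Assumed stand-in for a learned scorer: rank tokens by a linear
    score and return the indices of the `keep` best, in original order."""
    scores = [dot(t, score_weights) for t in tokens]
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


def aggregate_tokens(tokens, kept_idx):
    """Each kept token attends over ALL tokens, so features of the
    dropped tokens still flow into the reduced token set."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    aggregated = []
    for i in kept_idx:
        weights = softmax([dot(tokens[i], t) * scale for t in tokens])
        aggregated.append(
            [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]
        )
    return aggregated


# Four toy 2-D token features; the scorer favours a large first component.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
score_weights = [1.0, -1.0]

kept = sparsify_tokens(tokens, score_weights, keep=2)
aggregated = aggregate_tokens(tokens, kept)
```

In this sketch the attention map has shape (kept tokens × all tokens) rather than (all tokens × all tokens), which is the kind of saving that token reduction aims for. A trainable version would need a differentiable selection step in place of the hard top-k used here.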