Point Transformer V3: Simpler, Faster, Stronger (2312.10035v2)
Abstract: This paper is not motivated to seek innovation within the attention mechanism. Instead, it focuses on overcoming the existing trade-offs between accuracy and efficiency within the context of point cloud processing, leveraging the power of scale. Drawing inspiration from recent advances in 3D large-scale representation learning, we recognize that model performance is more influenced by scale than by intricate design. Therefore, we present Point Transformer V3 (PTv3), which prioritizes simplicity and efficiency over the accuracy of certain mechanisms that are minor to the overall performance after scaling, such as replacing the precise neighbor search by KNN with an efficient serialized neighbor mapping of point clouds organized with specific patterns. This principle enables significant scaling, expanding the receptive field from 16 to 1024 points while remaining efficient (a 3x increase in processing speed and a 10x improvement in memory efficiency compared with its predecessor, PTv2). PTv3 attains state-of-the-art results on over 20 downstream tasks that span both indoor and outdoor scenarios. Further enhanced with multi-dataset joint training, PTv3 pushes these results to a higher level.
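The abstract's central trade, replacing a precise KNN search with a serialized neighbor mapping, can be illustrated with a minimal sketch. It assumes a z-order (Morton) curve as the "specific pattern" and fixed-size patches as the attention groups. The function names, the `grid_size`, `patch_size`, and `bits` parameters, and the padding rule are all hypothetical choices made for illustration; this is not the authors' implementation.

```python
import numpy as np


def morton_code(grid: np.ndarray, bits: int = 16) -> np.ndarray:
    """Interleave the low `bits` bits of integer (x, y, z) coordinates into one z-order key.

    Assumes non-negative coordinates below 2**bits; higher bits are ignored.
    """
    grid = grid.astype(np.int64)
    code = np.zeros(len(grid), dtype=np.int64)
    for b in range(bits):
        for axis in range(3):
            bit = (grid[:, axis] >> b) & 1
            code |= bit << (3 * b + axis)
    return code


def serialize_into_patches(
    points: np.ndarray, grid_size: float = 0.05, patch_size: int = 4
) -> np.ndarray:
    """Sort points along a z-order curve and split the ordering into fixed-size patches.

    Returns an array of point indices with shape (num_patches, patch_size).
    """
    # Quantize continuous coordinates onto a non-negative integer grid.
    grid = np.floor((points - points.min(axis=0)) / grid_size).astype(np.int64)
    # Points that are close in space tend to be close in the sorted order.
    order = np.argsort(morton_code(grid), kind="stable")
    # Illustrative padding (an assumption): repeat trailing indices so the
    # last patch is full.
    pad = (-len(order)) % patch_size
    if pad:
        order = np.concatenate([order, order[-pad:]])
    return order.reshape(-1, patch_size)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pts = rng.uniform(0.0, 1.0, size=(10, 3))
    print(serialize_into_patches(pts, grid_size=0.1, patch_size=4))
```

Each row of the result is a group of points that would attend to one another. The neighbors are only approximately the nearest ones, which is the precision the abstract says is given up in exchange for a single sort instead of a per-point search.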
Authors:
- Xiaoyang Wu
- Li Jiang
- Peng-Shuai Wang
- Zhijian Liu
- Xihui Liu
- Yu Qiao
- Wanli Ouyang
- Tong He
- Hengshuang Zhao