
Point Transformer V3: Simpler, Faster, Stronger (2312.10035v2)

Published 15 Dec 2023 in cs.CV

Abstract: This paper is not motivated to seek innovation within the attention mechanism. Instead, it focuses on overcoming the existing trade-offs between accuracy and efficiency within the context of point cloud processing, leveraging the power of scale. Drawing inspiration from recent advances in 3D large-scale representation learning, we recognize that model performance is more influenced by scale than by intricate design. Therefore, we present Point Transformer V3 (PTv3), which prioritizes simplicity and efficiency over the accuracy of certain mechanisms that are minor to the overall performance after scaling, such as replacing the precise neighbor search by KNN with an efficient serialized neighbor mapping of point clouds organized with specific patterns. This principle enables significant scaling, expanding the receptive field from 16 to 1024 points while remaining efficient (a 3x increase in processing speed and a 10x improvement in memory efficiency compared with its predecessor, PTv2). PTv3 attains state-of-the-art results on over 20 downstream tasks that span both indoor and outdoor scenarios. Further enhanced with multi-dataset joint training, PTv3 pushes these results to a higher level.

References (102)
  1. Ext5: Towards extreme multi-task scaling for transfer learning. In ICLR, 2022.
  2. 3d semantic parsing of large-scale indoor spaces. In CVPR, 2016.
  3. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In ICCV, 2019.
  4. The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In CVPR, pages 4413–4421, 2018.
  5. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020.
  6. Emerging properties in self-supervised vision transformers. In CVPR, 2021.
  7. Multi-view 3d object detection network for autonomous driving. In CVPR, 2017.
  8. Largekernel3d: Scaling up kernels in 3d sparse cnns. In CVPR, 2023.
  9. (af)2-s3net: Attentive feature fusion with adaptive feature selection for sparse semantic segmentation network. In CVPR, 2021.
  10. A unified point-based framework for 3d segmentation. In 3DV, 2019.
  11. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
  12. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In CVPR, 2019.
  13. Conditional positional encodings for vision transformers. arXiv:2102.10882, 2021.
  14. Pointcept Contributors. Pointcept: A codebase for point cloud perception research. https://github.com/Pointcept/Pointcept, 2023.
  15. 3dmv: Joint 3d-multi-view prediction for 3d semantic scene segmentation. In ECCV, 2018.
  16. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017.
  17. Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv:2307.08691, 2023.
  18. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In NeurIPS, 2022.
  19. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In CVPR, 2022.
  20. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
  21. Embracing single stride 3d object detector with sparse transformer. In CVPR, 2022.
  22. Self-supervised pretraining of visual features in the wild. arXiv:2103.01988, 2021.
  23. 3d semantic segmentation with submanifold sparse convolutional networks. In CVPR, 2018.
  24. Guangsheng Shi, Ruifeng Li, Chao Ma. Pillarnet: Real-time and high-performance pillar-based 3d object detection. In ECCV, 2022.
  25. Pct: Point cloud transformer. Computational Visual Media, 2021.
  26. Voxel set transformer: A set-to-set approach to 3d object detection from point clouds. In CVPR, 2022.
  27. Über die stetige abbildung einer linie auf ein flächenstück. Dritter Band: Analysis· Grundlagen der Mathematik· Physik Verschiedenes: Nebst Einer Lebensgeschichte, 1935.
  28. Exploring data-efficient 3d scene understanding with contrastive scene contexts. In CVPR, 2021.
  29. Point-to-voxel knowledge distillation for lidar semantic segmentation. In CVPR, 2022.
  30. Randla-net: Efficient semantic segmentation of large-scale point clouds. In CVPR, 2020a.
  31. Jsenet: Joint semantic segmentation and edge detection network for 3d point clouds. In ECCV, 2020b.
  32. Hierarchical point-edge interaction network for point cloud semantic segmentation. In ICCV, 2019.
  33. Pointgroup: Dual-set point grouping for 3d instance segmentation. CVPR, 2020.
  34. Self-supervised pre-training with masked shape prediction for 3d scene understanding. In CVPR, 2023.
  35. Scaling laws for neural language models. arXiv:2001.08361, 2020.
  36. Segment anything. In ICCV, 2023.
  37. Rethinking range view representation for lidar segmentation. In ICCV, 2023.
  38. Stratified transformer for 3d point cloud segmentation. In CVPR, 2022.
  39. Spherical transformer for lidar-based 3d recognition. In CVPR, 2023.
  40. Large-scale point cloud semantic segmentation with superpoint graphs. In CVPR, 2018.
  41. Pointpillars: Fast encoders for object detection from point clouds. In CVPR, 2019.
  42. Seggcn: Efficient 3d point cloud segmentation with fuzzy spherical kernel. In CVPR, 2020.
  43. Vehicle detection from 3d lidar using fully convolutional network. In RSS, 2016.
  44. Pointcnn: Convolution on x-transformed points. NeurIPS, 2018.
  45. Meta architecture for point cloud analysis. In CVPR, pages 17682–17691, 2023.
  46. Swin transformer: Hierarchical vision transformer using shifted windows. ICCV, 2021.
  47. Swin transformer v2: Scaling up capacity and resolution. In CVPR, 2022.
  48. Flatformer: Flattened window attention for efficient point cloud transformer. In CVPR, 2023.
  49. Rethinking network design and local geometry in point cloud: A simple residual mlp framework. ICLR, 2022.
  50. Voxnet: A 3d convolutional neural network for real-time object recognition. In IROS, 2015.
  51. Guy M Morton. A computer oriented geodetic data base and a new technique in file sequencing. International Business Machines Company New York, 1966.
  52. Panopticfusion: Online volumetric semantic mapping at the level of stuff and things. In IROS, 2019.
  53. OpenAI. Gpt-4 technical report. arXiv:2303.08774, 2023.
  54. Masked autoencoders for point cloud self-supervised learning. In ECCV, 2022.
  55. Fast point transformer. In CVPR, pages 16949–16958, 2022.
  56. Sur une courbe, qui remplit toute une aire plane. Springer, 1990.
  57. Using a waffle iron for automotive point cloud semantic segmentation. In ICCV, 2023.
  58. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017a.
  59. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, 2017b.
  60. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. NeurIPS, 2022.
  61. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.
  62. Language-grounded indoor 3d semantic segmentation in the wild. In ECCV, 2022.
  63. Aditya Sanghi. Info3d: Representation learning on 3d objects using mutual information maximization and contrastive learning. In ECCV, 2020.
  64. Self-supervised deep learning on point clouds by reconstructing space. In NeurIPS, 2019.
  65. Semantic scene completion from a single depth image. In CVPR, 2017.
  66. Multi-view convolutional neural networks for 3d shape recognition. In ICCV, 2015.
  67. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2020.
  68. Searching efficient 3d architectures with sparse point-voxel convolution. In ECCV, 2020.
  69. Tangent convolutions for dense prediction in 3d. In CVPR, 2018.
  70. Segcloud: Semantic segmentation of 3d point clouds. In 3DV, 2017.
  71. OpenPCDet Development Team. Openpcdet: An open-source toolbox for 3d object detection from point clouds. https://github.com/open-mmlab/OpenPCDet, 2020.
  72. Kpconv: Flexible and deformable convolution for point clouds. In ICCV, 2019.
  73. Divide and contrast: Self-supervised learning from uncurated data. In CVPR, 2021.
  74. Llama: Open and efficient foundation language models. arXiv:2302.13971, 2023.
  75. Attention is all you need. In NeurIPS, 2017.
  76. Graph attention convolution for point cloud semantic segmentation. In CVPR, 2019.
  77. Peng-Shuai Wang. Octformer: Octree-based transformers for 3D point clouds. SIGGRAPH, 2023.
  78. O-CNN: Octree-based convolutional neural networks for 3D shape analysis. SIGGRAPH, 36(4), 2017.
  79. Deep parametric continuous convolutional neural networks. In CVPR, 2018.
  80. Images speak in images: A generalist painter for in-context visual learning. In CVPR, 2023.
  81. Deep closest point: Learning representations for point cloud registration. In ICCV, 2019.
  82. Pointconv: Deep convolutional networks on 3d point clouds. In CVPR, 2019.
  83. Pointconvformer: Revenge of the point-based convolution. In CVPR, pages 21802–21813, 2023a.
  84. Point transformer v2: Grouped vector attention and partition-based pooling. In NeurIPS, 2022.
  85. Towards large-scale 3d representation learning with multi-dataset point prompt training. arXiv:2308.09718, 2023b.
  86. Masked scene contrast: A scalable framework for unsupervised 3d representation learning. In CVPR, 2023c.
  87. Efficient streaming language models with attention sinks. arXiv, 2023.
  88. Pointcontrast: Unsupervised pre-training for 3d point cloud understanding. In ECCV, 2020.
  89. On layer normalization in the transformer architecture. In ICML, 2020.
  90. Paconv: Position adaptive convolution with dynamic kernel assembling on point clouds. In CVPR, 2021.
  91. Pointasnl: Robust point clouds processing using nonlocal neural networks with adaptive sampling. In CVPR, 2020.
  92. 2dpass: 2d priors assisted semantic segmentation on lidar point clouds. In ECCV, 2022.
  93. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
  94. Modeling point clouds with self-attention and gumbel subset sampling. In CVPR, 2019.
  95. Swin3d: A pretrained transformer backbone for 3d indoor scene understanding. arXiv:2304.06906, 2023.
  96. Center-based 3d object detection and tracking. In CVPR, 2021.
  97. Point-BERT: Pre-training 3D point cloud transformers with masked point modeling. In CVPR, 2022.
  98. Deep fusionnet for point cloud semantic segmentation. In ECCV, 2020.
  99. Pointweb: Enhancing local neighborhood features for point cloud processing. In CVPR, 2019.
  100. Point transformer. In ICCV, 2021.
  101. Ponderv2: Pave the way for 3d foundation model with a universal pre-training paradigm. arXiv:2310.08586, 2023.
  102. Cylindrical and asymmetrical 3d convolution networks for lidar segmentation. In CVPR, 2021.
Authors (9)
  1. Xiaoyang Wu (28 papers)
  2. Li Jiang (88 papers)
  3. Peng-Shuai Wang (24 papers)
  4. Zhijian Liu (41 papers)
  5. Xihui Liu (92 papers)
  6. Yu Qiao (563 papers)
  7. Wanli Ouyang (358 papers)
  8. Tong He (124 papers)
  9. Hengshuang Zhao (118 papers)
Citations (104)

Summary

This paper introduces Point Transformer V3 (PTv3), a novel architecture for 3D point cloud processing that prioritizes simplicity and efficiency to achieve scalability and, consequently, stronger performance. The core idea is that model performance is more influenced by scale (dataset size, model parameters, receptive field, compute) than by intricate, often computationally expensive, design choices. PTv3 makes several key design changes to its predecessor (PTv2) to enhance speed and reduce memory consumption, allowing it to process larger receptive fields (up to 1024 points from 16 in PTv2) and achieve state-of-the-art results on over 20 downstream tasks.

Key Design Principles and Innovations

The development of PTv3 is guided by the "scaling principle," which suggests trading the marginal accuracy gains from complex mechanisms for significant improvements in simplicity and efficiency. This enables the model to scale effectively, and the performance lost from simplification is often regained or surpassed through this enhanced scalability.

The main adaptations in PTv3 include:

  1. Point Cloud Serialization:
    • Replaces K-Nearest Neighbors (KNN): Instead of costly KNN searches (which accounted for 28% of PTv2's forward time) for defining local neighborhoods, PTv3 serializes point clouds. This involves organizing points according to specific patterns, primarily using space-filling curves like Z-order and Hilbert curves (and their transposed variants: Trans Z-order, Trans Hilbert).
    • Serialized Encoding: Points are encoded with an integer representing their order along a chosen space-filling curve. This is done by projecting the point's 3D coordinates $(p_x, p_y, p_z)$ onto a discrete grid of size $g$ and then applying the inverse mapping of the space-filling curve, $\varphi^{-1}(\lfloor \mathbf{p} / g \rfloor)$. Batch indices are prepended to these codes for batched processing.
    • Serialization Process: Points are sorted based on these codes. This creates a structured sequence where spatially proximate points are likely to be neighbors in the sequence. This "breaks" strict permutation invariance but offers significant efficiency. The paper notes that mappings, not physical reordering, are used.
  2. Serialized Attention:
    • Adopts Window/Dot-Product Attention: Leveraging the structured nature of serialized point clouds, PTv3 uses efficient window-based (termed "patch attention") and dot-product attention mechanisms, similar to those in image transformers.
    • Patch Grouping: Points in the serialized sequence are grouped into non-overlapping patches. This involves padding the sequence to be divisible by the patch size and then grouping contiguous points.
    • Patch Interaction: To enable information flow between patches, PTv3 employs a "Shuffle Order" strategy. This involves cyclically assigning different serialization patterns (Z-order, Trans Z-order, Hilbert, Trans Hilbert) to successive attention layers and randomly shuffling the order in which these patterns are assigned. This is found to be more efficient and effective than alternatives such as "Shift Dilation" (staggered grouping) or "Shift Patch" (shifting patch positions, as in Swin Transformer); a sketch of the grouping and pattern cycling follows this list.
    • Positional Encoding (xCPE): PTv3 eliminates complex Relative Positional Encoding (RPE), which took 26% of PTv2's forward time. It introduces "enhanced Conditional Positional Encoding" (xCPE), implemented as a simple sparse convolutional layer with a skip connection prepended directly before the attention layer. This provides positional information efficiently.
  3. Simplified Network Details:
    • Block Structure: Uses a pre-norm structure (Layer Normalization before the main operation) instead of post-norm, and Layer Normalization (LN) instead of Batch Normalization (BN) within attention blocks for better stability with varying batch sizes.
    • Pooling Strategy: Retains Grid Pooling from PTv2. Interestingly, Batch Normalization is found to be crucial here for stabilizing data distribution during pooling, unlike in attention blocks. The Shuffle Order strategy is also integrated into pooling.
    • Model Architecture: Follows a U-Net like framework with four encoder and four decoder stages. Encoder block depths are [2, 2, 6, 2] and decoder depths are [1, 1, 1, 1].
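To make the serialized attention described above concrete, the following is a minimal sketch (with assumed function and variable names, not the actual Pointcept code) of sorting points by the serialization pattern assigned to a layer, padding, and grouping into non-overlapping patches:

# Sketch of serialized patch grouping with pattern cycling ("Shuffle Order").
# `codes_per_pattern` maps a pattern name to precomputed per-point serialization
# codes; padding by repeating trailing points is a simplification.
import torch

PATTERNS = ["z", "z-trans", "hilbert", "hilbert-trans"]

def prepare_patches(feats, codes_per_pattern, layer_idx, patch_size=1024):
    # Cycle through serialization patterns across successive attention layers.
    pattern = PATTERNS[layer_idx % len(PATTERNS)]
    order = torch.argsort(codes_per_pattern[pattern])
    feats = feats[order]                      # serialized point features, (N, C)

    # Pad the sequence so it divides evenly into non-overlapping patches.
    n, c = feats.shape
    pad = (-n) % patch_size
    if pad > 0:
        feats = torch.cat([feats, feats[-pad:]], dim=0)

    # Attention is then computed independently within each patch.
    return feats.view(-1, patch_size, c), order

Because successive layers sort by different curves, a point's patch neighborhood changes from layer to layer, which is what allows information to propagate across patches without explicit window shifting.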

Implementation and Performance

  • Efficiency: PTv3 achieves a 3.3x increase in inference speed and a 10.2x reduction in memory usage compared to PTv2. For example, on the nuScenes dataset, PTv3 with a receptive field of 1024 points has an inference latency of 44ms and uses 1.2G memory, significantly better than PTv2/32 (213ms, 19.4G).
  • Scalability: The design allows the receptive field (patch size for attention) to scale from 16 to 1024 or even 4096 points with minimal impact on latency and memory, thanks to optimizations like FlashAttention.
  • Accuracy: PTv3 achieves state-of-the-art results across numerous benchmarks:
    • Indoor Semantic Segmentation: On ScanNet (test), PTv3 (scratch) achieves 77.9% mIoU (vs. 74.2% for PTv2). With multi-dataset pre-training (PPT), it reaches 79.4% mIoU. Similar gains are seen on S3DIS and ScanNet200.
    • Outdoor Semantic Segmentation: On nuScenes (test), PTv3 (scratch) gets 82.7% mIoU (vs. 82.6% for PTv2). With PPT, it achieves 83.0% mIoU. Improvements are also shown on SemanticKITTI and Waymo.
    • Indoor Instance Segmentation (ScanNet, PointGroup framework): PTv3 (scratch) achieves 40.9% mAP (vs. 38.3% for PTv2).
    • Outdoor Object Detection (Waymo, CenterPoint framework): PTv3 outperforms previous methods, including FlatFormer. For single-frame input, PTv3 achieves 70.5 mAPH (L2) vs. 67.2 mAPH for FlatFormer.
  • Ablation Studies:
    • Serialization Patterns: Using a mix of all four patterns (Z, Trans Z, Hilbert, Trans Hilbert) with Shuffle Order yields the best results (e.g., 77.3% mIoU on ScanNet validation).
    • Patch Interaction: Shuffle Order with multiple serialization patterns is superior in performance and efficiency compared to Shift Dilation or Shift Patch.
    • Positional Encoding: xCPE (77.3% mIoU, 61ms latency) outperforms RPE (75.9%, 72ms) and standard CPE (76.6%, 58ms).
    • Patch Size: Performance generally increases with patch size up to 1024, with only a slight drop at 4096, demonstrating effective scaling of the receptive field.

Practical Implementation Details

  • Frameworks: Implemented using Pointcept for general tasks and OpenPCDet for outdoor object detection.
  • Serialization Encoding Example:

    The encoding for a point $\mathbf{p}$ with batch index $b$ and grid size $g$ is:

    $$\mathrm{Encode}(\mathbf{p}, b, g) = (b \ll k) \mid \varphi^{-1}(\lfloor \mathbf{p} / g \rfloor)$$

    where $k$ is the number of bits reserved for the spatial code; the full code is stored in a 64-bit integer.

# Pseudocode for Z-order (Morton) encoding, simplified to 2D for illustration;
# the actual implementation interleaves the bits of all three coordinates.
import numpy as np

MAX_BITS = 16  # bits per coordinate; depends on the grid resolution

def interleave_bits(x, y):
    # x and y are non-negative integers representing discretized coordinates.
    z_order_code = 0
    for i in range(MAX_BITS):
        z_order_code |= (x & (1 << i)) << (i + 1)  # bit i of x -> bit 2i+1
        z_order_code |= (y & (1 << i)) << i        # bit i of y -> bit 2i
    return z_order_code

# Usage sketch: discretize, encode, prepend the batch index, then sort.
# `points`, `batch_indices`, `grid_size`, `K_BITS_FOR_SPATIAL`, and
# `z_order_3d` (the 3D analogue of interleave_bits) are assumed defined.
p_discrete = np.floor(points / grid_size).astype(np.int64)
codes = []
for i in range(len(p_discrete)):
    spatial_code = z_order_3d(p_discrete[i, 0], p_discrete[i, 1], p_discrete[i, 2])
    codes.append((batch_indices[i] << K_BITS_FOR_SPATIAL) | spatial_code)
sorted_indices = np.argsort(codes)  # serialized order of the point cloud
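
The Hilbert patterns follow the same recipe with a different curve. For illustration, here is the classic iterative 2D Hilbert index computation (shown as an assumed helper; PTv3 uses 3D Hilbert curves and their transposed variants):

# 2D Hilbert-curve index for a point on a (2**order x 2**order) grid.
def hilbert_index_2d(order, x, y):
    n = 1 << order  # grid side length; x and y must lie in [0, n)
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect the quadrant into canonical orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d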

  • xCPE Implementation: This is a sparse convolutional layer placed before the attention mechanism.
    
    # Simplified PyTorch-like pseudocode for an xCPE block (a sketch, not the
    # actual Pointcept implementation; SparseConv3d and
    # SerializedMultiHeadAttention stand in for the real sparse-convolution
    # and serialized-attention modules, and the sparse/serialized feature
    # bookkeeping is omitted for clarity).
    import torch.nn as nn

    class XCPEAttentionBlock(nn.Module):
        def __init__(self, dim, num_heads, sparse_conv_kernel_size=3):
            super().__init__()
            # xCPE: a sparse convolution prepended directly before attention;
            # together with a skip connection it supplies the conditional
            # positional encoding.
            self.sparse_conv = SparseConv3d(dim, dim, kernel_size=sparse_conv_kernel_size)
            # Pre-norm structure: LayerNorm before attention and before the MLP.
            self.norm1 = nn.LayerNorm(dim)
            self.attn = SerializedMultiHeadAttention(dim, num_heads)
            self.norm2 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(
                nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
            )

        def forward(self, x):
            # x: serialized point features (with the associated voxel
            # structure needed by the sparse convolution).
            x = x + self.sparse_conv(x)       # xCPE with skip connection
            x = x + self.attn(self.norm1(x))  # patch attention, pre-norm
            x = x + self.mlp(self.norm2(x))   # feed-forward, pre-norm
            return x
    The paper directly prepends a sparse convolution layer with a skip connection before the attention layer. The features from this sparse convolution, which capture local geometric context, act as the conditional positional encoding.
  • Training:
    • Uses the AdamW optimizer with a cosine learning-rate scheduler (a minimal setup sketch follows this list).
    • Losses: CrossEntropy and Lovász loss for semantic segmentation.
    • Data augmentations include random dropout, rotation, scaling, flipping, jitter, elastic distort, color jitter, etc.
    • Multi-dataset joint training (PPT) significantly boosts performance.
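
A minimal sketch of this training setup (hyperparameter values and the `model`, `train_loader`, `num_epochs`, and `lovasz_softmax` names are placeholders, not the paper's exact configuration):

import torch
import torch.nn as nn

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
ce_loss = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    for batch in train_loader:
        logits = model(batch["points"])  # per-point class logits, (N, num_classes)
        labels = batch["labels"]
        # Semantic segmentation loss: CrossEntropy plus the Lovász surrogate
        # (lovasz_softmax stands in for a third-party implementation).
        loss = ce_loss(logits, labels) + lovasz_softmax(logits.softmax(dim=1), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()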

Limitations and Future Directions

  • Attention Mechanism: Dot-product attention, while efficient, shows slower convergence and limitations in scaling depth compared to vector attention, potentially due to "attention sinks." Further exploration of attention mechanisms is needed.
  • Scaling Parameters: While PTv3 is efficient, scaling model parameters further requires corresponding increases in data and task scope (e.g., multi-task, multi-modal learning).
  • Multi-Modality: Point cloud serialization could be applied to other data types like images, transforming them into 1D sequences. This opens avenues for multi-modal models bridging 2D and 3D, enabling synergistic pre-training.

In summary, Point Transformer V3 offers a practical and effective approach to scaling point cloud transformers by simplifying core components like neighborhood search and positional encoding. Its emphasis on serialization and efficient attention mechanisms leads to significant gains in speed and memory, enabling larger receptive fields and achieving superior performance across a wide range of 3D perception tasks.
