
Point Transformer V3: Simpler, Faster, Stronger (2312.10035v2)

Published 15 Dec 2023 in cs.CV

Abstract: This paper is not motivated to seek innovation within the attention mechanism. Instead, it focuses on overcoming the existing trade-offs between accuracy and efficiency within the context of point cloud processing, leveraging the power of scale. Drawing inspiration from recent advances in 3D large-scale representation learning, we recognize that model performance is more influenced by scale than by intricate design. Therefore, we present Point Transformer V3 (PTv3), which prioritizes simplicity and efficiency over the accuracy of certain mechanisms that are minor to the overall performance after scaling, such as replacing the precise neighbor search by KNN with an efficient serialized neighbor mapping of point clouds organized with specific patterns. This principle enables significant scaling, expanding the receptive field from 16 to 1024 points while remaining efficient (a 3x increase in processing speed and a 10x improvement in memory efficiency compared with its predecessor, PTv2). PTv3 attains state-of-the-art results on over 20 downstream tasks that span both indoor and outdoor scenarios. Further enhanced with multi-dataset joint training, PTv3 pushes these results to a higher level.

References (102)
  1. Ext5: Towards extreme multi-task scaling for transfer learning. In ICLR, 2022.
  2. 3d semantic parsing of large-scale indoor spaces. In CVPR, 2016.
  3. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In ICCV, 2019.
  4. The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In CVPR, pages 4413–4421, 2018.
  5. nuscenes: A multimodal dataset for autonomous driving. In CVPR, 2020.
  6. Emerging properties in self-supervised vision transformers. In CVPR, 2021.
  7. Multi-view 3d object detection network for autonomous driving. In CVPR, 2017.
  8. Largekernel3d: Scaling up kernels in 3d sparse cnns. In CVPR, 2023.
  9. (af)2-s3net: Attentive feature fusion with adaptive feature selection for sparse semantic segmentation network. In CVPR, 2021.
  10. A unified point-based framework for 3d segmentation. In 3DV, 2019.
  11. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
  12. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In CVPR, 2019.
  13. Conditional positional encodings for vision transformers. arXiv:2102.10882, 2021.
  14. Pointcept Contributors. Pointcept: A codebase for point cloud perception research. https://github.com/Pointcept/Pointcept, 2023.
  15. 3dmv: Joint 3d-multi-view prediction for 3d semantic scene segmentation. In ECCV, 2018.
  16. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017.
  17. Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv:2307.08691, 2023.
  18. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In NeurIPS, 2022.
  19. Cswin transformer: A general vision transformer backbone with cross-shaped windows. In CVPR, 2022.
  20. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
  21. Embracing single stride 3d object detector with sparse transformer. In CVPR, 2022.
  22. Self-supervised pretraining of visual features in the wild. arXiv:2103.01988, 2021.
  23. 3d semantic segmentation with submanifold sparse convolutional networks. In CVPR, 2018.
  24. Guangsheng Shi, Ruifeng Li, Chao Ma. Pillarnet: Real-time and high-performance pillar-based 3d object detection. In ECCV, 2022.
  25. Pct: Point cloud transformer. Computational Visual Media, 2021.
  26. Voxel set transformer: A set-to-set approach to 3d object detection from point clouds. In CVPR, 2022.
  27. Über die stetige abbildung einer linie auf ein flächenstück. Dritter Band: Analysis· Grundlagen der Mathematik· Physik Verschiedenes: Nebst Einer Lebensgeschichte, 1935.
  28. Exploring data-efficient 3d scene understanding with contrastive scene contexts. In CVPR, 2021.
  29. Point-to-voxel knowledge distillation for lidar semantic segmentation. In CVPR, 2022.
  30. Randla-net: Efficient semantic segmentation of large-scale point clouds. In CVPR, 2020a.
  31. Jsenet: Joint semantic segmentation and edge detection network for 3d point clouds. In ECCV, 2020b.
  32. Hierarchical point-edge interaction network for point cloud semantic segmentation. In ICCV, 2019.
  33. Pointgroup: Dual-set point grouping for 3d instance segmentation. CVPR, 2020.
  34. Self-supervised pre-training with masked shape prediction for 3d scene understanding. In CVPR, 2023.
  35. Scaling laws for neural language models. arXiv:2001.08361, 2020.
  36. Segment anything. In ICCV, 2023.
  37. Rethinking range view representation for lidar segmentation. In ICCV, 2023.
  38. Stratified transformer for 3d point cloud segmentation. In CVPR, 2022.
  39. Spherical transformer for lidar-based 3d recognition. In CVPR, 2023.
  40. Large-scale point cloud semantic segmentation with superpoint graphs. In CVPR, 2018.
  41. Pointpillars: Fast encoders for object detection from point clouds. In CVPR, 2019.
  42. Seggcn: Efficient 3d point cloud segmentation with fuzzy spherical kernel. In CVPR, 2020.
  43. Vehicle detection from 3d lidar using fully convolutional network. In RSS, 2016.
  44. Pointcnn: Convolution on x-transformed points. NeurIPS, 2018.
  45. Meta architecture for point cloud analysis. In CVPR, pages 17682–17691, 2023.
  46. Swin transformer: Hierarchical vision transformer using shifted windows. ICCV, 2021.
  47. Swin transformer v2: Scaling up capacity and resolution. In CVPR, 2022.
  48. Flatformer: Flattened window attention for efficient point cloud transformer. In CVPR, 2023.
  49. Rethinking network design and local geometry in point cloud: A simple residual mlp framework. ICLR, 2022.
  50. Voxnet: A 3d convolutional neural network for real-time object recognition. In IROS, 2015.
  51. Guy M Morton. A computer oriented geodetic data base and a new technique in file sequencing. International Business Machines Company New York, 1966.
  52. Panopticfusion: Online volumetric semantic mapping at the level of stuff and things. In IROS, 2019.
  53. OpenAI. Gpt-4 technical report. arXiv:2303.08774, 2023.
  54. Masked autoencoders for point cloud self-supervised learning. In ECCV, 2022.
  55. Fast point transformer. In CVPR, pages 16949–16958, 2022.
  56. Sur une courbe, qui remplit toute une aire plane. Springer, 1990.
  57. Using a waffle iron for automotive point cloud semantic segmentation. In ICCV, 2023.
  58. Pointnet: Deep learning on point sets for 3d classification and segmentation. In CVPR, 2017a.
  59. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NeurIPS, 2017b.
  60. Pointnext: Revisiting pointnet++ with improved training and scaling strategies. NeurIPS, 2022.
  61. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.
  62. Language-grounded indoor 3d semantic segmentation in the wild. In ECCV, 2022.
  63. Aditya Sanghi. Info3d: Representation learning on 3d objects using mutual information maximization and contrastive learning. In ECCV, 2020.
  64. Self-supervised deep learning on point clouds by reconstructing space. In NeurIPS, 2019.
  65. Semantic scene completion from a single depth image. In CVPR, 2017.
  66. Multi-view convolutional neural networks for 3d shape recognition. In ICCV, 2015.
  67. Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2020.
  68. Searching efficient 3d architectures with sparse point-voxel convolution. In ECCV, 2020.
  69. Tangent convolutions for dense prediction in 3d. In CVPR, 2018.
  70. Segcloud: Semantic segmentation of 3d point clouds. In 3DV, 2017.
  71. OpenPCDet Development Team. Openpcdet: An open-source toolbox for 3d object detection from point clouds. https://github.com/open-mmlab/OpenPCDet, 2020.
  72. Kpconv: Flexible and deformable convolution for point clouds. In ICCV, 2019.
  73. Divide and contrast: Self-supervised learning from uncurated data. In CVPR, 2021.
  74. Llama: Open and efficient foundation language models. arXiv:2302.13971, 2023.
  75. Attention is all you need. In NeurIPS, 2017.
  76. Graph attention convolution for point cloud semantic segmentation. In CVPR, 2019.
  77. Peng-Shuai Wang. Octformer: Octree-based transformers for 3D point clouds. SIGGRAPH, 2023.
  78. O-CNN: Octree-based convolutional neural networks for 3D shape analysis. SIGGRAPH, 36(4), 2017.
  79. Deep parametric continuous convolutional neural networks. In CVPR, 2018.
  80. Images speak in images: A generalist painter for in-context visual learning. In CVPR, 2023.
  81. Deep closest point: Learning representations for point cloud registration. In ICCV, 2019.
  82. Pointconv: Deep convolutional networks on 3d point clouds. In CVPR, 2019.
  83. Pointconvformer: Revenge of the point-based convolution. In CVPR, pages 21802–21813, 2023a.
  84. Point transformer v2: Grouped vector attention and partition-based pooling. In NeurIPS, 2022.
  85. Towards large-scale 3d representation learning with multi-dataset point prompt training. arXiv:2308.09718, 2023b.
  86. Masked scene contrast: A scalable framework for unsupervised 3d representation learning. In CVPR, 2023c.
  87. Efficient streaming language models with attention sinks. arXiv, 2023.
  88. Pointcontrast: Unsupervised pre-training for 3d point cloud understanding. In ECCV, 2020.
  89. On layer normalization in the transformer architecture. In ICML, 2020.
  90. Paconv: Position adaptive convolution with dynamic kernel assembling on point clouds. In CVPR, 2021.
  91. Pointasnl: Robust point clouds processing using nonlocal neural networks with adaptive sampling. In CVPR, 2020.
  92. 2dpass: 2d priors assisted semantic segmentation on lidar point clouds. In ECCV, 2022.
  93. Second: Sparsely embedded convolutional detection. Sensors, 18(10):3337, 2018.
  94. Modeling point clouds with self-attention and gumbel subset sampling. In CVPR, 2019.
  95. Swin3d: A pretrained transformer backbone for 3d indoor scene understanding. arXiv:2304.06906, 2023.
  96. Center-based 3d object detection and tracking. In CVPR, 2021.
  97. Point-BERT: Pre-training 3D point cloud transformers with masked point modeling. In CVPR, 2022.
  98. Deep fusionnet for point cloud semantic segmentation. In ECCV, 2020.
  99. Pointweb: Enhancing local neighborhood features for point cloud processing. In CVPR, 2019.
  100. Point transformer. In ICCV, 2021.
  101. Ponderv2: Pave the way for 3d foundation model with a universal pre-training paradigm. arXiv:2310.08586, 2023.
  102. Cylindrical and asymmetrical 3d convolution networks for lidar segmentation. In CVPR, 2021.
Authors (9)
  1. Xiaoyang Wu (28 papers)
  2. Li Jiang (88 papers)
  3. Peng-Shuai Wang (24 papers)
  4. Zhijian Liu (41 papers)
  5. Xihui Liu (92 papers)
  6. Yu Qiao (563 papers)
  7. Wanli Ouyang (358 papers)
  8. Tong He (124 papers)
  9. Hengshuang Zhao (118 papers)
Citations (104)

Summary

This paper introduces Point Transformer V3 (PTv3), a novel architecture for 3D point cloud processing that prioritizes simplicity and efficiency to achieve scalability and, consequently, stronger performance. The core idea is that model performance is more influenced by scale (dataset size, model parameters, receptive field, compute) than by intricate, often computationally expensive, design choices. PTv3 makes several key design changes to its predecessor (PTv2) to enhance speed and reduce memory consumption, allowing it to process larger receptive fields (up to 1024 points from 16 in PTv2) and achieve state-of-the-art results on over 20 downstream tasks.

Key Design Principles and Innovations

The development of PTv3 is guided by the "scaling principle," which suggests trading the marginal accuracy gains from complex mechanisms for significant improvements in simplicity and efficiency. This enables the model to scale effectively, and the performance lost from simplification is often regained or surpassed through this enhanced scalability.

The main adaptations in PTv3 include:

  1. Point Cloud Serialization:
    • Replaces K-Nearest Neighbors (KNN): Instead of costly KNN searches (which accounted for 28% of PTv2's forward time) for defining local neighborhoods, PTv3 serializes point clouds. This involves organizing points according to specific patterns, primarily using space-filling curves like Z-order and Hilbert curves (and their transposed variants: Trans Z-order, Trans Hilbert).
    • Serialized Encoding: Points are encoded with an integer representing their order along a chosen space-filling curve. This is done by projecting the point's 3D coordinates $(p_x, p_y, p_z)$ onto a discrete grid of size $g$ and then applying the inverse mapping of the space-filling curve, $\varphi^{-1}(\lfloor \mathbf{p} / g \rfloor)$. Batch indices are prepended to these codes for batched processing.
    • Serialization Process: Points are sorted based on these codes. This creates a structured sequence where spatially proximate points are likely to be neighbors in the sequence. This "breaks" strict permutation invariance but offers significant efficiency. The paper notes that mappings, not physical reordering, are used.
  2. Serialized Attention:
    • Adopts Window/Dot-Product Attention: Leveraging the structured nature of serialized point clouds, PTv3 uses efficient window-based (termed "patch attention") and dot-product attention mechanisms, similar to those in image transformers.
    • Patch Grouping: Points in the serialized sequence are grouped into non-overlapping patches. This involves padding the sequence to be divisible by the patch size and then grouping contiguous points.
    • Patch Interaction: To enable information flow between patches, PTv3 employs a "Shuffle Order" strategy. This involves cyclically assigning different serialization patterns (Z-order, Trans Z-order, Hilbert, Trans Hilbert) to successive attention layers and randomly shuffling the order in which these patterns are assigned. This is found to be more efficient and effective than alternatives such as "Shift Dilation" (staggered grouping) or "Shift Patch" (shifting patch positions, as in Swin Transformer); a sketch of the grouping and pattern cycling follows this list.
    • Positional Encoding (xCPE): PTv3 eliminates complex Relative Positional Encoding (RPE), which took 26% of PTv2's forward time. It introduces "enhanced Conditional Positional Encoding" (xCPE), implemented as a simple sparse convolutional layer with a skip connection prepended directly before the attention layer. This provides positional information efficiently.
  3. Simplified Network Details:
    • Block Structure: Uses a pre-norm structure (Layer Normalization before the main operation) instead of post-norm, and Layer Normalization (LN) instead of Batch Normalization (BN) within attention blocks for better stability with varying batch sizes.
    • Pooling Strategy: Retains Grid Pooling from PTv2. Interestingly, Batch Normalization is found to be crucial here for stabilizing data distribution during pooling, unlike in attention blocks. The Shuffle Order strategy is also integrated into pooling.
    • Model Architecture: Follows a U-Net like framework with four encoder and four decoder stages. Encoder block depths are [2, 2, 6, 2] and decoder depths are [1, 1, 1, 1].
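To make the serialized attention described above concrete, the following is a minimal sketch (with assumed function and variable names, not the actual Pointcept code) of sorting points by the serialization pattern assigned to a layer, padding, and grouping into non-overlapping patches:

# Sketch of serialized patch grouping with pattern cycling ("Shuffle Order").
# `codes_per_pattern` maps a pattern name to precomputed per-point serialization
# codes; padding by repeating trailing points is a simplification.
import torch

PATTERNS = ["z", "z-trans", "hilbert", "hilbert-trans"]

def prepare_patches(feats, codes_per_pattern, layer_idx, patch_size=1024):
    # Cycle through serialization patterns across successive attention layers.
    pattern = PATTERNS[layer_idx % len(PATTERNS)]
    order = torch.argsort(codes_per_pattern[pattern])
    feats = feats[order]                      # serialized point features, (N, C)

    # Pad the sequence so it divides evenly into non-overlapping patches.
    n, c = feats.shape
    pad = (-n) % patch_size
    if pad > 0:
        feats = torch.cat([feats, feats[-pad:]], dim=0)

    # Attention is then computed independently within each patch.
    return feats.view(-1, patch_size, c), order

Because successive layers sort by different curves, a point's patch neighborhood changes from layer to layer, which is what allows information to propagate across patches without explicit window shifting.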

Implementation and Performance

  • Efficiency: PTv3 achieves a 3.3x increase in inference speed and a 10.2x reduction in memory usage compared to PTv2. For example, on the nuScenes dataset, PTv3 with a receptive field of 1024 points has an inference latency of 44ms and uses 1.2G memory, significantly better than PTv2/32 (213ms, 19.4G).
  • Scalability: The design allows the receptive field (patch size for attention) to scale from 16 to 1024 or even 4096 points with minimal impact on latency and memory, thanks to optimizations like FlashAttention.
  • Accuracy: PTv3 achieves state-of-the-art results across numerous benchmarks:
    • Indoor Semantic Segmentation: On ScanNet (test), PTv3 (scratch) achieves 77.9% mIoU (vs. 74.2% for PTv2). With multi-dataset pre-training (PPT), it reaches 79.4% mIoU. Similar gains are seen on S3DIS and ScanNet200.
    • Outdoor Semantic Segmentation: On nuScenes (test), PTv3 (scratch) gets 82.7% mIoU (vs. 82.6% for PTv2). With PPT, it achieves 83.0% mIoU. Improvements are also shown on SemanticKITTI and Waymo.
    • Indoor Instance Segmentation (ScanNet, PointGroup framework): PTv3 (scratch) achieves 40.9% mAP (vs. 38.3% for PTv2).
    • Outdoor Object Detection (Waymo, CenterPoint framework): PTv3 outperforms previous methods, including FlatFormer. For single-frame input, PTv3 achieves 70.5 mAPH (L2) vs. 67.2 mAPH for FlatFormer.
  • Ablation Studies:
    • Serialization Patterns: Using a mix of all four patterns (Z, Trans Z, Hilbert, Trans Hilbert) with Shuffle Order yields the best results (e.g., 77.3% mIoU on ScanNet validation).
    • Patch Interaction: Shuffle Order with multiple serialization patterns is superior in performance and efficiency compared to Shift Dilation or Shift Patch.
    • Positional Encoding: xCPE (77.3% mIoU, 61ms latency) outperforms RPE (75.9%, 72ms) and standard CPE (76.6%, 58ms).
    • Patch Size: Performance generally increases with patch size up to 1024, with only a slight drop at 4096, demonstrating effective scaling of the receptive field.

Practical Implementation Details

  • Frameworks: Implemented using Pointcept for general tasks and OpenPCDet for outdoor object detection.
  • Serialization Encoding Example:

    The encoding for a point $\mathbf{p}$ with batch index $b$ and grid size $g$ is:

    $$\mathrm{Encode}(\mathbf{p}, b, g) = (b \ll k) \mid \varphi^{-1}(\lfloor \mathbf{p} / g \rfloor)$$

    where $k$ is the number of bits reserved for the spatial code; the full code is stored in a 64-bit integer.

# Pseudocode for Z-order (Morton) encoding, simplified to 2D for illustration;
# the actual implementation interleaves the bits of all three coordinates.
import numpy as np

MAX_BITS = 16  # bits per coordinate; depends on the grid resolution

def interleave_bits(x, y):
    # x and y are non-negative integers representing discretized coordinates.
    z_order_code = 0
    for i in range(MAX_BITS):
        z_order_code |= (x & (1 << i)) << (i + 1)  # bit i of x -> bit 2i+1
        z_order_code |= (y & (1 << i)) << i        # bit i of y -> bit 2i
    return z_order_code

# Usage sketch: discretize, encode, prepend the batch index, then sort.
# `points`, `batch_indices`, `grid_size`, `K_BITS_FOR_SPATIAL`, and
# `z_order_3d` (the 3D analogue of interleave_bits) are assumed defined.
p_discrete = np.floor(points / grid_size).astype(np.int64)
codes = []
for i in range(len(p_discrete)):
    spatial_code = z_order_3d(p_discrete[i, 0], p_discrete[i, 1], p_discrete[i, 2])
    codes.append((batch_indices[i] << K_BITS_FOR_SPATIAL) | spatial_code)
sorted_indices = np.argsort(codes)  # serialized order of the point cloud
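
The Hilbert patterns follow the same recipe with a different curve. For illustration, here is the classic iterative 2D Hilbert index computation (shown as an assumed helper; PTv3 uses 3D Hilbert curves and their transposed variants):

# 2D Hilbert-curve index for a point on a (2**order x 2**order) grid.
def hilbert_index_2d(order, x, y):
    n = 1 << order  # grid side length; x and y must lie in [0, n)
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/reflect the quadrant into canonical orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d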

  • xCPE Implementation: This is a sparse convolutional layer placed before the attention mechanism.
    
    # Simplified PyTorch-like pseudocode for an xCPE block (a sketch, not the
    # actual Pointcept implementation; SparseConv3d and
    # SerializedMultiHeadAttention stand in for the real sparse-convolution
    # and serialized-attention modules, and the sparse/serialized feature
    # bookkeeping is omitted for clarity).
    import torch.nn as nn

    class XCPEAttentionBlock(nn.Module):
        def __init__(self, dim, num_heads, sparse_conv_kernel_size=3):
            super().__init__()
            # xCPE: a sparse convolution prepended directly before attention;
            # together with a skip connection it supplies the conditional
            # positional encoding.
            self.sparse_conv = SparseConv3d(dim, dim, kernel_size=sparse_conv_kernel_size)
            # Pre-norm structure: LayerNorm before attention and before the MLP.
            self.norm1 = nn.LayerNorm(dim)
            self.attn = SerializedMultiHeadAttention(dim, num_heads)
            self.norm2 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(
                nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
            )

        def forward(self, x):
            # x: serialized point features (with the associated voxel
            # structure needed by the sparse convolution).
            x = x + self.sparse_conv(x)       # xCPE with skip connection
            x = x + self.attn(self.norm1(x))  # patch attention, pre-norm
            x = x + self.mlp(self.norm2(x))   # feed-forward, pre-norm
            return x
    The paper directly prepends a sparse convolution layer with a skip connection before the attention layer. The features from this sparse convolution, which capture local geometric context, act as the conditional positional encoding.
  • Training:
    • Uses the AdamW optimizer with a cosine learning-rate scheduler (a minimal setup sketch follows this list).
    • Losses: CrossEntropy and Lovász loss for semantic segmentation.
    • Data augmentations include random dropout, rotation, scaling, flipping, jitter, elastic distort, color jitter, etc.
    • Multi-dataset joint training (PPT) significantly boosts performance.
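
A minimal sketch of this training setup (hyperparameter values and the `model`, `train_loader`, `num_epochs`, and `lovasz_softmax` names are placeholders, not the paper's exact configuration):

import torch
import torch.nn as nn

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-3, weight_decay=0.05)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)
ce_loss = nn.CrossEntropyLoss()

for epoch in range(num_epochs):
    for batch in train_loader:
        logits = model(batch["points"])  # per-point class logits, (N, num_classes)
        labels = batch["labels"]
        # Semantic segmentation loss: CrossEntropy plus the Lovász surrogate
        # (lovasz_softmax stands in for a third-party implementation).
        loss = ce_loss(logits, labels) + lovasz_softmax(logits.softmax(dim=1), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()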

Limitations and Future Directions

  • Attention Mechanism: Dot-product attention, while efficient, shows slower convergence and limitations in scaling depth compared to vector attention, potentially due to "attention sinks." Further exploration of attention mechanisms is needed.
  • Scaling Parameters: While PTv3 is efficient, scaling model parameters further requires corresponding increases in data and task scope (e.g., multi-task, multi-modal learning).
  • Multi-Modality: Point cloud serialization could be applied to other data types like images, transforming them into 1D sequences. This opens avenues for multi-modal models bridging 2D and 3D, enabling synergistic pre-training.

In summary, Point Transformer V3 offers a practical and effective approach to scaling point cloud transformers by simplifying core components like neighborhood search and positional encoding. Its emphasis on serialization and efficient attention mechanisms leads to significant gains in speed and memory, enabling larger receptive fields and achieving superior performance across a wide range of 3D perception tasks.
