- The paper introduces Point Attention Transformers (PATs) which use Group Shuffle Attention for efficient processing and Gumbel Subset Sampling for differentiable, task-agnostic point cloud subset selection.
- Experiments show PATs achieve 91.7% classification accuracy on ModelNet40 and competitive segmentation results on S3DIS, demonstrating improved parameter efficiency without sacrificing performance.
- PATs are also effectively applied to processing event camera data as point clouds, achieving higher accuracy than traditional methods on the DVS128 Gesture Dataset.
Modeling Point Clouds with Self-Attention and Gumbel Subset Sampling
The development of Point Attention Transformers (PATs) marks a notable step in geometric deep learning for 3D point clouds, a representation gaining traction as 3D sensors such as LiDAR and RGB-D cameras become commonplace. The paper leverages self-attention mechanisms, akin to those used in natural language processing, to process and reason about point clouds, and pairs them with a differentiable sampling method that improves both efficiency and effectiveness.
Key Contributions
- Group Shuffle Attention (GSA): PATs replace conventional Multi-Head Attention with the more parameter-efficient Group Shuffle Attention (GSA). GSA retains the relational modeling power of attention while reducing computational cost, achieved through group linear transformations with channel shuffling, and it preserves the permutation equivariance required for point cloud processing (a minimal sketch follows the list below).
- Gumbel Subset Sampling (GSS): The paper proposes Gumbel Subset Sampling (GSS), a task-agnostic, permutation-invariant sampling technique. GSS provides an end-to-end differentiable approach to hierarchical subset selection that does not rely on heuristics such as Furthest Point Sampling (FPS). By employing the Gumbel-Softmax trick, GSS performs continuous soft sampling during training and discrete hard sampling at test time via temperature annealing (see the second sketch below).
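To make the GSA idea concrete, the following is a minimal PyTorch-style sketch, not the authors' implementation: a grouped linear transform followed by channel shuffling and a single shared scaled dot-product self-attention over points. The module name, group count, activation, and normalization choices are illustrative assumptions.

```python
# Hedged sketch of a Group Shuffle Attention-style block (illustrative, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

def channel_shuffle(x, groups):
    # x: (batch, n_points, channels); interleave channels across groups (ShuffleNet-style)
    b, n, c = x.shape
    x = x.view(b, n, groups, c // groups).transpose(2, 3).contiguous()
    return x.view(b, n, c)

class GroupShuffleAttention(nn.Module):
    def __init__(self, channels, groups=8):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        # Group linear transform implemented as a grouped 1x1 convolution
        self.group_linear = nn.Conv1d(channels, channels, 1, groups=groups)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):
        # x: (batch, n_points, channels)
        h = self.group_linear(x.transpose(1, 2)).transpose(1, 2)
        h = channel_shuffle(F.elu(h), self.groups)
        # Single shared scaled dot-product self-attention over the point dimension
        attn = torch.softmax(h @ h.transpose(1, 2) / h.shape[-1] ** 0.5, dim=-1)
        return self.norm(x + attn @ h)  # residual connection keeps permutation equivariance
```

Because the attention weights depend only on pairwise point features, permuting the input points permutes the output identically, which is the equivariance property the bullet above refers to.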
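Likewise, a hedged sketch of the GSS idea: each output point is drawn from the input set with Gumbel-Softmax weights, giving soft mixtures of points during training and near-discrete selection once the temperature is annealed (or `hard=True` is used). The learned-query parameterization of the sampling logits below is an assumption for illustration.

```python
# Hedged sketch of Gumbel Subset Sampling in spirit (illustrative parameterization).
import torch
import torch.nn as nn
import torch.nn.functional as F

class GumbelSubsetSampling(nn.Module):
    def __init__(self, channels, n_out):
        super().__init__()
        # One learned query per output point; hypothetical choice, not the paper's exact design
        self.queries = nn.Parameter(torch.randn(n_out, channels))

    def forward(self, x, tau=1.0, hard=False):
        # x: (batch, n_points, channels) -> (batch, n_out, channels)
        logits = torch.einsum('mc,bnc->bmn', self.queries, x)   # (batch, n_out, n_points)
        w = F.gumbel_softmax(logits, tau=tau, hard=hard, dim=-1)
        return w @ x  # soft mixture while training; one-hot selection when hard/annealed
```

Annealing `tau` toward zero (or setting `hard=True` at test time) turns the soft mixtures into discrete point selections while keeping gradients available during training.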
Experimental Validation
Experiments on standard classification and segmentation benchmarks, ModelNet40 and S3DIS, corroborate the efficacy of PATs. The paper reports 91.7% classification accuracy on ModelNet40 when combining FPS and GSS for downsampling, highlighting gains in parameter efficiency and computational cost without sacrificing performance. On S3DIS segmentation, PATs are competitive with or superior to existing state-of-the-art models across several metrics.
Moreover, applying PATs to event camera data treated as point clouds, evaluated on the DVS128 Gesture Dataset, underscores the versatility of the approach. In this setting, PATs achieve higher accuracy than prior CNN-based methods, setting a new state of the art by directly handling the spatio-temporal structure of event camera streams (a conversion sketch follows).
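To illustrate how an event stream can be treated as a point cloud, here is a small sketch of one plausible conversion, assuming raw (x, y, t) events from a 128x128 DVS sensor; the normalization and fixed-size subsampling choices are assumptions for illustration, not the paper's exact preprocessing.

```python
# Hedged sketch: turning a window of DVS events into a fixed-size point cloud.
import numpy as np

def events_to_point_cloud(events, n_points=1024):
    """events: array of (x, y, t) rows from a DVS sensor within one time window."""
    xs, ys, ts = events[:, 0], events[:, 1], events[:, 2]
    # Normalize pixel coordinates to [0, 1] for a 128x128 sensor (DVS128) and
    # timestamps to [0, 1] within the window, so time acts as a third spatial-like axis.
    pts = np.stack([xs / 127.0, ys / 127.0,
                    (ts - ts.min()) / max(ts.max() - ts.min(), 1e-9)], axis=1)
    # Randomly subsample (or pad by repetition) to a fixed-size point set
    idx = np.random.choice(len(pts), n_points, replace=len(pts) < n_points)
    return pts[idx].astype(np.float32)  # (n_points, 3)
```

The resulting (x, y, t) points can then be fed to the same PAT pipeline used for ordinary 3D point clouds, which is what makes the event-camera application a natural fit.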
Practical and Theoretical Implications
The exploration of self-attention mechanisms such as GSA in the context of 3D point clouds underscores the potential of attention-based models to generalize beyond text and 2D images to more complex data structures. GSS further expands the toolkit for end-to-end deep learning on unstructured data and may influence future architectures that require efficient, differentiable subset sampling or selection.
Future Directions
While the current results establish the strength of PATs, future work could reduce GSA's computational overhead through more optimized implementations, or adapt GSS to applications beyond 3D point clouds, such as other high-dimensional data or domains that demand finer-grained sampling strategies. Additionally, integrating PATs with hardware acceleration for real-time use on power-constrained devices, such as event cameras, presents both a practical challenge and an opportunity.
In conclusion, the synergistic integration of self-attention with innovative sampling mechanisms in PATs contributes substantially to geometric deep learning methodologies, offering robust alternatives for efficiently processing and reasoning about 3D point cloud data.