
PointRWKV: Efficient RWKV-Like Model for Hierarchical Point Cloud Learning (2405.15214v2)

Published 24 May 2024 in cs.CV

Abstract: Transformers have revolutionized point cloud learning, but their quadratic complexity hinders extension to long sequences and burdens limited computational resources. The recent advent of RWKV, a fresh breed of deep sequence models, has shown immense potential for sequence modeling in NLP tasks. In this paper, we present PointRWKV, a model of linear complexity derived from the RWKV model in the NLP field, with the modifications necessary for point cloud learning tasks. Specifically, taking embedded point patches as input, we first propose to explore the global processing capabilities within PointRWKV blocks using modified multi-headed matrix-valued states and a dynamic attention recurrence mechanism. To extract local geometric features simultaneously, we design a parallel branch that encodes the point cloud efficiently in a fixed-radius near-neighbors graph with a graph stabilizer. Furthermore, we design PointRWKV as a multi-scale framework for hierarchical feature learning of 3D point clouds, facilitating various downstream tasks. Extensive experiments on different point cloud learning tasks show that PointRWKV outperforms transformer- and Mamba-based counterparts while saving about 42% of FLOPs, demonstrating its potential as an option for constructing foundational 3D models.


Summary

  • The paper introduces PointRWKV, a novel model that applies a RWKV-based approach to efficiently process 3D point clouds with linear complexity.
  • It employs a hierarchical strategy using multi-scale masking with FPS and k-NN, combined with PRWKV blocks that integrate BQE and local graph merging.
  • Experiments demonstrate remarkable performance with accuracies of 97.52% on ScanObjectNN and 96.89% on ModelNet40, and enhanced few-shot learning capabilities.

Efficient RWKV-Based Approach for Hierarchical Point Cloud Learning

This essay discusses the paper "PointRWKV: Efficient RWKV-Like Model for Hierarchical Point Cloud Learning," which presents a novel method for point cloud learning by leveraging the Receptance Weighted Key Value (RWKV) model originally formulated for NLP tasks. The authors introduce PointRWKV and demonstrate its efficiency and superior performance in 3D point cloud processing.

Introduction and Motivation

Processing 3D point clouds poses inherent challenges due to their irregularity and sparsity. Traditional approaches such as Transformers, although effective, suffer from quadratic complexity in sequence length, which impedes scalability. PointRWKV addresses this issue by adopting an architecture inspired by RWKV, whose sequence modeling cost grows only linearly with sequence length.
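The linear-complexity claim rests on RWKV's core idea: replacing pairwise attention with a running exponentially decayed state, so each token is processed in constant time. Below is a minimal scalar sketch of such a weighted key-value recurrence; the decay `w` and bonus `u` are illustrative constants and this simplification omits the multi-headed matrix-valued states the paper actually uses.

```python
import numpy as np

def wkv_recurrence(k, v, w=0.5, u=0.3):
    """Linear-time RWKV-style weighted key-value scan (simplified, scalar form).

    k, v: arrays of shape (T,). Each output mixes all past tokens through a
    running exponential state, so the total cost is O(T) rather than the
    O(T^2) of pairwise attention. Illustrative only, not the paper's formula.
    """
    num, den = 0.0, 0.0                      # running weighted numerator/denominator
    out = np.empty_like(v, dtype=float)
    for t in range(len(k)):
        cur = np.exp(u + k[t])               # bonus-weighted current token
        out[t] = (num + cur * v[t]) / (den + cur)
        decay = np.exp(-w)                   # exponential decay of past state
        num = decay * (num + np.exp(k[t]) * v[t])
        den = decay * (den + np.exp(k[t]))
    return out
```

Because the state is a single running pair `(num, den)`, doubling the sequence length merely doubles the work, in contrast to attention's quadratic growth.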

Methodology

Hierarchical Point Cloud Learning

The PointRWKV model employs a hierarchical architecture, encoding point clouds at multiple scales to capture both local and global features. Multi-scale masking with Farthest Point Sampling (FPS) and k-nearest neighbors (k-NN) produces several resolutions of the input point cloud, and a mini-PointNet then embeds the masked point patches into token embeddings.
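The FPS-plus-k-NN patching step above can be sketched in plain NumPy; the function names are illustrative, not the paper's API, and the mini-PointNet embedding is omitted.

```python
import numpy as np

def farthest_point_sampling(points, m):
    """Greedy FPS: pick m well-spread center indices from an (N, 3) cloud."""
    chosen = [0]                             # start from an arbitrary point
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(m - 1):
        nxt = int(np.argmax(dist))           # point farthest from all chosen
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

def knn_patches(points, centers_idx, k):
    """Group the k nearest neighbors of each center into a local patch."""
    centers = points[centers_idx]                                  # (m, 3)
    d = np.linalg.norm(points[None] - centers[:, None], axis=-1)   # (m, N)
    return np.argsort(d, axis=1)[:, :k]                            # (m, k)
```

Running FPS with progressively smaller `m` yields the multiple resolutions the hierarchical framework consumes; each patch of `k` neighbors would then be embedded into one token.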

PRWKV Blocks

The core element of the model is the PRWKV block, which consists of two parallel branches: integrative feature modulation and local graph-based merging. The integrative feature branch applies a modified bidirectional quadratic expansion (BQE) function with spatial- and channel-mixing mechanisms, enhancing interaction among tokens. The local graph-based merging branch, equipped with a graph stabilizer, maintains local geometric consistency by iteratively refining vertex features.
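The two-branch layout can be illustrated with a schematic NumPy sketch. Only the shapes and the parallel-then-fuse structure mirror the paper's design; the token shift, neighborhood mean, and residual "stabilizer" below are deliberately naive stand-ins for the actual BQE mixing and graph stabilizer.

```python
import numpy as np

def prwkv_block_sketch(tokens, neighbor_idx, alpha=0.5):
    """Schematic two-branch PRWKV-style block (illustrative only).

    tokens:       (T, C) patch token embeddings
    neighbor_idx: (T, k) indices of each token's graph neighbors
    """
    # Global branch: a token shift mixes each token with its predecessor,
    # a crude stand-in for bidirectional spatial/channel mixing.
    shifted = np.roll(tokens, 1, axis=0)
    shifted[0] = tokens[0]
    global_feat = np.tanh(alpha * tokens + (1 - alpha) * shifted)

    # Local branch: average features over each token's graph neighborhood,
    # with a residual connection acting as a naive stabilizer.
    local_feat = tokens[neighbor_idx].mean(axis=1) + tokens

    return global_feat + local_feat          # fuse the two parallel branches
```

The key design point preserved here is that global sequence mixing and local graph aggregation run in parallel on the same tokens and are fused afterward, rather than being stacked sequentially.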

Experiments and Results

Classification Tasks

Extensive experiments demonstrate PointRWKV's prowess across several prominent datasets. On the ScanObjectNN dataset, PointRWKV attains 97.52% overall accuracy, outperforming state-of-the-art methods, including transformer- and Mamba-based models, by significant margins (Tab. 1). Similarly, on ModelNet40, it achieves 96.89% overall accuracy, a marked improvement over existing techniques.

Part Segmentation

For the ShapeNetPart dataset, PointRWKV achieves 90.26% instance mean Intersection over Union (mIoU), setting new benchmarks for both class and instance-level segmentation tasks while maintaining lower parameter counts and FLOPs (Tab. 2).
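For reference, instance mIoU averages per-part IoU within each shape and then averages over all shapes. The sketch below follows one common ShapeNetPart convention (a part absent from both prediction and ground truth scores IoU 1); exact conventions vary across evaluation scripts.

```python
import numpy as np

def instance_miou(pred, gt, num_parts):
    """Instance mean IoU over a list of shapes.

    pred, gt: lists of (N_i,) integer part-label arrays, one per shape.
    """
    per_shape = []
    for p, g in zip(pred, gt):
        ious = []
        for c in range(num_parts):
            inter = np.sum((p == c) & (g == c))
            union = np.sum((p == c) | (g == c))
            # Part absent from both prediction and ground truth counts as 1.
            ious.append(1.0 if union == 0 else inter / union)
        per_shape.append(np.mean(ious))       # average parts within a shape
    return float(np.mean(per_shape))          # then average over shapes
```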

Few-Shot Learning

In few-shot learning tasks on ModelNet40, PointRWKV exhibits robust performance, surpassing previous methods by up to 2.5% in accuracy, highlighting its capability to generalize effectively with limited data (Tab. 3).

Ablation Studies

The authors conduct comprehensive ablation studies to validate the contributions of various components in the model. The inclusion of the BQE function, the bidirectional attention mechanism, and the hierarchical multi-scale point processing are shown to significantly enhance model performance. Furthermore, the graph stabilizer within the local graph-based merging branch is essential for achieving optimal results (Tab. 4).

Implications and Future Directions

The PointRWKV model underscores the potential of RWKV-like architectures in efficiently handling 3D point cloud data by balancing computational complexity and performance. It sets a precedent for future research to explore such architectures for other 3D vision tasks, potentially extending beyond classification and segmentation to reconstruction and generation.

Conclusion

PointRWKV represents a significant advancement in hierarchical point cloud learning, providing a robust, scalable, and efficient alternative to transformer and Mamba-based models. The paper substantiates its claims through empirical evidence across multiple benchmarks, laying the groundwork for further exploration of RWKV models in diverse 3D vision applications.

In summary, PointRWKV introduces an innovative approach to point cloud learning that effectively balances accuracy and complexity, contributing a valuable perspective to the field of 3D vision and machine learning.
