
On Exploring PDE Modeling for Point Cloud Video Representation Learning (2404.04720v2)

Published 6 Apr 2024 in cs.CV

Abstract: Point cloud video representation learning is challenging because of the complex structure and unordered spatial arrangement of the data. Traditional methods struggle with frame-to-frame correlations and point-wise correspondence tracking. Recently, partial differential equations (PDEs) have offered a new perspective for modeling spatio-temporal data uniformly under given constraints. Because tracking tangible point correspondences remains difficult, we propose to formalize point cloud video representation learning as a PDE-solving problem. Inspired by fluid analysis, where PDEs describe how a spatial shape deforms over time, we use PDEs to model how spatial points vary under the influence of temporal information. By modeling spatio-temporal correlations, we aim to regularize spatial variations with temporal features and thereby enhance representation learning in point cloud videos. We introduce Motion PointNet, composed of a PointNet-like encoder and a PDE-solving module. First, we construct a lightweight yet effective encoder to model an initial state of the spatial variations. We then develop our PDE-solving module in a parameterized latent space, tailored to the spatio-temporal correlations inherent in point cloud video. The PDE-solving process is guided and refined by a contrastive learning structure, which reshapes the feature distribution and thereby improves the feature representation of point cloud video data. Our Motion PointNet achieves an accuracy of 97.52% on the MSRAction-3D dataset, surpassing the current state of the art in all aspects while consuming minimal resources (only 0.72M parameters and 0.82G FLOPs).
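The abstract describes the encoder only as "PointNet-like". As a rough illustration of that general idea, and not the authors' implementation, the sketch below applies a shared per-point MLP followed by a max-pool to each frame, which makes the frame feature independent of point order. All names (`encode_frame`, `encode_video`), the layer sizes, and the random weights are assumptions made for this example; the PDE-solving module and the contrastive objective are not modeled here.

```python
import numpy as np

def encode_frame(points, w1, w2):
    """Encode one frame of shape (N, 3) into a feature vector of shape (D,).

    Hypothetical PointNet-style encoder: the same two-layer MLP is applied
    to every point, then a max over points gives an order-invariant feature.
    """
    h = np.maximum(points @ w1, 0.0)  # shared layer 1 with ReLU, (N, 16)
    h = np.maximum(h @ w2, 0.0)       # shared layer 2 with ReLU, (N, 32)
    return h.max(axis=0)              # symmetric max-pool over points, (32,)

def encode_video(video, w1, w2):
    """Encode a point cloud video of shape (T, N, 3) into (T, D) features."""
    return np.stack([encode_frame(frame, w1, w2) for frame in video])

# Illustrative random weights and data (assumed sizes, not from the paper).
rng = np.random.default_rng(0)
w1 = rng.normal(size=(3, 16))
w2 = rng.normal(size=(16, 32))
video = rng.normal(size=(4, 128, 3))  # 4 frames, 128 points each

features = encode_video(video, w1, w2)

# Shuffling the points within every frame should leave the features unchanged.
shuffled = video[:, rng.permutation(128), :]
features_shuffled = encode_video(shuffled, w1, w2)
```

The max-pool is what handles the "unordered spatial arrangement" mentioned in the abstract: any symmetric reduction would do, and max is the choice made in the original PointNet.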

Authors (4)
  1. Zhuoxu Huang
  2. Zhenkun Fan
  3. Tao Xu
  4. Jungong Han
