VG4D: Vision-Language Model Goes 4D Video Recognition (2404.11605v1)

Published 17 Apr 2024 in cs.CV, cs.AI, and cs.RO

Abstract: Understanding the real world through point cloud video is a crucial aspect of robotics and autonomous driving systems. However, prevailing methods for 4D point cloud recognition have limitations due to sensor resolution, which leads to a lack of detailed information. Recent advances have shown that Vision-Language Models (VLM) pre-trained on web-scale text-image datasets can learn fine-grained visual concepts that can be transferred to various downstream tasks. However, effectively integrating VLMs into the domain of 4D point clouds remains an unresolved problem. In this work, we propose the Vision-Language Models Goes 4D (VG4D) framework to transfer VLM knowledge from visual-text pre-trained models to a 4D point cloud network. Our approach aligns the 4D encoder's representation with a VLM trained on large-scale image-text pairs, so that both share a common visual and text embedding space. By transferring the knowledge of the VLM to the 4D encoder and combining the two, our VG4D achieves improved recognition performance. To enhance the 4D encoder, we modernize the classic dynamic point cloud backbone and propose an improved version of PSTNet, im-PSTNet, which can efficiently model point cloud videos. Experiments demonstrate that our method achieves state-of-the-art performance for action recognition on both the NTU RGB+D 60 and NTU RGB+D 120 datasets. Code is available at https://github.com/Shark0-0/VG4D.
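
The abstract describes two ideas: aligning the 4D point cloud encoder's features with a VLM's joint visual-text space, and combining the two branches for recognition. The sketch below is a minimal, illustrative PyTorch rendering of that kind of CLIP-style alignment and late fusion; it is not the authors' released implementation, and the function names, temperature, and fusion weight `alpha` are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def alignment_loss(point_feats, text_feats, labels, temperature=0.07):
    # point_feats: (B, D) clip-level features from a 4D point cloud encoder (e.g. an im-PSTNet-style backbone)
    # text_feats:  (C, D) VLM text embeddings of the C action-class prompts
    # labels:      (B,)   ground-truth class indices
    point_feats = F.normalize(point_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    # Cosine-similarity logits between each point cloud video and each class prompt
    logits = point_feats @ text_feats.t() / temperature  # (B, C)
    return F.cross_entropy(logits, labels)

def fuse_predictions(logits_4d, logits_vlm, alpha=0.5):
    # Hypothetical late fusion of the aligned 4D branch and the VLM video branch
    probs = alpha * logits_4d.softmax(dim=-1) + (1 - alpha) * logits_vlm.softmax(dim=-1)
    return probs.argmax(dim=-1)
```

Under this reading, the alignment loss pulls the 4D encoder's outputs toward the VLM's text embeddings of class names during training, and at inference the two branches' class probabilities are averaged before taking the argmax.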
