Multi-Modality Co-Learning for Efficient Skeleton-based Action Recognition (2407.15706v6)

Published 22 Jul 2024 in cs.CV

Abstract: Skeleton-based action recognition has garnered significant attention due to the utilization of concise and resilient skeletons. Nevertheless, the absence of detailed body information in skeletons restricts performance, while other multimodal methods require substantial inference resources and are inefficient when using multimodal data during both training and inference stages. To address this and fully harness the complementary multimodal features, we propose a novel multi-modality co-learning (MMCL) framework by leveraging the multimodal LLMs as auxiliary networks for efficient skeleton-based action recognition, which engages in multi-modality co-learning during the training stage and keeps efficiency by employing only concise skeletons in inference. Our MMCL framework primarily consists of two modules. First, the Feature Alignment Module (FAM) extracts rich RGB features from video frames and aligns them with global skeleton features via contrastive learning. Second, the Feature Refinement Module (FRM) uses RGB images with temporal information and text instruction to generate instructive features based on the powerful generalization of multimodal LLMs. These instructive text features will further refine the classification scores and the refined scores will enhance the model's robustness and generalization in a manner similar to soft labels. Extensive experiments on NTU RGB+D, NTU RGB+D 120 and Northwestern-UCLA benchmarks consistently verify the effectiveness of our MMCL, which outperforms the existing skeleton-based action recognition methods. Meanwhile, experiments on UTD-MHAD and SYSU-Action datasets demonstrate the commendable generalization of our MMCL in zero-shot and domain-adaptive action recognition. Our code is publicly available at: https://github.com/liujf69/MMCL-Action.
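The abstract describes two training-time ideas: contrastive alignment of global skeleton features with RGB features (the Feature Alignment Module), and refining classification scores with instructive text features used like soft labels (the Feature Refinement Module). The sketch below is not the authors' implementation; it is a minimal illustration of those two losses under assumed shapes and names (skel_feats, rgb_feats, class_logits, text_scores are all placeholders for whatever the real backbone and multimodal LLM produce).

```python
# Minimal sketch of the two training-time losses suggested by the abstract.
# All tensor names, dimensions, and hyperparameters are illustrative assumptions,
# not the paper's actual code (see https://github.com/liujf69/MMCL-Action for that).
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(skel_feats, rgb_feats, temperature=0.07):
    """InfoNCE-style loss pulling matched skeleton/RGB pairs together (FAM-style)."""
    skel = F.normalize(skel_feats, dim=-1)           # (B, D)
    rgb = F.normalize(rgb_feats, dim=-1)             # (B, D)
    logits = skel @ rgb.t() / temperature            # (B, B) pairwise similarities
    targets = torch.arange(skel.size(0), device=skel.device)
    # Symmetric loss: skeleton-to-RGB and RGB-to-skeleton directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def refine_with_text_scores(class_logits, text_scores, temperature=4.0):
    """Treat text-derived scores as soft labels for the skeleton classifier (FRM-style)."""
    soft_targets = F.softmax(text_scores / temperature, dim=-1)    # (B, C)
    log_probs = F.log_softmax(class_logits / temperature, dim=-1)  # (B, C)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean")

# Example usage with random tensors standing in for real features.
B, D, C = 8, 256, 60
skel_feats = torch.randn(B, D, requires_grad=True)   # global skeleton features
rgb_feats = torch.randn(B, D)                        # RGB features from video frames
class_logits = torch.randn(B, C, requires_grad=True) # skeleton-branch class scores
text_scores = torch.randn(B, C)                      # scores from multimodal-LLM text features

loss = contrastive_alignment_loss(skel_feats, rgb_feats) \
     + refine_with_text_scores(class_logits, text_scores)
print(loss.item())
```

Because both losses attach only to the skeleton branch during training, inference can drop the RGB and text branches entirely, which is what keeps the method efficient at test time.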

Authors (3)
  1. Jinfu Liu (9 papers)
  2. Chen Chen (753 papers)
  3. Mengyuan Liu (72 papers)
Citations (3)
