A Unified Multimodal De- and Re-coupling Framework for RGB-D Motion Recognition (2211.09146v2)

Published 16 Nov 2022 in cs.CV and cs.MM

Abstract: Motion recognition is a promising direction in computer vision, but training video classification models is much harder than training image models due to insufficient data and large parameter counts. To get around this, some works explore multimodal cues from RGB-D data. Although these methods improve motion recognition to some extent, they remain sub-optimal in the following respects: (i) data augmentation, i.e., the scale of RGB-D datasets is still limited, and few efforts have been made to explore novel data augmentation strategies for videos; (ii) the optimization mechanism, i.e., a tightly space-time-entangled network structure makes spatiotemporal information modeling more challenging; and (iii) cross-modal knowledge fusion, i.e., the high similarity between multimodal representations leads to insufficient late fusion. To alleviate these drawbacks, this paper improves RGB-D-based motion recognition from both the data and the algorithm perspective. In more detail, we first introduce a novel video data augmentation method dubbed ShuffleMix, which acts as a supplement to MixUp and provides additional temporal regularization for motion recognition. Secondly, a Unified Multimodal De-coupling and multi-stage Re-coupling framework, termed UMDR, is proposed for video representation learning. Finally, a novel cross-modal Complement Feature Catcher (CFCer) is explored to mine potentially common features in multimodal information as an auxiliary fusion stream, improving the late-fusion results. The seamless combination of these novel designs yields a robust spatiotemporal representation and achieves better performance than state-of-the-art methods on four public motion datasets; notably, UMDR achieves an unprecedented improvement of +4.5% on the Chalearn IsoGD dataset. Our code is available at https://github.com/zhoubenjia/MotionRGBD-PAMI.
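The abstract only describes ShuffleMix as a MixUp-style augmentation that adds temporal regularization, so the sketch below is an illustration rather than the authors' exact algorithm. `mixup_video` implements the standard MixUp blend of two samples; `shufflemix_video` is a hypothetical variant that splices randomly ordered frames from a second clip into the first, mixing the labels in proportion to the borrowed frames. The function names, clip shapes, and Beta(alpha, alpha) sampling are assumptions for the example, not taken from the paper.

```python
import torch

def mixup_video(clip_a, clip_b, label_a, label_b, alpha=0.8):
    """Standard MixUp applied frame-wise to two clips of shape (T, C, H, W);
    labels are one-hot vectors."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    mixed_clip = lam * clip_a + (1.0 - lam) * clip_b
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_clip, mixed_label

def shufflemix_video(clip_a, clip_b, label_a, label_b, alpha=0.8):
    """Hypothetical ShuffleMix-style augmentation: instead of blending pixel
    values, replace a fraction of clip_a's frames with temporally shuffled
    frames from clip_b, so temporal structure is perturbed and the label mix
    reflects the fraction of borrowed frames. Illustrative only."""
    T = clip_a.shape[0]
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    n_swap = int(round((1.0 - lam) * T))
    mixed_clip = clip_a.clone()
    if n_swap > 0:
        # Randomly chosen target positions receive randomly ordered frames
        # from the second clip.
        dst = torch.randperm(T)[:n_swap]
        src = torch.randperm(T)[:n_swap]
        mixed_clip[dst] = clip_b[src]
    lam_eff = 1.0 - n_swap / T
    mixed_label = lam_eff * label_a + (1.0 - lam_eff) * label_b
    return mixed_clip, mixed_label
```

In practice such mixing would be applied across pairs within a mini-batch during training; the per-sample functions above just keep the idea readable.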

Authors (5)
  1. Benjia Zhou (12 papers)
  2. Pichao Wang (65 papers)
  3. Jun Wan (79 papers)
  4. Yanyan Liang (29 papers)
  5. Fan Wang (312 papers)
Citations (11)