Video Action Recognition Collaborative Learning with Dynamics via PSO-ConvNet Transformer (2302.09187v3)

Published 17 Feb 2023 in cs.CV

Abstract: Recognizing human actions in video sequences, known as Human Action Recognition (HAR), is a challenging task in pattern recognition. While Convolutional Neural Networks (ConvNets) have shown remarkable success in image recognition, they are not always directly applicable to HAR, as temporal features are critical for accurate classification. In this paper, we propose a novel dynamic PSO-ConvNet model for learning actions in videos, building on our recent work in image recognition. Our approach leverages a framework in which the weight vector of each neural network represents the position of a particle in phase space, and particles share their current weight vectors and gradient estimates of the loss function. To extend the approach to video, we integrate ConvNets with state-of-the-art temporal methods such as Transformers and Recurrent Neural Networks. Experimental results on the UCF-101 dataset demonstrate substantial accuracy improvements of up to 9%, confirming the effectiveness of the proposed method. In addition, experiments on larger and more varied datasets, including Kinetics-400 and HMDB-51, favor Collaborative Learning over Non-Collaborative (Individual) Learning. Overall, our dynamic PSO-ConvNet model provides a promising direction for improving HAR by better capturing the spatio-temporal dynamics of human actions in videos. The code is available at https://github.com/leonlha/Video-Action-Recognition-Collaborative-Learning-with-Dynamics-via-PSO-ConvNet-Transformer.
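The abstract's core mechanism (each network's flattened weight vector is a particle's position in phase space, and particles share both weight vectors and gradient estimates of the loss) can be illustrated with a minimal sketch. This is not the authors' implementation; the function name, hyperparameters, and the specific way the shared gradient term enters the velocity update are illustrative assumptions on top of the standard PSO velocity rule.

```python
import numpy as np

def pso_collaborative_step(positions, velocities, personal_best, global_best,
                           loss_grads, w=0.7, c1=1.5, c2=1.5, lr=0.01, rng=None):
    """One collaborative PSO update (hypothetical sketch).

    Each row of `positions` is one network's flattened weight vector,
    i.e. one particle. The velocity blends the classic PSO attraction
    toward personal/global bests with a shared gradient-descent term,
    standing in for the gradient information particles exchange.
    """
    rng = np.random.default_rng() if rng is None else rng
    r1 = rng.random(positions.shape)  # stochastic cognitive coefficient
    r2 = rng.random(positions.shape)  # stochastic social coefficient
    velocities = (w * velocities
                  + c1 * r1 * (personal_best - positions)
                  + c2 * r2 * (global_best - positions)
                  - lr * loss_grads)  # shared gradient estimates of the loss
    return positions + velocities, velocities
```

In a real training loop each particle would be a full ConvNet-Transformer model, `loss_grads` would come from backpropagation on a video mini-batch, and the personal/global bests would be tracked by validation accuracy rather than a closed-form loss.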
