An Animation-based Augmentation Approach for Action Recognition from Discontinuous Video (2404.06741v4)
Abstract: Action recognition is an essential component of computer vision and plays a pivotal role in many applications. Despite the significant improvements brought by Convolutional Neural Networks (CNNs), these models suffer performance declines when trained on discontinuous video frames, a frequent scenario in real-world settings. The decline results primarily from the loss of temporal continuity, which is crucial for understanding the semantics of human actions. To overcome this issue, we introduce the 4A (Action Animation-based Augmentation Approach) pipeline. It starts with 2D human pose estimation from RGB videos, then applies a Quaternion-based Graph Convolution Network for joint orientation and trajectory prediction, and finally uses Dynamic Skeletal Interpolation with game engine technology to create smoother, more diverse actions. The pipeline generates realistic animations in varied game environments, rendered from multiple viewpoints, and in this way bridges the domain gap between virtual and real-world data. In experimental evaluations, the 4A pipeline achieves performance comparable or even superior to traditional training on real-world data while requiring only 10% of the original data volume. Our approach also demonstrates improved performance on in-the-wild videos.
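The abstract does not say how Dynamic Skeletal Interpolation is implemented. As a general illustration of how missing frames can be filled in when joint orientations are stored as quaternions, the sketch below uses standard spherical linear interpolation (slerp). The function names `slerp` and `fill_missing_frames`, the `(w, x, y, z)` component order, and the per-joint treatment are assumptions made for this example, not details taken from the paper.

```python
import math


def _normalize(q):
    """Scale a quaternion (w, x, y, z) to unit length."""
    n = math.sqrt(sum(c * c for c in q))
    return [c / n for c in q]


def slerp(q0, q1, t):
    """Spherical linear interpolation between two quaternions, t in [0, 1]."""
    q0, q1 = _normalize(q0), _normalize(q1)
    dot = sum(a * b for a, b in zip(q0, q1))
    # q and -q encode the same rotation; flip one to follow the shorter arc.
    if dot < 0.0:
        q1 = [-c for c in q1]
        dot = -dot
    # Nearly identical orientations: normalized linear interpolation
    # avoids dividing by a vanishing sin(theta).
    if dot > 0.9995:
        return _normalize([a + t * (b - a) for a, b in zip(q0, q1)])
    theta = math.acos(dot)
    s = math.sin(theta)
    w0 = math.sin((1.0 - t) * theta) / s
    w1 = math.sin(t * theta) / s
    return [w0 * a + w1 * b for a, b in zip(q0, q1)]


def fill_missing_frames(q_start, q_end, n_missing):
    """Hypothetical helper: orientations for n_missing frames that were
    dropped between two observed frames of a single joint."""
    return [
        slerp(q_start, q_end, (i + 1) / (n_missing + 1))
        for i in range(n_missing)
    ]


if __name__ == "__main__":
    identity = [1.0, 0.0, 0.0, 0.0]
    # 90-degree rotation about the z-axis.
    quarter_turn = [math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4)]
    for q in fill_missing_frames(identity, quarter_turn, 3):
        print([round(c, 4) for c in q])
```

Interpolating on the unit sphere keeps every in-between orientation a valid rotation with constant angular velocity, which componentwise linear interpolation of the quaternion does not guarantee.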