GTAutoAct: An Automatic Datasets Generation Framework Based on Game Engine Redevelopment for Action Recognition (2401.13414v1)
Abstract: Current datasets for action recognition tasks face limitations stemming from traditional collection and generation methods, including a constrained range of action classes, the absence of multi-viewpoint recordings, limited diversity, poor video quality, and labor-intensive manual collection. To address these challenges, we introduce GTAutoAct, an innovative dataset generation framework leveraging game engine technology to facilitate advancements in action recognition. GTAutoAct excels in automatically creating large-scale, well-annotated datasets with extensive action classes and superior video quality. Our framework's distinctive contributions encompass: (1) it innovatively transforms readily available coordinate-based 3D human motion into a rotation-oriented representation better suited to multiple viewpoints; (2) it employs dynamic segmentation and interpolation of rotation sequences to create smooth and realistic action animations; (3) it offers extensively customizable animation scenes; (4) it implements an autonomous video capture and processing pipeline, featuring a randomly navigating camera with auto-trimming and labeling functionalities. Experimental results underscore the framework's robustness and highlight its potential to significantly improve action recognition model training.
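The interpolation step in contribution (2) can be sketched as follows. The paper does not specify its rotation representation or interpolation method; this is a minimal illustration assuming unit quaternions and spherical linear interpolation (slerp), with all function names chosen for this example.

```python
# Hedged sketch: smoothing a sparse rotation sequence into a dense animation,
# in the spirit of GTAutoAct's rotation-sequence interpolation. Assumes unit
# quaternions in (x, y, z, w) order; names are illustrative, not from the paper.
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions q0 and q1."""
    q0, q1 = np.asarray(q0, float), np.asarray(q1, float)
    dot = np.dot(q0, q1)
    if dot < 0.0:                # flip sign to take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:             # nearly parallel: linear interpolation is stable
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    s = np.sin(theta)
    return (np.sin((1.0 - t) * theta) * q0 + np.sin(t * theta) * q1) / s

def interpolate_sequence(keyframes, frames_between):
    """Expand keyframe quaternions into a dense, smooth rotation sequence."""
    out = []
    for a, b in zip(keyframes[:-1], keyframes[1:]):
        for i in range(frames_between):
            out.append(slerp(a, b, i / frames_between))
    out.append(np.asarray(keyframes[-1], float))
    return np.array(out)

# Example: interpolate from identity to a 90-degree rotation about Z
identity = [0.0, 0.0, 0.0, 1.0]
quarter_z = [0.0, 0.0, np.sin(np.pi / 4), np.cos(np.pi / 4)]
seq = interpolate_sequence([identity, quarter_z], 4)
```

Slerp keeps each intermediate quaternion on the unit sphere and traces the shortest great-circle arc, which avoids the speed artifacts of naive per-component interpolation between joint rotations.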