S3Aug: Segmentation, Sampling, and Shift for Action Recognition (2310.14556v1)
Abstract: Action recognition is a well-established area of research in computer vision. In this paper, we propose S3Aug, a video data augmentation method for action recognition. Unlike conventional video data augmentation methods that involve cutting and pasting regions from two videos, the proposed method generates new videos from a single training video through segmentation and label-to-image transformation. Furthermore, the proposed method modifies certain categories of label images by sampling to generate a variety of videos, and shifts intermediate features to enhance the temporal coherency between frames of the generated videos. Experimental results on the UCF101, HMDB51, and Mimetics datasets demonstrate the effectiveness of the proposed method, particularly for out-of-context videos of the Mimetics dataset.
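The sampling and shift steps named in the abstract can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the notion of a "protected" category, the candidate category list, and the specific shift rule (a zero-padded channel shift of the kind used in temporal shift modules) are all assumptions made for illustration. The abstract does not say which categories are resampled or how the features are shifted.

```python
import random


def resample_labels(label_frames, protected, candidates, rng):
    """Remap every non-protected category id to a randomly drawn candidate id.

    label_frames: list of frames, each a 2D list of integer category ids.
    protected:    set of ids left unchanged (hypothetical, e.g. a person id).
    candidates:   ids to draw replacements from (hypothetical).
    One mapping is shared by all frames, so a region keeps the same
    category throughout the clip.
    """
    present = sorted({c for frame in label_frames for row in frame for c in row})
    mapping = {c: (c if c in protected else rng.choice(candidates)) for c in present}
    return [[[mapping[c] for c in row] for row in frame] for frame in label_frames]


def temporal_shift(features, fold):
    """Shift part of the channels along the time axis with zero padding.

    features[t][c] is the value of channel c at frame t.
    Channels [0, fold) take their value from frame t-1,
    channels [fold, 2*fold) take it from frame t+1,
    and the remaining channels are left as they are.
    """
    num_frames = len(features)
    num_channels = len(features[0])
    out = [[0.0] * num_channels for _ in range(num_frames)]
    for t in range(num_frames):
        for c in range(num_channels):
            if c < fold:
                src = t - 1
            elif c < 2 * fold:
                src = t + 1
            else:
                src = t
            if 0 <= src < num_frames:
                out[t][c] = features[src][c]
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    labels = [[[0, 1], [2, 1]], [[0, 2], [2, 2]]]  # two 2x2 label frames
    print(resample_labels(labels, protected={0}, candidates=[5, 6, 7], rng=rng))
    feats = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # 3 frames, 3 channels
    print(temporal_shift(feats, fold=1))
```

In a full pipeline the resampled label frames would be passed to a label-to-image generator, and the shift would be applied to that generator's intermediate feature maps rather than to scalar channels as in this toy example.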