Learning Group Activity Features Through Person Attribute Prediction (2403.02753v2)
Abstract: This paper proposes Group Activity Feature (GAF) learning, in which features of multi-person activity are learned as a compact latent vector. Unlike prior work, which requires manual annotation of group activities for supervised learning, our method learns the GAF through person attribute prediction without group activity annotations. Because the whole network is trained end-to-end so that the GAF is required for predicting the attributes of each person in a group, the GAF learns to encode the features of the multi-person activity. As person attributes, we propose to use a person's action class and appearance features: the former is easy to annotate because of its simplicity, and the latter requires no manual annotation. In addition, we introduce location-guided attribute prediction to disentangle the complex GAF so that the features of each target person can be extracted properly. Experimental results on two public datasets validate that our method outperforms state-of-the-art methods both quantitatively and qualitatively. Visualization of our GAF also demonstrates that our method learns a GAF that represents fine-grained group activity classes. Code: https://github.com/chihina/GAFL-CVPR2024.
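To make the training signal described in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a single GAF vector is concatenated with one person's location and passed through linear heads that predict that person's action class and appearance feature. The dimensions, the concatenation, the linear heads, the equal weighting of the two losses, and every name in the code are hypothetical choices made for illustration. Only the loss computation is shown; no gradient step is performed.

```python
import math
import random

random.seed(0)

# Hypothetical sizes, chosen only for this illustration.
GAF_DIM, LOC_DIM, N_ACTIONS, APP_DIM = 8, 2, 3, 4


def linear(weights, x):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def init(rows, cols):
    """Small random weight matrix standing in for learned parameters."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


# Assumed prediction heads: one for action logits, one for appearance features.
W_ACTION = init(N_ACTIONS, GAF_DIM + LOC_DIM)
W_APPEAR = init(APP_DIM, GAF_DIM + LOC_DIM)


def predict_attributes(gaf, location):
    """Location-guided prediction: query the shared GAF with one person's location."""
    query = gaf + location  # list concatenation (an assumed fusion scheme)
    return linear(W_ACTION, query), linear(W_APPEAR, query)


def cross_entropy(logits, label):
    """Numerically stable softmax cross-entropy for a single example."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def attribute_loss(gaf, people):
    """Average of action loss plus appearance loss over all people in the group."""
    total = 0.0
    for person in people:
        logits, appearance = predict_attributes(gaf, person["location"])
        total += cross_entropy(logits, person["action"])
        total += mse(appearance, person["appearance"])
    return total / len(people)


# Toy group: the GAF would come from an encoder in practice; here it is random.
gaf = [random.gauss(0.0, 1.0) for _ in range(GAF_DIM)]
people = [
    {"location": [0.2, 0.5], "action": 1, "appearance": [0.1, 0.0, 0.3, 0.2]},
    {"location": [0.7, 0.4], "action": 0, "appearance": [0.0, 0.2, 0.1, 0.4]},
]
loss = attribute_loss(gaf, people)
print(f"attribute prediction loss: {loss:.4f}")
```

Because every person's attributes are predicted from the same GAF, and only the location query differs, minimizing such a loss would push group-level information into the GAF. That is the intuition the abstract describes; the actual architecture is in the linked repository.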