
Learning Group Activity Features Through Person Attribute Prediction (2403.02753v2)

Published 5 Mar 2024 in cs.CV

Abstract: This paper proposes Group Activity Feature (GAF) learning in which features of multi-person activity are learned as a compact latent vector. Unlike prior work in which the manual annotation of group activities is required for supervised learning, our method learns the GAF through person attribute prediction without group activity annotations. By learning the whole network in an end-to-end manner so that the GAF is required for predicting the person attributes of people in a group, the GAF is trained as the features of multi-person activity. As a person attribute, we propose to use a person's action class and appearance features because the former is easy to annotate due to its simplicity, and the latter requires no manual annotation. In addition, we introduce a location-guided attribute prediction to disentangle the complex GAF for extracting the features of each target person properly. Various experimental results validate that our method outperforms SOTA methods quantitatively and qualitatively on two public datasets. Visualization of our GAF also demonstrates that our method learns the GAF representing fine-grained group activity classes. Code: https://github.com/chihina/GAFL-CVPR2024.


Summary

  • The paper introduces person attribute prediction as a proxy task for learning group activity features without group activity annotations.
  • It integrates an end-to-end framework with location-guided predictions to capture fine-grained group dynamics in complex scenes.
  • Experimental results demonstrate that the method outperforms state-of-the-art GAR techniques, offering a scalable solution for understanding group interactions.

Learning Group Activity Features Through Person Attribute Prediction

Introduction

Group Activity Recognition (GAR) remains a central challenge in computer vision, especially in scenarios involving multiple participants, such as sports events, social gatherings, and surveillance footage. Traditional approaches to GAR rely heavily on supervised learning and therefore need large amounts of manually annotated group activity data. The complexity and subtlety of group dynamics, together with the labor-intensive annotation process, make it hard to scale these approaches and to learn nuanced group activity features.

To address these challenges, Nakatani et al. propose a method for learning Group Activity Features (GAF) that uses person attribute prediction as a proxy task. By predicting the attributes of each individual in a group (an action class or appearance features), the system learns a compact representation of the group activity without requiring group activity annotations. This indirect approach sidesteps the cost of annotating group activities directly.

Methodology

The proposed method takes a multi-part approach to GAF learning; its novelty lies in using person attribute prediction for implicit feature learning. The methodology revolves around several core components:

  • Person Attribute Prediction as a Proxy Task: Unlike traditional GAR approaches, this method relies on predicting individual attributes (e.g., person action classes and appearance features) within a group context. This prediction task indirectly forces the model to learn group activity features beneficial for understanding group dynamics without manual annotations of the group activity itself.
  • End-to-End Learning Framework: By integrating attribute prediction into an end-to-end learning system, the method ensures that the GAF is optimized for capturing relevant group activity information as required for the proxy task. This integration of tasks streamlines the learning process, enhancing the efficiency and efficacy of GAF acquisition.
  • Location-Guided Attribute Prediction: Because a single GAF encodes the activity of the whole group, the approach uses each target person's location to guide attribute prediction. This disentangles the complex GAF so that the features of each target person can be extracted properly, giving a finer representation of group dynamics.
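
The components above can be illustrated with a minimal sketch of the proxy objective. This is not the authors' implementation (their code is at the repository linked in the abstract). The linear prediction heads, the concatenation of the GAF with a normalized (x, y) location, the equal weighting of the two loss terms, and every name and dimension below are assumptions chosen to keep the example small.

```python
import math
import random

# Hypothetical sizes, chosen only for illustration.
GAF_DIM, NUM_ACTIONS, APPEARANCE_DIM = 8, 3, 4


def init_linear(rng, out_dim, in_dim):
    """Return (weights, bias) for a small randomly initialised linear layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(weights, bias, x):
    """Compute W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_attributes(gaf, location, action_head, appearance_head):
    """Query the shared GAF with one person's location (assumed: concatenation)."""
    query = list(gaf) + list(location)
    action_probs = softmax(linear(*action_head, query))
    appearance = linear(*appearance_head, query)
    return action_probs, appearance


def proxy_loss(gaf, people, action_head, appearance_head):
    """Average per-person action cross-entropy plus appearance squared error."""
    total = 0.0
    for person in people:
        probs, appearance = predict_attributes(
            gaf, person["location"], action_head, appearance_head
        )
        cross_entropy = -math.log(probs[person["action"]])
        squared_error = sum(
            (a - t) ** 2 for a, t in zip(appearance, person["appearance"])
        ) / len(appearance)
        total += cross_entropy + squared_error  # equal weighting is an assumption
    return total / len(people)


rng = random.Random(0)
gaf = [rng.uniform(-1, 1) for _ in range(GAF_DIM)]  # stands in for the encoder output
action_head = init_linear(rng, NUM_ACTIONS, GAF_DIM + 2)
appearance_head = init_linear(rng, APPEARANCE_DIM, GAF_DIM + 2)
people = [
    {"location": (0.2, 0.7), "action": 0, "appearance": [0.1, 0.0, -0.3, 0.5]},
    {"location": (0.8, 0.4), "action": 2, "appearance": [0.4, -0.2, 0.0, 0.1]},
]
loss = proxy_loss(gaf, people, action_head, appearance_head)
print(f"proxy loss: {loss:.4f}")
```

Since the same GAF vector is shared by every person in the scene, the location is the only input that lets the heads produce different predictions for different people. In an end-to-end setup, gradients of this loss would flow back into whatever network produces the GAF, which is what pushes the GAF to encode the group as a whole.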

Results and Impact

Experiments on two public datasets show that the proposed method outperforms state-of-the-art methods both quantitatively and qualitatively. Visualizations of the learned GAF further indicate that it represents fine-grained group activity classes, capturing subtle differences between group activities.

The implications of such a methodology extend far beyond the immediate gains in performance metrics. By fundamentally shifting how group activities are learned, this research opens up new avenues for understanding complex social interactions in visual data. The indirect learning approach can significantly reduce the annotation burden, making GAR more accessible and applicable across various domains.

Future Directions

While the proposed method marks a significant advancement in GAR, the journey towards fully understanding group activities continues. Future research could explore additional proxy tasks that further encapsulate the nuances of group dynamics or investigate the integration of unsupervised learning techniques for even more scalable solutions. Moreover, expanding the application of such methodologies to diverse domains could catalyze breakthroughs in social robotics, crowd management, and interactive entertainment, to name a few.

Conclusion

In summary, Nakatani et al.'s method for learning Group Activity Features through person attribute prediction presents a promising shift in the landscape of group activity recognition. By alleviating the need for manual group activity annotations and leveraging indirect learning mechanisms, this research not only enhances the current capabilities in GAR but also charts a course for future innovations in the field.
