REACT: Recognize Every Action Everywhere All At Once (2312.00188v1)

Published 27 Nov 2023 in cs.CV

Abstract: Group Activity Recognition (GAR) is a fundamental problem in computer vision, with diverse applications in sports video analysis, video surveillance, and social scene understanding. Unlike conventional action recognition, GAR aims to classify the actions of a group of individuals as a whole, requiring a deep understanding of their interactions and spatiotemporal relationships. To address the challenges in GAR, we present REACT (Recognize Every Action Everywhere All At Once), a novel architecture inspired by the transformer encoder-decoder model, explicitly designed to model complex contextual relationships within videos, including multi-modal and spatiotemporal features. Our architecture features a cutting-edge Vision-Language Encoder block for integrated temporal, spatial, and multi-modal interaction modeling. This component efficiently encodes spatiotemporal interactions, even with sparsely sampled frames, and recovers essential local information. Our Action Decoder block refines the joint understanding of text and video data, allowing us to precisely retrieve bounding boxes and strengthening the link between semantics and visual reality. At the core, our Actor Fusion block orchestrates a fusion of actor-specific data and textual features, striking a balance between specificity and context. Our method outperforms state-of-the-art GAR approaches in extensive experiments, demonstrating superior accuracy in recognizing and understanding group activities. Our architecture's potential extends to diverse real-world applications, offering empirical evidence of its performance gains. This work significantly advances the field of group activity recognition, providing a robust framework for nuanced scene comprehension.
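The data flow the abstract describes (cross-modal encoding, query-based decoding of actor boxes, and actor-text fusion) can be sketched at a very high level. This is an illustrative toy in NumPy, not the paper's implementation: the dimensions, head names (`W_box`), single-head attention, and mean-pooling fusion are all simplifying assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention (single head, for illustration).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

rng = np.random.default_rng(0)
d = 16                                   # embedding dimension (illustrative)
video_tokens = rng.normal(size=(20, d))  # sparsely sampled frame features
text_tokens = rng.normal(size=(5, d))    # activity-label text embeddings

# Vision-Language Encoder (sketch): video tokens attend to text tokens
# (cross-modal), then to themselves (spatiotemporal self-attention).
fused = attention(video_tokens, text_tokens, text_tokens)
encoded = attention(fused, fused, fused)

# Action Decoder (sketch): learned actor queries attend to the encoded
# memory; a small regression head then predicts normalized boxes.
num_actors = 4
queries = rng.normal(size=(num_actors, d))
actor_feats = attention(queries, encoded, encoded)
W_box = rng.normal(size=(d, 4))                   # hypothetical box head
boxes = 1 / (1 + np.exp(-(actor_feats @ W_box)))  # sigmoid -> [0, 1] coords

# Actor Fusion (sketch): pool actor features and mix in text context
# before a final group-activity classifier would be applied.
group_feat = actor_feats.mean(axis=0) + text_tokens.mean(axis=0)

print(boxes.shape)       # (4, 4): one (cx, cy, w, h) box per actor query
print(group_feat.shape)  # (16,): pooled group-level representation
```

The sketch only shows tensor shapes moving through the three named blocks; the actual model adds multi-head attention, positional encodings, and matching losses (e.g., GIoU for boxes) that are omitted here.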
