HAtt-Flow: Hierarchical Attention-Flow Mechanism for Group Activity Scene Graph Generation in Videos (2312.07740v1)
Abstract: Group Activity Scene Graph (GASG) generation is a challenging task in computer vision that aims to anticipate and describe relationships between subjects and objects in video sequences. Traditional Video Scene Graph Generation (VidSGG) methods focus on retrospective analysis, which limits their predictive capabilities. To enrich scene understanding, we introduce a GASG dataset that extends the JRDB dataset with nuanced annotations covering Appearance, Interaction, Position, Relationship, and Situation attributes. This work also introduces the Hierarchical Attention-Flow (HAtt-Flow) mechanism, an approach rooted in flow network theory, to improve GASG performance. Flow-Attention incorporates flow conservation principles, fostering competition among sources and allocation among sinks, which prevents the generation of trivial attention. Our approach offers a different perspective on attention mechanisms, in which the conventional "values" and "keys" are transformed into sources and sinks, respectively, yielding a new framework for attention-based models. Through extensive experiments, we demonstrate the effectiveness of our HAtt-Flow model and the superiority of the proposed Flow-Attention mechanism. This work advances predictive video scene understanding and provides insights and techniques for applications that require real-time relationship prediction in video data.
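The abstract does not spell out how Flow-Attention is computed, so the following is only a minimal sketch of one way a flow-conservation attention layer could look, not the HAtt-Flow implementation. Everything in it is an assumption: the function names are hypothetical, a sigmoid is used as the non-negative feature map, the key/value rows are treated as sources, and the output positions (indexed by the queries) are treated as sinks. The paper's exact assignment of roles and its hierarchical structure may differ.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(rows, weights=None):
    """Weighted sum of row vectors (plain sum when weights is None)."""
    if weights is None:
        weights = [1.0] * len(rows)
    return [sum(w * r[j] for w, r in zip(weights, rows))
            for j in range(len(rows[0]))]

def flow_attention(q, k, v, eps=1e-6):
    """Hypothetical flow-style attention.
    q: n sink vectors (dim d); k: m source vectors (dim d); v: m values (dim dv).
    """
    # Non-negative feature maps, so every flow capacity is positive (assumption).
    q = [[sigmoid(x) for x in row] for row in q]
    k = [[sigmoid(x) for x in row] for row in k]
    # Incoming flow of each sink and outgoing flow of each source.
    incoming = [dot(qi, weighted_sum(k)) + eps for qi in q]
    outgoing = [dot(kj, weighted_sum(q)) + eps for kj in k]
    # Conservation: normalize the opposite side, then re-measure each flow.
    k_norm = weighted_sum(k, [1.0 / o for o in outgoing])
    q_norm = weighted_sum(q, [1.0 / i for i in incoming])
    conserved_in = [dot(qi, k_norm) for qi in q]
    conserved_out = [dot(kj, q_norm) for kj in k]
    # Competition among sources (softmax) and allocation to sinks (sigmoid).
    top = max(conserved_out)
    exps = [math.exp(c - top) for c in conserved_out]
    competition = [len(k) * e / sum(exps) for e in exps]
    allocation = [sigmoid(c) for c in conserved_in]
    # Aggregate reweighted source values once, then read them out per sink.
    d, dv, m = len(k[0]), len(v[0]), len(k)
    kv = [[sum(k[j][a] * competition[j] * v[j][b] for j in range(m))
           for b in range(dv)] for a in range(d)]
    out = [[alloc / inc * sum(qi[a] * kv[a][b] for a in range(d))
            for b in range(dv)]
           for qi, inc, alloc in zip(q, incoming, allocation)]
    return out, competition, allocation
```

In this sketch the softmax makes sources compete for a fixed total weight, while the sigmoid gates how much each sink accepts; that pairing is the assumed reading of "competition for sources and allocation for sinks" in the abstract.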