Overview of Spatially Conditioned Graphs for Detecting Human-Object Interactions
Detecting human-object interactions (HOIs) is a complex challenge, requiring precise recognition of interactions between specific humans and objects within images. The paper "Spatially Conditioned Graphs for Detecting Human–Object Interactions" addresses this by introducing a novel approach based on graph neural networks. The key innovation is spatial conditioning: messages passed within the network are differentiated according to the spatial relationships between nodes. This is particularly important for refining HOI detection by combining appearance features with spatial information, especially given the many non-interactive human-object pairs present in typical images.
Methodology and Contributions
The researchers propose a bipartite graph structure that distinguishes between human and object nodes and employs spatially conditioned message passing. This anisotropic scheme allows messages exchanged between human and object nodes to incorporate spatial context, rather than relying solely on appearance features. The primary contributions of this paper are twofold:
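The idea of anisotropic, spatially conditioned message passing can be illustrated with a minimal sketch. This is not the paper's implementation: the spatial encoding, the gating mechanism, and all parameter names (`spatial_encoding`, `W_app`, `W_sp`, etc.) are simplified assumptions made here for illustration. The point is that the same sender node produces different messages for different receivers, because each message is modulated by the pairwise box geometry.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def spatial_encoding(h_box, o_box):
    """Hypothetical pairwise spatial feature: normalised box offsets and scales.
    Boxes are (x, y, w, h)."""
    hx, hy, hw, hh = h_box
    ox, oy, ow, oh = o_box
    return np.array([(ox - hx) / hw, (oy - hy) / hh, ow / hw, oh / hh])

D = 8  # appearance feature dimension (illustrative)
W_app = rng.standard_normal((D, D)) * 0.1  # transforms sender appearance
W_sp = rng.standard_normal((D, 4)) * 0.1   # projects geometry to a gate

def message(x_src, s_pair):
    """Anisotropic message: appearance content modulated by a spatial gate,
    so the message depends on the specific human-object pair, not just the sender."""
    gate = 1.0 / (1.0 + np.exp(-W_sp @ s_pair))  # sigmoid gate from geometry
    return relu(W_app @ x_src) * gate

# One human node and two object nodes with different geometry.
h_feat = rng.standard_normal(D)
o_feats = [rng.standard_normal(D), rng.standard_normal(D)]
h_box = np.array([10.0, 10.0, 40.0, 80.0])
o_boxes = [np.array([30.0, 40.0, 20.0, 20.0]),
           np.array([200.0, 10.0, 30.0, 30.0])]

# Aggregate object-to-human messages and update the human node.
msgs = [message(o, spatial_encoding(h_box, b)) for o, b in zip(o_feats, o_boxes)]
h_updated = relu(h_feat + np.mean(msgs, axis=0))
print(h_updated.shape)  # (8,)
```

In the actual model the appearance features come from a pretrained object detector and the updates run over a full bipartite graph for several iterations; this sketch only shows why spatial conditioning makes the message function direction- and pair-dependent.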
- Spatial Conditioning Across Graph Components: The method integrates spatial conditioning during the computation of adjacency structures, message passing, and refinement of graph features. This comprehensive approach ensures that the interaction knowledge utilized by the model is nuanced and contextually aware.
- Empirical Validation with Multi-Branch Fusion: The paper introduces a multi-branch fusion module for combining spatial and appearance features, enhancing message expressiveness and improving interaction classification. The utility of this technique is demonstrated through extensive experiments on established datasets.
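The multi-branch fusion idea can be sketched as follows. This is a simplified assumption of how such a module might look, not the paper's exact architecture: the branch count, projection sizes, and the multiplicative combination are illustrative choices, and the parameters are randomly initialised rather than learned.

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(x):
    return np.maximum(x, 0.0)

# Illustrative sizes: appearance dim, spatial dim, output dim, branch count.
D_APP, D_SP, D_OUT, BRANCHES = 16, 4, 16, 4

# Per-branch projections (hypothetical, randomly initialised here).
Wa = [rng.standard_normal((D_OUT, D_APP)) * 0.1 for _ in range(BRANCHES)]
Ws = [rng.standard_normal((D_OUT, D_SP)) * 0.1 for _ in range(BRANCHES)]

def multi_branch_fusion(app, sp):
    """Fuse appearance and spatial features: each branch projects both inputs
    into a common space and combines them multiplicatively; the branch
    outputs are then averaged into one fused feature."""
    fused = [relu(Wa[k] @ app) * relu(Ws[k] @ sp) for k in range(BRANCHES)]
    return np.mean(fused, axis=0)

app_feat = rng.standard_normal(D_APP)  # e.g. pooled appearance of a pair
sp_feat = rng.standard_normal(D_SP)    # e.g. encoded box-pair geometry
out = multi_branch_fusion(app_feat, sp_feat)
print(out.shape)  # (16,)
```

The multiplicative interaction lets the spatial branch suppress or amplify individual appearance channels per pair, which is one way to make fused messages more expressive than simple concatenation.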
Evaluation and Results
The empirical results are significant: the proposed method achieves substantial gains over state-of-the-art models on the HICO-DET and V-COCO datasets, particularly when leveraging fine-tuned detections. Specifically, the approach achieves a mean average precision (mAP) of 31.33% on HICO-DET and 54.2% on V-COCO, demonstrating its effectiveness in accurately detecting HOIs. Notably, the performance improvements are more pronounced with high-quality detections, highlighting the method's ability to exploit better input data.
Implications and Future Directions
This research highlights that as the quality of detection boxes improves, spatial information plays an increasingly crucial role relative to coarse appearance features. The practical implication is that robust detector outputs can significantly enhance interaction detection performance. The theoretical implication is an evolving understanding of how contextual and spatial relationships should be encoded in model architectures, broadening potential applications in visual scene understanding and beyond.
For future developments, the integration of depth information might further enhance the model's ability to discern complex spatial configurations, potentially addressing challenges faced when dealing with densely packed objects or subtle spatial nuances. Additionally, extending this framework to other domains such as video analytics, where temporal dynamics play a crucial role, could open new avenues for real-time interactive scene analysis.
In summary, this paper takes a significant step toward improving the detection of human-object interactions by effectively harnessing spatial conditioning within graph-based models, setting a precedent for future advancements in computer vision.