- The paper introduces InteractNet, a human-centric model that significantly advances HOI detection by predicting action-specific density over object locations.
- It leverages an end-to-end architecture based on Faster R-CNN, integrating object detection with interaction inference using human appearance cues.
- Experimental evaluations on V-COCO and HICO-DET report 26% and 27% relative improvements in role AP, respectively, demonstrating substantial gains in interaction detection accuracy.
Detecting and Recognizing Human-Object Interactions
Overview
The paper by Georgia Gkioxari, Ross Girshick, Piotr Dollár, and Kaiming He at Facebook AI Research addresses a pivotal problem in computer vision: detecting and recognizing human-object interactions (HOIs). Their work emphasizes that a visual system must not only identify individual objects but also understand how they interact, with a particular focus on human involvement.
Key Contributions
InteractNet Model: The authors introduce InteractNet, a model that employs a human-centric approach. This framework advances beyond traditional object detection models by predicting an action-specific density over target object locations derived from the appearance of detected people, including cues like pose and clothing.
Human-Centric Approach: This method uses a person's appearance to localize the objects they interact with. By learning a Gaussian density over the 4-dimensional box of the target object, predicted from human-centric features, InteractNet effectively narrows the search space for target objects.
End-to-End Learning: InteractNet's architecture is based on the Faster R-CNN framework, allowing simultaneous learning of object detection and interaction inference. This end-to-end model predicts interaction triplets, integrating person and object detection with action-target location prediction.
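The contributions above can be summarized in a short sketch. The paper scores a ⟨human, action, object⟩ triplet by combining the two detection scores, a per-human action score, and the action-specific density over the target object's location; the function names, box parameterization, and the fixed sigma below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def target_density(mu_ah, b_o, sigma=0.3):
    """Illustrative target-location density: how well a candidate object
    box b_o matches the action-specific location mu_ah predicted from the
    human's appearance. Boxes are 4-d vectors; sigma is a placeholder
    constant, not a value from the paper.
    g = exp(-||mu - b||^2 / (2 * sigma^2))."""
    d = np.asarray(mu_ah, dtype=float) - np.asarray(b_o, dtype=float)
    return float(np.exp(-np.dot(d, d) / (2.0 * sigma ** 2)))

def triplet_score(s_h, s_o, s_action, g):
    """Score for a <human, action, object> triplet: product of the human
    detection score, the object detection score, the per-human action
    score, and the target-location density g."""
    return s_h * s_o * s_action * g

# A perfectly matching target location contributes g = 1.0,
# so the triplet score reduces to the product of the three scores.
g = target_density([0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.0, 1.0])
score = triplet_score(0.9, 0.8, 0.7, g)
```

Because the density multiplies the detection scores, objects far from the predicted action-specific location are suppressed even if they are confidently detected on their own.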
Experimental Validation
The authors validate their approach on the Verbs in COCO (V-COCO) and HICO-DET datasets. InteractNet demonstrates significant gains over existing baselines, achieving a 26% relative improvement in average precision for interaction detection on V-COCO and a 27% relative improvement on HICO-DET.
Performance Metrics: The primary metric used is role AP, which gauges the precision of detecting interaction triplets ⟨human, action, object⟩. InteractNet improves this metric substantially by accurately inferring target objects based on human appearance.
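Under the role AP metric, a predicted triplet counts as a true positive only when the action label matches and both the human and object boxes overlap the ground truth sufficiently. The sketch below illustrates that matching rule with the standard IoU ≥ 0.5 threshold; the dictionary keys are assumed names for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def triplet_is_correct(pred, gt, thresh=0.5):
    """A predicted <human, action, object> triplet is a true positive only
    if the action matches and BOTH boxes reach IoU >= thresh against the
    ground-truth triplet."""
    return (pred["action"] == gt["action"]
            and iou(pred["human_box"], gt["human_box"]) >= thresh
            and iou(pred["object_box"], gt["object_box"]) >= thresh)
```

Ranking all such predictions by score and computing average precision over this matching rule yields the role AP reported in the paper.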
Practical and Theoretical Implications
InteractNet provides a robust framework for improved human-centric understanding in visual recognition. Its effectiveness and efficiency—running at approximately 135ms per image—make it viable for real-world applications requiring rapid scene interpretation, such as autonomous vehicles, surveillance, and human-computer interaction systems.
Future Research Directions: The paper opens avenues for further exploration into reasoning-based models that can resolve more complex interaction scenarios and handle ambiguous visual cues. Enhancements in multi-modal prediction capabilities could also be explored to refine interaction detection algorithms.
Conclusion
The paper's findings underscore the potential of human-centric approaches in interaction detection tasks. InteractNet's integration of action-specific cues with object localization within an established detection framework positions it as a significant advancement in the domain of visual understanding, providing a template for future explorations in the field.