Affordance Transfer Learning for Human-Object Interaction Detection (2104.02867v2)

Published 7 Apr 2021 in cs.CV, cs.AI, and cs.RO

Abstract: Reasoning the human-object interactions (HOI) is essential for deeper scene understanding, while object affordances (or functionalities) are of great importance for human to discover unseen HOIs with novel objects. Inspired by this, we introduce an affordance transfer learning approach to jointly detect HOIs with novel objects and recognize affordances. Specifically, HOI representations can be decoupled into a combination of affordance and object representations, making it possible to compose novel interactions by combining affordance representations and novel object representations from additional images, i.e. transferring the affordance to novel objects. With the proposed affordance transfer learning, the model is also capable of inferring the affordances of novel objects from known affordance representations. The proposed method can thus be used to 1) improve the performance of HOI detection, especially for the HOIs with unseen objects; and 2) infer the affordances of novel objects. Experimental results on two datasets, HICO-DET and HOI-COCO (from V-COCO), demonstrate significant improvements over recent state-of-the-art methods for HOI detection and object affordance detection. Code is available at https://github.com/zhihou7/HOI-CL

Summary

The paper introduces a framework that decouples HOI representations into affordance and object features to detect novel interactions.
It leverages additional object data to transfer known affordance features, significantly enhancing performance on rare interactions.
Empirical results on HICO-DET and HOI-COCO datasets show notable mAP improvements and effective zero-shot learning.

Affordance Transfer Learning for Human-Object Interaction Detection

This paper presents an affordance transfer learning framework designed to enhance Human-Object Interaction (HOI) detection by efficiently recognizing interactions involving novel objects. The primary focus is on the decoupling and independent modeling of object affordances, which are critical in identifying previously unseen interactions within varied scene contexts. By separating HOI representations into affordance and object components, the framework permits the composition of novel interactions through the combination of these decoupled parts.

Methodology

The core of the method is affordance transfer learning (ATL): transferring affordances learned from annotated HOIs to new objects. It uses additional object data to compose new combinations of human actions and objects, which broadens the range of HOIs the model can detect. Specifically, the framework is structured to:

Disentangle HOI representations into affordance and object representations.
Compose unseen HOIs by combining affordance representations from the training set with new object representations drawn from supplementary image datasets.
Allow the HOI detection model to infer the affordances of novel objects by aligning known affordance features with features extracted from these novel objects.
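
The three steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: features are plain NumPy vectors, composition is simple concatenation, and the names `compose_hois`, `infer_affordances`, `valid_pairs`, `affordance_bank`, and `toy_score_fn` are hypothetical. In particular, `toy_score_fn` is only a stand-in for a trained HOI classifier.

```python
import numpy as np

def compose_hois(verb_feats, verb_labels, obj_feats, obj_labels, valid_pairs):
    """Concatenate each affordance (verb) feature with each object feature,
    keeping only the (verb, object) pairs listed in valid_pairs."""
    composed, labels = [], []
    for v_feat, verb in zip(verb_feats, verb_labels):
        for o_feat, obj in zip(obj_feats, obj_labels):
            if (verb, obj) in valid_pairs:
                composed.append(np.concatenate([v_feat, o_feat]))
                labels.append((verb, obj))
    dim = verb_feats.shape[1] + obj_feats.shape[1]
    feats = np.stack(composed) if composed else np.empty((0, dim))
    return feats, labels

def infer_affordances(obj_feat, affordance_bank, score_fn, threshold=0.5):
    """Pair a novel object's feature with stored affordance features and
    average the classifier score per verb; keep verbs at or above threshold."""
    found = {}
    for verb, feats in affordance_bank.items():
        pairs = np.stack([np.concatenate([f, obj_feat]) for f in feats])
        mean_score = float(np.mean(score_fn(pairs, verb)))
        if mean_score >= threshold:
            found[verb] = mean_score
    return found

def toy_score_fn(pairs, verb):
    """Hypothetical stand-in for a trained HOI classifier head (assumption)."""
    weights = {"ride": 1.0, "eat": -1.0}
    return 1.0 / (1.0 + np.exp(-weights[verb] * pairs[:, -1]))

# Example: compose two valid HOIs, then query affordances for one new object.
verb_feats = np.ones((2, 3))   # affordance features from annotated HOIs
obj_feats = np.zeros((2, 2))   # object features from additional images
feats, labels = compose_hois(
    verb_feats, ["ride", "eat"], obj_feats, ["horse", "apple"],
    valid_pairs={("ride", "horse"), ("eat", "apple")},
)
bank = {"ride": np.ones((2, 3)), "eat": np.ones((2, 3))}
affordances = infer_affordances(np.array([2.0]), bank, toy_score_fn)
```

Filtering by `valid_pairs` reflects the idea that only plausible verb-object combinations should be used as composed training samples; how that set is obtained is left open here.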

Implementation and Experimentation

The framework is evaluated on the HICO-DET and HOI-COCO datasets, where it improves on recent state-of-the-art approaches. The gains are largest for rare or unseen interactions, which are hard to learn because few training examples exist. The HOI model, which also performs weakly supervised affordance recognition, is trained on a combination of standard HOI samples and newly composed HOI instances built by combining object features with affordance features.
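
Training on both kinds of samples can be pictured as a weighted sum of two classification losses. The sketch below is an assumption-laden illustration rather than the paper's objective: it uses binary cross-entropy for both terms, and the weight `lam` and the function names `bce` and `joint_loss` are hypothetical.

```python
import numpy as np

def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 targets."""
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    targets = np.asarray(targets, dtype=float)
    return float(-np.mean(targets * np.log(probs)
                          + (1.0 - targets) * np.log(1.0 - probs)))

def joint_loss(real_probs, real_targets, comp_probs, comp_targets, lam=0.5):
    """Loss on annotated HOI samples plus a weighted loss on composed samples.
    lam is a hypothetical balancing weight (assumption)."""
    return bce(real_probs, real_targets) + lam * bce(comp_probs, comp_targets)

# Example: one batch of annotated samples and one batch of composed samples.
loss = joint_loss(
    real_probs=[0.9, 0.2], real_targets=[1, 0],
    comp_probs=[0.6, 0.4], comp_targets=[1, 0],
)
```

Keeping the two terms separate makes it easy to down-weight composed samples if they turn out noisier than annotated ones.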

Results

The experimental results support these claims. On the HICO-DET dataset, the proposed approach increases mean average precision (mAP), with the largest gains on rare interactions. When given additional object data from COCO, the network improves further. A series of zero-shot learning experiments further tests the framework's ability to generalize to interactions absent from its training data.
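
For readers unfamiliar with the metric, mAP is the mean over HOI categories of each category's average precision (AP). The sketch below is a simplified, non-interpolated AP that assumes detections have already been matched to ground truth; benchmark evaluation code typically adds box-overlap matching and may use an interpolated variant. The function names are hypothetical.

```python
import numpy as np

def average_precision(scores, is_true_positive, num_ground_truth):
    """Non-interpolated AP: mean of the precision values at each true positive,
    normalized by the number of ground-truth instances."""
    if num_ground_truth == 0:
        return 0.0
    order = np.argsort(-np.asarray(scores, dtype=float))  # highest score first
    tp = np.asarray(is_true_positive, dtype=float)[order]
    precision = np.cumsum(tp) / np.arange(1, len(tp) + 1)
    return float(np.sum(precision * tp) / num_ground_truth)

def mean_average_precision(per_class):
    """per_class maps a category name to (scores, is_true_positive, num_gt)."""
    return float(np.mean([average_precision(*args) for args in per_class.values()]))

# Example with two hypothetical HOI categories.
results = {
    "ride horse": ([0.9, 0.8, 0.7], [1, 0, 1], 2),
    "eat apple": ([0.6], [1], 1),
}
m_ap = mean_average_precision(results)
```

Because rare categories contribute to the mean with the same weight as frequent ones, gains on rare interactions move mAP noticeably.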

Implications and Future Directions

This work has significant implications for both practical applications and theoretical advancements in AI. Practically, it sets a precedent for handling long-tailed data distributions in HOI recognition tasks, an essential factor for real-world applications in surveillance, assistive technology, and human-computer interaction systems. Theoretically, it expands upon compositional learning by exhibiting how affordance representations can be transferred across contexts, motivating future research in computational affordance recognition.

Looking ahead, the model's framework could be refined to integrate even larger and more diverse datasets, potentially incorporating fine-grained affordance representations that take into account the varying contexts within which interactions occur. Additionally, exploring the integration of this framework with explicit scene context understanding may further bridge the gap between perception and cognition in AI systems.

In summary, this paper advances the understanding and use of affordances in human-object interaction detection, offering a scalable compositional learning approach to detecting unseen interactions.

GitHub - zhihou7/HOI-CL: Series of work (ECCV2020, CVPR2021, CVPR2021, ECCV2022) about Compositional Learning for Human-Object Interaction Exploration (78 stars)