Reformulating HOI Detection as Adaptive Set Prediction (2103.05983v2)

Published 10 Mar 2021 in cs.CV

Abstract: Determining which image regions to concentrate on is critical for Human-Object Interaction (HOI) detection. Conventional HOI detectors focus on either detected human and object pairs or pre-defined interaction locations, which limits learning of effective features. In this paper, we reformulate HOI detection as an adaptive set prediction problem. With this novel formulation, we propose an Adaptive Set-based one-stage framework (AS-Net) with parallel instance and interaction branches. To attain this, we map a trainable interaction query set to an interaction prediction set with a transformer. Each query adaptively aggregates the interaction-relevant features from global contexts through multi-head co-attention. Besides, the training process is supervised adaptively by matching each ground truth with an interaction prediction. Furthermore, we design an effective instance-aware attention module to introduce instructive features from the instance branch into the interaction branch. Our method outperforms previous state-of-the-art methods without any extra human pose and language features on three challenging HOI detection datasets. In particular, we achieve over $31\%$ relative improvement on the large-scale HICO-DET dataset. Code is available at https://github.com/yoyomimi/AS-Net.

Summary

The paper reformulates HOI detection as an adaptive set prediction task using the AS-Net framework to overcome instance- and location-centric limitations.
It employs a transformer-based architecture with multi-head co-attention and an instance-aware attention module to integrate global and local features.
Empirical results show that AS-Net outperforms previous state-of-the-art methods, achieving over 31% relative improvement on the HICO-DET benchmark.

Reformulating HOI Detection as Adaptive Set Prediction

The paper "Reformulating HOI Detection as Adaptive Set Prediction" recasts Human-Object Interaction (HOI) detection as an adaptive set prediction problem. The reformulation targets a limitation of conventional HOI detectors, which focus on either detected human-object pairs or predefined interaction locations. According to the authors, restricting attention to these regions limits the features a model can learn for interaction prediction.

Methodology

The researchers propose an Adaptive Set-based one-stage framework (AS-Net) with parallel instance and interaction branches. The framework maps a trainable interaction query set to an interaction prediction set with a transformer. Through multi-head co-attention, each query adaptively aggregates interaction-relevant features from the global context rather than from a fixed region. Training is also supervised adaptively: each ground-truth interaction is matched with an interaction prediction, so the model is not told in advance which query must predict which interaction. Finally, an instance-aware attention module introduces instructive features from the instance branch into the interaction branch.
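To make the aggregation step concrete, here is a minimal sketch of one query pooling features from several image locations with single-head scaled dot-product attention. It is an illustration under stated assumptions, not the authors' implementation: the function names, the 2-D feature values, and the use of raw features as both keys and values are hypothetical, and the actual model uses learned projections and multiple heads.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Pool `features` into one vector, weighted by similarity to `query`.

    Single-head scaled dot-product attention where the features act as both
    keys and values (an assumption made to keep the sketch short).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * f for q, f in zip(query, feat)) / scale for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [sum(w * feat[d] for w, feat in zip(weights, features)) for d in range(dim)]
    return pooled, weights


# Hypothetical 2-D features for three image locations.
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
# A hypothetical query that responds to the first feature direction.
query = [4.0, 0.0]

pooled, weights = attend(query, features)
```

The weights depend on the query, so different queries pool different parts of the same feature map. That is the sense in which the aggregation is adaptive rather than tied to a predefined location.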

Key Contributions

Set Prediction Reformulation: Treating HOI detection as a set prediction task lets the model choose where to look for each interaction, which avoids the instance-centric and location-centric limitations of earlier detectors.
Transformer-Based Framework: AS-Net uses a transformer to perform set prediction, with multi-layer decoders for both the instance and interaction branches, so that predictions draw on global image features.
Instance-Aware Attention Module: This module shares features between the two branches, passing instructive instance-branch features to the interaction branch, which the authors report improves performance.
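The set prediction reformulation depends on matching ground-truth interactions to predictions one-to-one during training. The sketch below shows that idea on a toy example. Everything specific in it is an assumption for illustration: the cost function (class mismatch plus L1 distance between interaction points), the dictionary fields, and the brute-force search over permutations, which stands in for the assignment solver a real detector would use.

```python
from itertools import permutations


def match_cost(pred, truth):
    """Hypothetical cost: class mismatch penalty plus L1 distance between points."""
    class_cost = 0.0 if pred["label"] == truth["label"] else 1.0
    point_cost = sum(abs(p - t) for p, t in zip(pred["point"], truth["point"]))
    return class_cost + point_cost


def match(predictions, truths):
    """Find the minimum-cost one-to-one assignment of truths to predictions.

    Returns ({truth_index: prediction_index}, total_cost). Brute force over
    permutations, so only suitable for tiny sets; assumes there are at least
    as many predictions as ground truths.
    """
    best_cost, best_choice = float("inf"), ()
    for chosen in permutations(range(len(predictions)), len(truths)):
        cost = sum(match_cost(predictions[p], truths[t]) for t, p in enumerate(chosen))
        if cost < best_cost:
            best_cost, best_choice = cost, chosen
    return dict(enumerate(best_choice)), best_cost


# Hypothetical predictions from three interaction queries.
predictions = [
    {"label": "ride", "point": (0.8, 0.8)},
    {"label": "hold", "point": (0.2, 0.3)},
    {"label": "ride", "point": (0.1, 0.1)},
]
# Hypothetical ground-truth interactions for the same image.
truths = [
    {"label": "hold", "point": (0.2, 0.2)},
    {"label": "ride", "point": (0.7, 0.8)},
]

assignment, total_cost = match(predictions, truths)
unmatched = sorted(set(range(len(predictions))) - set(assignment.values()))
```

In this toy case the first ground truth is assigned to the second prediction and the second ground truth to the first prediction, leaving the third prediction unmatched. Set-prediction detectors commonly train unmatched predictions toward a "no interaction" target; whether AS-Net does exactly this is not stated in this summary.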

Results and Implications

Empirical evaluations on three benchmark datasets (HICO-DET, V-COCO, and HOI-A) show that AS-Net outperforms previous state-of-the-art models without using auxiliary human pose or language features. On the large-scale HICO-DET dataset in particular, AS-Net achieves over 31% relative improvement.
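The word "relative" matters when reading this figure. As a reminder of the standard definition, with $m_{\text{prev}}$ and $m_{\text{new}}$ as notation introduced here for the previous best score and the AS-Net score on the benchmark metric (the symbols are not taken from the paper):

$$
\text{relative improvement} = \frac{m_{\text{new}} - m_{\text{prev}}}{m_{\text{prev}}}
$$

A relative improvement above 31% therefore means $m_{\text{new}} > 1.31\, m_{\text{prev}}$, which is not the same as a gain of 31 absolute points.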

The implications of this research extend towards the broader field of computer vision, where the reformulation into adaptive set prediction could influence related tasks such as visual relationship detection and multi-object tracking. Furthermore, as AI systems continue to require more nuanced and sophisticated understanding of human-object interactions, frameworks like AS-Net could play a pivotal role in advancing these capabilities.

Conclusion

The paper makes a case for treating HOI detection as adaptive set prediction. AS-Net lets each interaction query aggregate features from the global context and supervises training through adaptive matching, which addresses the instance-centric and location-centric limitations of earlier detectors and yields the reported gains on three benchmarks. The approach also suggests directions for research on adaptive feature processing and transformer-based architectures in visual detection tasks. Future work could explore scaling this framework to more complex interactions and integrating it with other modalities to enhance real-world applicability.
