HOTR: End-to-End Human-Object Interaction Detection with Transformers (2104.13682v1)

Published 28 Apr 2021 in cs.CV

Abstract: Human-Object Interaction (HOI) detection is a task of identifying "a set of interactions" in an image, which involves the i) localization of the subject (i.e., humans) and target (i.e., objects) of interaction, and ii) the classification of the interaction labels. Most existing methods have indirectly addressed this task by detecting human and object instances and individually inferring every pair of the detected instances. In this paper, we present a novel framework, referred to by HOTR, which directly predicts a set of <human, object, interaction> triplets from an image based on a transformer encoder-decoder architecture. Through the set prediction, our method effectively exploits the inherent semantic relationships in an image and does not require time-consuming post-processing which is the main bottleneck of existing methods. Our proposed algorithm achieves the state-of-the-art performance in two HOI detection benchmarks with an inference time under 1 ms after object detection.

Authors (5)

Bumsoo Kim (18 papers)
Junhyun Lee (32 papers)
Jaewoo Kang (83 papers)
Eun-Sol Kim (15 papers)
Hyunwoo J. Kim (70 papers)

Citations (241)


Summary

The paper introduces a novel transformer-based approach that predicts human-object interaction triplets in a single end-to-end step.
The paper leverages a dual decoder and Hungarian matching to eliminate redundant post-processing and significantly reduce inference time.
The paper achieves state-of-the-art accuracy on V-COCO and HICO-DET, outperforming traditional multi-step HOI detection methods.

Insights into "HOTR: End-to-End Human-Object Interaction Detection with Transformers"

This paper presents HOTR (Human-Object Interaction TRansformer), a framework that improves Human-Object Interaction (HOI) detection with a transformer-based architecture. HOI detection involves identifying interactions between humans and objects in images, which requires both precise localization and interaction classification. Existing approaches mostly break this task into multiple steps: they detect human and object instances first, then infer an interaction for every pair of detections, which often requires separate post-processing to associate the detected instances with their interactions. The pairwise inference and the post-processing make these methods slow.

HOTR instead takes a direct set prediction approach built on a transformer encoder-decoder architecture, predicting $\langle \text{human}, \text{object}, \text{interaction} \rangle$ triplets end-to-end. This lets the model exploit the semantic relationships inherent in an image while removing the need for extensive post-processing. The self-attention mechanism of transformers models interdependencies between elements in a scene, which contributes to both faster inference and higher accuracy.

Core Contributions

Transformer-Based Architecture: According to the paper, HOTR is the first method to apply transformer-based set prediction to HOI detection. Unlike previous approaches that detect instances first and then infer interactions pair by pair, HOTR predicts the full set of triplets in a single pass, which improves both efficiency and accuracy.
Efficient Inference: HOTR takes under 1 ms of inference time after object detection, faster than earlier parallel HOI detectors such as IPNet and UnionDet, which take between 5 and 9 ms.
State-of-the-Art Performance: HOTR achieves state-of-the-art performance on two benchmark datasets, V-COCO and HICO-DET. The gain is attributed to eliminating redundant box regression and to associating human-object pairs more effectively.

Methodology

The paper introduces a transformer encoder-decoder setup with two decoders running in parallel: an instance decoder for object detection and an interaction decoder for HOI set prediction. The interaction decoder produces interaction representations, and HO Pointers associate each of them with the human and the object detected by the instance decoder. Because the interaction branch points to existing detections rather than predicting its own boxes, it avoids the redundant box predictions common in alternative methods.
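The pointer-based association can be illustrated with a small sketch. It assumes that each interaction carries a human pointer vector and an object pointer vector, and that a pointer is resolved by picking the instance representation with the highest cosine similarity. The function names, the toy 2-D vectors, and the boxes are hypothetical and only serve the illustration; this is not the authors' implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def resolve_pointer(pointer_vec, instance_reprs):
    """Return the index of the instance most similar to the pointer vector."""
    sims = [cosine(pointer_vec, inst) for inst in instance_reprs]
    return max(range(len(sims)), key=sims.__getitem__)

def decode_triplets(interactions, instance_reprs, boxes):
    """Turn (human_vec, object_vec, action) predictions into box triplets.

    No box is regressed here: both boxes are looked up from the
    instance branch through the resolved pointer indices.
    """
    triplets = []
    for human_vec, object_vec, action in interactions:
        h = resolve_pointer(human_vec, instance_reprs)
        o = resolve_pointer(object_vec, instance_reprs)
        triplets.append((boxes[h], boxes[o], action))
    return triplets

# Toy data (assumed): three detected instances with 2-D representations.
instance_reprs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
boxes = [(10, 20, 110, 300), (150, 200, 220, 260), (300, 40, 380, 310)]
interactions = [([0.9, 0.1], [0.1, 0.8], "hold")]

triplets = decode_triplets(interactions, instance_reprs, boxes)
print(triplets)
```

In this sketch the interaction branch only outputs indices into the instance branch, which is why the same detected box is never predicted twice.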

HOTR employs a Hungarian matching strategy to align model predictions one-to-one with ground-truth interactions, which makes it possible to train the model on set-level predictions directly. The interaction decoder itself does not perform bounding box regression, since the boxes come from the instance decoder. By leveraging the attention mechanism, the model can account for complex dependencies and spatial relationships within the scene, which further improves detection accuracy.
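The matching step can be sketched as a minimum-cost one-to-one assignment between predictions and ground-truth triplets. The sketch below makes two loud assumptions: the cost function is a simplified stand-in (pointer index mismatches plus one minus the predicted probability of the true action), not the cost used in the paper, and the optimum is found by brute force over permutations rather than by the Hungarian algorithm, which reaches the same optimum in polynomial time. All names are hypothetical.

```python
from itertools import permutations

def match_cost(pred, gt):
    """Simplified cost between one prediction and one ground-truth triplet.

    pred: (human_idx, object_idx, {action: probability})
    gt:   (human_idx, object_idx, action)
    """
    pointer_cost = int(pred[0] != gt[0]) + int(pred[1] != gt[1])
    action_cost = 1.0 - pred[2].get(gt[2], 0.0)
    return pointer_cost + action_cost

def best_assignment(preds, gts):
    """Return (assignment, cost), where assignment[g] is the prediction
    index matched to ground truth g. Brute force, so only for tiny sets."""
    if len(preds) < len(gts):
        raise ValueError("need at least as many predictions as ground truths")
    best_cost, best = float("inf"), ()
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(match_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return list(best), best_cost

# Toy data (assumed): three predictions, two ground-truth interactions.
preds = [
    (0, 2, {"ride": 0.9}),
    (0, 1, {"hold": 0.8}),
    (1, 2, {"ride": 0.3}),
]
gts = [(0, 1, "hold"), (0, 2, "ride")]

assignment, cost = best_assignment(preds, gts)
print(assignment, round(cost, 3))
```

Predictions left unmatched (here the third one) would be supervised as "no interaction" in a set prediction setup, which is what discourages duplicate outputs without a separate suppression step.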

Results and Implications

The experimental results show that HOTR improves average precision (AP) on the established V-COCO and HICO-DET benchmarks while also making prediction computationally cheaper. On HICO-DET, HOTR reports strong results across the Full, Rare, and Non-Rare categories, reflecting its ability to capture interaction semantics. Challenges remain, especially in the Rare category, where training data is limited; incorporating external features could improve these results.

In conclusion, HOTR represents a significant step forward in HOI detection, and its success stems from a transformer architecture that handles complex interaction predictions efficiently. Future work could integrate broader contextual information or explore more sophisticated decoder designs to further improve HOI detection.
