- The paper introduces a two-branch guided embedding network that simplifies human-object pairing by eliminating complex post-matching.
- It transfers CLIP’s visual-linguistic knowledge via text-embedding classifier initialization and a mimic loss to enhance interaction classification.
- The approach achieves a +5.05 mAP improvement on the HICO-Det benchmark, demonstrating significant advances in HOI detection.
A Comprehensive Analysis of GEN-VLKT: Enhancing Human-Object Interaction Detection
The paper "GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection" addresses the dual challenges inherent in Human-Object Interaction (HOI) detection: human-object association and interaction understanding. HOI detection is crucial for machines to comprehend human activities in static images, requiring the identification of human-object pairs and their interactions through HOI triplets <Human, Object, Verb>. Traditional approaches often struggle with complex association methods and understanding interactions from long-tailed distributions or zero-shot scenarios. The authors propose a novel framework, GEN-VLKT, which mitigates these issues by integrating a two-branch architecture and a visual-linguistic knowledge transfer strategy.
The core component of this research is the Guided-Embedding Network (GEN), which simplifies the association problem by employing a two-branch pipeline without post-matching, differentiating it from previous query-driven HOI detectors. This is achieved through the design of an instance decoder and an interaction decoder. The instance decoder detects humans and objects with two distinct query sets, and a positional Guided Embedding (p-GE) shared by the human and object queries at the same position marks them as a pair. The interaction decoder then classifies the interaction for each pair, with its queries guided by the instance decoder's outputs (see the sketch below). This eliminates the costly and complex post-matching step required by traditional methods.
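The pairing-by-index idea can be illustrated with a short PyTorch-style sketch; the tensor names and sizes are assumptions, and the real decoders are DETR-style transformer layers rather than the simple fusion shown here.

```python
import torch
import torch.nn as nn

num_queries, d_model = 64, 256  # illustrative sizes

# Two distinct learnable query sets for the instance decoder:
# one detects humans, the other detects objects.
human_queries = nn.Parameter(torch.randn(num_queries, d_model))
object_queries = nn.Parameter(torch.randn(num_queries, d_model))

# Positional Guided Embedding (p-GE): the human query and object query at the
# same index share one positional embedding, so index i implicitly defines the
# i-th human-object pair and no post-matching step is needed.
p_ge = nn.Parameter(torch.randn(num_queries, d_model))
human_queries_guided = human_queries + p_ge
object_queries_guided = object_queries + p_ge

# In the paper the interaction decoder's queries are generated from the paired
# instance-decoder outputs; here we only illustrate pairing-by-index by fusing
# the two guided query sets position-wise.
interaction_queries = (human_queries_guided + object_queries_guided) / 2
```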
For interaction understanding, the paper introduces the Visual-Linguistic Knowledge Transfer (VLKT) strategy, which leverages CLIP, a visual-linguistic model pre-trained on 400 million image-text pairs. The authors use CLIP's text embeddings to initialize the interaction classifier and introduce a mimic loss that aligns GEN's visual features with CLIP's image features, improving generalization and classification performance, particularly under long-tailed and zero-shot conditions.
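A minimal sketch of these two transfer mechanisms follows; placeholder tensors stand in for real CLIP outputs, and the dimensions and prompt wording are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumptions: `clip_text_embed` stands in for CLIP text-encoder outputs over
# prompts such as "a photo of a person riding a bicycle"; the class count,
# embedding dimension, and prompt template are illustrative.
num_hoi_classes, clip_dim = 600, 512
clip_text_embed = torch.randn(num_hoi_classes, clip_dim)  # placeholder for real CLIP text features

# Initialize the interaction classifier from the text embeddings, so each class
# logit behaves roughly like a similarity to that class's text feature.
interaction_classifier = nn.Linear(clip_dim, num_hoi_classes, bias=False)
with torch.no_grad():
    interaction_classifier.weight.copy_(F.normalize(clip_text_embed, dim=-1))

# Mimic loss: pull the detector's pooled visual feature toward CLIP's image
# feature for the same image; an L1 distance is one simple choice.
def mimic_loss(gen_visual_feat: torch.Tensor, clip_image_feat: torch.Tensor) -> torch.Tensor:
    return F.l1_loss(gen_visual_feat, clip_image_feat)
```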
The reported results support the design: GEN-VLKT outperforms prior state-of-the-art methods across multiple datasets, most notably by +5.05 mAP on the HICO-Det benchmark. This margin underscores the effectiveness of guided embeddings for association and of large-scale visual-linguistic pre-training for interaction understanding.
The implications of this research are multifaceted. Practically, GEN-VLKT has the potential to improve the precision and efficiency of real-world HOI detection systems. Theoretically, the successful application of a visual-linguistic model like CLIP points to promising avenues for transfer learning in human-object interaction research. Future work may explore a more granular integration of visual-linguistic data to tackle specific interaction categories or to develop a more nuanced understanding of interaction contexts.
In conclusion, the GEN-VLKT framework delivers substantial advances in HOI detection by directly addressing two persistent challenges in the domain. The combination of guided embeddings and visual-linguistic knowledge transfer suggests promising directions for future work on machine understanding of complex visual and interactive data.