Towards Zero-shot Human-Object Interaction Detection via Vision-Language Integration (2403.07246v1)
Abstract: Human-object interaction (HOI) detection aims to locate human-object pairs in images and identify their interaction categories. Most existing methods focus on supervised learning, which relies on extensive manual HOI annotations. In this paper, we propose a novel framework, termed Knowledge Integration to HOI (KI2HOI), that effectively integrates knowledge from vision-language models to improve zero-shot HOI detection. Specifically, a verb feature learning module is designed around visual semantics, employing a verb extraction decoder to convert verb queries into interaction-specific category representations. We develop an additive self-attention mechanism to generate more comprehensive visual representations. Moreover, an interaction representation decoder extracts informative regions by integrating spatial and visual features through a cross-attention mechanism. To handle zero-shot learning in low-data regimes, we leverage prior knowledge from the CLIP text encoder to initialize the linear classifier for enhanced interaction understanding. Extensive experiments on the mainstream HICO-DET and V-COCO benchmarks demonstrate that our model outperforms previous methods under various zero-shot and fully supervised settings.
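The final design choice in the abstract, seeding the interaction classifier with CLIP text-encoder priors, is a widely used zero-shot recipe. Below is a minimal PyTorch sketch of that idea, assuming OpenAI's `clip` package; the verb list, prompt template, and layer setup are illustrative placeholders, not the paper's exact configuration (KI2HOI's actual verb vocabulary comes from HICO-DET).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

# Hypothetical interaction (verb) labels; the paper uses the HICO-DET verb set.
VERBS = ["ride", "hold", "eat", "cut", "throw"]
PROMPT = "a photo of a person {} an object"  # illustrative prompt template

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

with torch.no_grad():
    # Embed one prompted phrase per verb and L2-normalize the rows.
    tokens = clip.tokenize([PROMPT.format(v) for v in VERBS]).to(device)
    text_emb = model.encode_text(tokens).float()   # (num_verbs, 512)
    text_emb = F.normalize(text_emb, dim=-1)

# Linear classifier over interaction representations, initialized from CLIP
# text priors instead of randomly, so the class weights inherit CLIP's
# semantic structure before any HOI training.
classifier = nn.Linear(text_emb.shape[1], len(VERBS), bias=False).to(device)
with torch.no_grad():
    classifier.weight.copy_(text_emb)

# Usage: score normalized interaction features from the decoder
# (random stand-ins here) against the verb classes.
queries = F.normalize(torch.randn(4, text_emb.shape[1], device=device), dim=-1)
logits = classifier(queries)  # (4, num_verbs)
print(logits.shape)
```

Because both the class weights and the query features are unit-normalized, each logit is a cosine similarity in CLIP's joint embedding space, which is what lets verbs unseen during HOI training still receive meaningful scores.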
Authors: Weiying Xue, Qi Liu, Qiwei Xiong, Yuxiao Wang, Zhenao Wei, Xiaofen Xing, Xiangmin Xu