Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection (2404.06194v2)

Published 9 Apr 2024 in cs.CV

Abstract: Open-vocabulary human-object interaction (HOI) detection, which is concerned with the problem of detecting novel HOIs guided by natural language, is crucial for understanding human-centric scenes. However, prior zero-shot HOI detectors often employ the same levels of feature maps to model HOIs with varying distances, leading to suboptimal performance in scenes containing human-object pairs with a wide range of distances. In addition, these detectors primarily rely on category names and overlook the rich contextual information that language can provide, which is essential for capturing open vocabulary concepts that are typically rare and not well-represented by category names alone. In this paper, we introduce a novel end-to-end open vocabulary HOI detection framework with conditional multi-level decoding and fine-grained semantic enhancement (CMD-SE), harnessing the potential of Vision-Language Models (VLMs). Specifically, we propose to model human-object pairs with different distances with different levels of feature maps by incorporating a soft constraint during the bipartite matching process. Furthermore, by leveraging large language models (LLMs) such as GPT models, we exploit their extensive world knowledge to generate descriptions of human body part states for various interactions. Then we integrate the generalizable and fine-grained semantics of human body parts to improve interaction recognition. Experimental results on two datasets, SWIG-HOI and HICO-DET, demonstrate that our proposed method achieves state-of-the-art results in open vocabulary HOI detection. The code and models are available at https://github.com/ltttpku/CMD-SE-release.

Authors (3)
  1. Ting Lei
  2. Shaofeng Yin
  3. Yang Liu
Citations (7)

Summary

The paper "Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection" proposes a novel framework for open-vocabulary human-object interaction (HOI) detection, with a focus on leveraging large foundation models in the form of Visual-LLMs (VLMs) and LLMs. The goal of open-vocabulary HOI detection is to accurately identify and interpret interactions involving human and object pairs described by arbitrary text inputs, accommodating novel or unseen interactions not encountered during the training phase.

The authors observe two main challenges in existing zero-shot HOI detection methods: (1) the use of the same levels of feature maps for modeling human-object pairs across varying spatial distances, which leads to suboptimal performance, and (2) an overreliance on category names that neglects the rich contextual information natural language can provide, which is needed to capture open-vocabulary concepts that are typically rare and poorly represented by category names alone.

To address these issues, the paper introduces an end-to-end framework with Conditional Multi-level Decoding and fine-grained Semantic Enhancement (CMD-SE). Its core contributions are as follows:

  1. Conditional Multi-level Decoding (CMD): The framework uses different levels of feature maps to model human-object interactions at different spatial distances. A soft constraint added during the bipartite matching process encourages each decoding level to specialize in human-object pairs within a particular range of distances, improving recognition in scenes where pair distances vary widely (a minimal sketch of how such a constraint could enter the matching cost follows this list).
  2. Fine-grained Semantic Enhancement (SE): The authors use LLMs such as GPT models to generate detailed descriptions of human body-part states for various interactions. This linguistic context supplies generalizable and fine-grained semantics, improving both recognition accuracy and the model's ability to distinguish between HOI concepts.
  3. Experimental Validation: The method is evaluated on two benchmarks, SWIG-HOI and HICO-DET, where it achieves state-of-the-art results for open-vocabulary HOI detection, supporting the claim that combining multi-level feature decoding with fine-grained linguistic context is effective for this task.
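To make the multi-level decoding idea concrete, below is a minimal sketch (not the authors' implementation) of how a distance-conditioned soft constraint could be folded into a Hungarian-style bipartite matching cost in a DETR-like detector. The tensor shapes, cost terms, and the `level_weight` parameter are illustrative assumptions.

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_with_level_prior(pred_boxes_h, pred_boxes_o, pred_levels,
                           gt_boxes_h, gt_boxes_o, level_weight=1.0):
    """Hungarian matching with a soft penalty that nudges queries decoded from
    a given feature-map level toward ground-truth pairs whose human-object
    distance suits that level. All names and weights here are illustrative.

    pred_boxes_*: (num_queries, 4) predicted human/object boxes (cx, cy, w, h).
    pred_levels:  (num_queries,) feature level of each query, normalized to [0, 1].
    gt_boxes_*:   (num_gt, 4) ground-truth human/object boxes.
    """
    # Base localization cost: L1 distance between predicted and GT box centers.
    cost_h = torch.cdist(pred_boxes_h[:, :2], gt_boxes_h[:, :2], p=1)
    cost_o = torch.cdist(pred_boxes_o[:, :2], gt_boxes_o[:, :2], p=1)

    # Normalized human-object distance of each ground-truth pair.
    gt_dist = (gt_boxes_h[:, :2] - gt_boxes_o[:, :2]).norm(dim=-1)
    gt_dist = gt_dist / (gt_dist.max() + 1e-6)

    # Soft constraint: penalize assigning a query to a pair whose spatial
    # distance does not match the query's feature level.
    cost_level = (pred_levels[:, None] - gt_dist[None, :]).abs()

    cost = cost_h + cost_o + level_weight * cost_level
    rows, cols = linear_sum_assignment(cost.detach().cpu().numpy())
    return rows, cols
```

Because the prior enters only as a soft term in the assignment cost, queries can still be matched across levels when the localization evidence is strong; the paper's actual formulation may measure the level-distance compatibility differently.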

The paper offers insights into bridging vision and language modalities for HOI detection, proposing a scalable framework that extends beyond predefined interaction categories. The use of multi-level feature maps tailored to interaction distances and the fine-grained body-part descriptions (illustrated by the sketch below) are the main methodological advances, and are particularly relevant for scenarios where interactions are specified by text.
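As an illustration of how LLM-generated body-part descriptions could be combined with category names for open-vocabulary interaction scoring, the sketch below encodes both with the CLIP text encoder and mixes the two similarity scores. The example descriptions, the mixing weight `alpha`, and the helper names are assumptions rather than the paper's exact recipe.

```python
import torch
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def encode_texts(texts):
    """Encode strings with the CLIP text encoder and L2-normalize them."""
    tokens = clip.tokenize(texts).to(device)
    with torch.no_grad():
        emb = model.encode_text(tokens).float()
    return emb / emb.norm(dim=-1, keepdim=True)

# An interaction category plus body-part state descriptions of the kind a GPT
# model might generate (the texts below are illustrative, not from the paper).
category_name = "a person riding a bicycle"
body_part_states = [
    "the hands are gripping the handlebars",
    "the feet are pressing on the pedals",
    "the legs are bent and pedaling",
]

name_emb = encode_texts([category_name])                          # (1, D)
part_emb = encode_texts(body_part_states).mean(0, keepdim=True)   # (1, D)
part_emb = part_emb / part_emb.norm(dim=-1, keepdim=True)

def interaction_score(visual_feat, alpha=0.5):
    """Score an L2-normalized interaction feature (1, D) against both the
    coarse category-name embedding and the fine-grained body-part embedding;
    alpha is an assumed mixing weight."""
    return alpha * (visual_feat @ name_emb.t()) + (1 - alpha) * (visual_feat @ part_emb.t())
```

The design choice here is simply that body-part descriptions act as an auxiliary text signal alongside the category name, so rare interactions can still be recognized from the states of hands, feet, or legs even when the category name itself is a weak cue.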

Overall, the paper demonstrates an effective synergy between vision-language models and LLMs for open-vocabulary detection, with strong numerical results across both benchmark datasets, underscoring the framework's capacity to generalize beyond traditional closed-set methods.
