Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection (2408.02484v1)

Published 5 Aug 2024 in cs.CV

Abstract: Zero-shot Human-Object Interaction (HOI) detection has emerged as a frontier topic due to its capability to detect HOIs beyond a predefined set of categories. This task entails not only identifying the interactiveness of human-object pairs and localizing them but also recognizing both seen and unseen interaction categories. In this paper, we introduce a novel framework for zero-shot HOI detection using Conditional Multi-Modal Prompts, namely CMMP. This approach enhances the generalization of large foundation models, such as CLIP, when fine-tuned for HOI detection. Unlike traditional prompt-learning methods, we propose learning decoupled vision and language prompts for interactiveness-aware visual feature extraction and generalizable interaction classification, respectively. Specifically, we integrate prior knowledge of different granularity into conditional vision prompts, including an input-conditioned instance prior and a global spatial pattern prior. The former encourages the image encoder to treat instances belonging to seen or potentially unseen HOI concepts equally while the latter provides representative plausible spatial configuration of the human and object under interaction. Besides, we employ language-aware prompt learning with a consistency constraint to preserve the knowledge of the large foundation model to enable better generalization in the text branch. Extensive experiments demonstrate the efficacy of our detector with conditional multi-modal prompts, outperforming previous state-of-the-art on unseen classes of various zero-shot settings. The code and models are available at \url{https://github.com/ltttpku/CMMP}.

Summary

  • The paper introduces CMMP, a novel zero-shot HOI detection framework that employs decoupled conditional vision and language prompts to improve generalization.
  • The method integrates conditional vision prompts with instance and spatial priors, and language-aware prompt learning with consistency constraints for robust feature extraction and classification.
  • Evaluations on the HICO-DET dataset demonstrate that CMMP achieves state-of-the-art performance in various zero-shot settings, significantly enhancing detection of unseen human-object interactions.

The paper "Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection" introduces a novel framework for zero-shot Human-Object Interaction (HOI) detection, aimed at recognizing both seen and unseen interaction categories within images. The proposed approach, termed Conditional Multi-Modal Prompts (CMMP), seeks to improve the generalization capabilities of large foundation models, such as CLIP, which are fine-tuned for the task of HOI detection.

Key Contributions:

  1. Conditional Multi-Modal Prompts (CMMP): CMMP enhances zero-shot HOI detection by utilizing decoupled vision and language prompts. These prompts serve distinct purposes: vision prompts focus on interactiveness-aware visual feature extraction, while language prompts support generalizable interaction classification.
  2. Vision Prompts with Prior Knowledge: The approach introduces conditional vision prompts that integrate priors of different granularity, specifically:
    • Input-conditioned Instance Prior: Encourages the image encoder to equally prioritize instances that belong to either seen or potentially unseen HOI concepts.
    • Global Spatial Pattern Prior: Provides a representative plausible spatial configuration of human and object interactions, serving as a bridge for transferring knowledge between seen and unseen interactions.
  3. Language-aware Prompt Learning: A consistency constraint preserves the foundation model's pretrained knowledge, enabling better generalization to unseen classes. Human-designed prompts regularize the learned soft prompts so that the text branch stays within CLIP's original semantic space (see the second sketch after this list).
  4. Structured Zero-shot HOI Detection Framework: The proposed framework divides the detection process into two key tasks to mitigate error propagation:
    • Interactiveness-aware Visual Feature Extraction: Employs conditional vision prompts integrated into the image encoder via cross-attention (see the first sketch after this list).
    • Interaction Classification: Utilizes conditional language prompts, constrained by a consistency loss to prevent divergence from CLIP's original semantic space.
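
To make the conditioning mechanism concrete, below is a minimal PyTorch sketch of conditional vision prompts. It is an illustration under assumed shapes and module names, not the authors' implementation: learnable prompt tokens cross-attend to an input-conditioned instance prior and a global spatial-pattern prior, and the conditioned tokens are then fed to the image encoder alongside its patch tokens.

```python
import torch
import torch.nn as nn


class ConditionalVisionPrompts(nn.Module):
    """Learnable prompt tokens conditioned on instance and spatial priors (illustrative)."""

    def __init__(self, dim: int = 768, num_prompts: int = 8, num_heads: int = 8):
        super().__init__()
        # Base prompt tokens shared across all images.
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        # Cross-attention: prompt tokens query the concatenated priors.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, instance_prior: torch.Tensor, spatial_prior: torch.Tensor):
        # instance_prior: (B, N_inst, dim)  per-image instance features
        # spatial_prior:  (B, N_spat, dim)  global spatial-pattern embeddings
        batch = instance_prior.size(0)
        queries = self.prompts.unsqueeze(0).expand(batch, -1, -1)     # (B, P, dim)
        context = torch.cat([instance_prior, spatial_prior], dim=1)   # (B, N, dim)
        conditioned, _ = self.cross_attn(queries, context, context)   # (B, P, dim)
        return self.norm(queries + conditioned)


# Usage: prepend the conditioned prompts to the image encoder's patch tokens.
B, dim = 2, 768
prompter = ConditionalVisionPrompts(dim=dim)
instance_prior = torch.randn(B, 10, dim)   # e.g., pooled features of detected instances
spatial_prior = torch.randn(B, 64, dim)    # e.g., embeddings of human-object spatial layouts
patch_tokens = torch.randn(B, 196, dim)    # ViT patch tokens for a 224x224 image
tokens = torch.cat([prompter(instance_prior, spatial_prior), patch_tokens], dim=1)
```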

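The consistency constraint on the text branch can be sketched similarly. The snippet below uses OpenAI's open-source CLIP package; the soft-prompt parameterization and loss form are simplified assumptions rather than the paper's exact formulation. Learned class embeddings are regularized toward the embeddings of human-designed prompts so the text branch stays close to CLIP's pretrained semantic space.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

interactions = ["riding a bicycle", "holding a cup"]  # toy HOI classes

# Frozen anchors: text embeddings of hand-crafted prompts.
with torch.no_grad():
    tokens = clip.tokenize([f"a photo of a person {v}" for v in interactions])
    anchors = F.normalize(model.encode_text(tokens.to(device)).float(), dim=-1)

# Learnable class embeddings standing in for the soft-prompt text features
# (in practice these would be produced by learnable context vectors fed
# through CLIP's text encoder).
soft_embeds = nn.Parameter(anchors.clone())


def consistency_loss(soft: torch.Tensor, anchor: torch.Tensor) -> torch.Tensor:
    # Penalize divergence of the learned embeddings from the hand-crafted anchors.
    soft = F.normalize(soft, dim=-1)
    return (1.0 - (soft * anchor).sum(dim=-1)).mean()


reg = consistency_loss(soft_embeds, anchors)
# `reg` would be added to the interaction-classification objective during training.
```
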
Experimental Results:

  • Comprehensive Evaluation: CMMP is evaluated on the HICO-DET dataset across various zero-shot settings, including Unseen Composition (UC), Rare First Unseen Composition (RF-UC), Non-rare First Unseen Composition (NF-UC), Unseen Object (UO), and Unseen Verb (UV); a simplified construction of such splits is sketched after this list.
  • State-of-the-art Performance: CMMP achieves significant improvements over existing zero-shot HOI detectors, demonstrating its superior generalization ability on unseen classes while maintaining competitive performance on seen classes. For instance, CMMP surpasses previous state-of-the-art models, particularly in challenging settings like RF-UC and NF-UC.
  • Fully Supervised Benchmarking: The framework also provides competitive results under fully supervised settings on both HICO-DET and V-COCO datasets, showcasing its robustness and adaptability.
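
For readers unfamiliar with these protocols, the following hypothetical sketch shows how an unseen-composition split could be derived from (verb, object) annotations; the actual held-out class lists in the paper follow established HICO-DET zero-shot protocols.

```python
from collections import Counter


def make_rare_first_uc_split(annotations, num_unseen=120):
    """annotations: iterable of (verb, object) pairs over the training set."""
    counts = Counter(annotations)
    # Rare-first: the least frequent compositions become the unseen set;
    # non-rare-first (NF-UC) would instead take the most frequent ones.
    ranked = sorted(counts, key=counts.get)  # rarest first
    unseen = set(ranked[:num_unseen])
    seen = set(ranked[num_unseen:])
    return seen, unseen


# Training keeps only interactions whose (verb, object) pair is in `seen`;
# evaluation reports mAP separately on `seen` and `unseen` compositions.
# UO/UV settings analogously hold out whole object or verb categories.
```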

Conclusion:

The paper highlights CMMP's potential to enhance zero-shot HOI detection by leveraging the capabilities of large vision-language models through carefully designed multi-modal prompts. Handling visual feature extraction and interaction classification with decoupled prompts enables effective knowledge transfer and feature alignment, improving the detection of unseen interactions. The authors emphasize that the proposed method not only sets a new state of the art on unseen classes but also offers a promising direction for future research in zero-shot learning.
