- The paper introduces CMMP, a novel zero-shot HOI detection framework that employs decoupled conditional vision and language prompts to improve generalization.
- The method integrates conditional vision prompts with instance and spatial priors, and language-aware prompt learning with consistency constraints for robust feature extraction and classification.
- Evaluations on the HICO-DET dataset demonstrate that CMMP achieves state-of-the-art performance in various zero-shot settings, significantly enhancing detection of unseen human-object interactions.
The paper "Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection" introduces a novel framework for zero-shot Human-Object Interaction (HOI) detection, aimed at recognizing both seen and unseen interaction categories within images. The proposed approach, termed Conditional Multi-Modal Prompts (CMMP), seeks to improve the generalization capabilities of large foundation models, such as CLIP, which are fine-tuned for the task of HOI detection.
Key Contributions:
- Conditional Multi-Modal Prompts (CMMP): CMMP enhances zero-shot HOI detection by utilizing decoupled vision and language prompts. These prompts serve distinct purposes: vision prompts focus on interactiveness-aware visual feature extraction, while language prompts support generalizable interaction classification.
- Vision Prompts with Prior Knowledge: The approach introduces conditional vision prompts that integrate priors of different granularity (a vision-side sketch follows this list), specifically:
- Input-conditioned Instance Prior: Encourages the image encoder to weigh detected instances equally, whether they belong to seen or potentially unseen HOI concepts.
- Global Spatial Pattern Prior: Provides a representative plausible spatial configuration of human and object interactions, serving as a bridge for transferring knowledge between seen and unseen interactions.
- Language-aware Prompt Learning: A consistency constraint preserves the foundation model's pre-trained knowledge, enabling better generalization to unseen classes: human-designed prompts regularize the learned soft prompts so that CLIP's semantic space is retained (a language-side sketch follows this list).
- Structured Zero-shot HOI Detection Framework: The proposed framework divides the detection process into two key tasks to mitigate error propagation:
- Interactiveness-aware Visual Feature Extraction: Integrates the conditional vision prompts into the image encoder via cross-attention (see the cross-attention sketch after this list).
- Interaction Classification: Utilizes conditional language prompts, constrained by a consistency loss to prevent divergence from CLIP's original semantic space.
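The vision side can be pictured as follows: a simple geometric encoding of a human-object box pair serves as the prior, and a small MLP shifts learnable prompt tokens accordingly. Both the geometric features and the conditioning network are assumptions for illustration; the paper's exact design may differ.

```python
# Sketch: conditional vision prompts driven by an instance/spatial prior.
import torch
import torch.nn as nn

def spatial_prior(h_box: torch.Tensor, o_box: torch.Tensor) -> torch.Tensor:
    """Encode a human/object box pair (x1, y1, x2, y2) as a 4-d prior."""
    hcx, hcy = (h_box[0] + h_box[2]) / 2, (h_box[1] + h_box[3]) / 2
    ocx, ocy = (o_box[0] + o_box[2]) / 2, (o_box[1] + o_box[3]) / 2
    hw = (h_box[2] - h_box[0]).clamp(min=1e-6)
    hh = (h_box[3] - h_box[1]).clamp(min=1e-6)
    ow, oh = o_box[2] - o_box[0], o_box[3] - o_box[1]
    # Relative offset of the object from the human, plus log size ratios.
    return torch.stack([(ocx - hcx) / hw, (ocy - hcy) / hh,
                        (ow / hw).log(), (oh / hh).log()])

class ConditionalVisionPrompts(nn.Module):
    """Learnable prompt tokens shifted by an input-conditioned prior."""
    def __init__(self, num_prompts: int = 8, dim: int = 768, prior_dim: int = 4):
        super().__init__()
        self.base = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.condition = nn.Sequential(
            nn.Linear(prior_dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, prior: torch.Tensor) -> torch.Tensor:
        # prior: (B, prior_dim) -> prompts: (B, num_prompts, dim)
        return self.base.unsqueeze(0) + self.condition(prior).unsqueeze(1)
```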
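How those prompts enter the image encoder can be sketched with a standard cross-attention layer; using `nn.MultiheadAttention` here is an assumption for illustration rather than the paper's exact operator.

```python
# Sketch: injecting vision prompts into patch tokens via cross-attention.
import torch
import torch.nn as nn

class PromptCrossAttention(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, prompts: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, dim) patch tokens; prompts: (B, P, dim).
        # Patch tokens query the prompts and are updated residually.
        attended, _ = self.attn(query=tokens, key=prompts, value=prompts)
        return self.norm(tokens + attended)
```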
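On the language side, a minimal sketch of learnable context tokens plus the consistency regularizer might look like the following; the cosine-distance form of the constraint is an assumption, as the paper may use a different distance.

```python
# Sketch: soft language prompts regularized toward hand-crafted prompts.
import torch
import torch.nn as nn

class LanguagePrompts(nn.Module):
    """Learnable context tokens prepended to class-name token embeddings."""
    def __init__(self, num_ctx: int = 16, dim: int = 512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(num_ctx, dim) * 0.02)

    def forward(self, class_tok_embs: torch.Tensor) -> torch.Tensor:
        # class_tok_embs: (C, L, dim) -> (C, num_ctx + L, dim)
        ctx = self.ctx.unsqueeze(0).expand(class_tok_embs.size(0), -1, -1)
        return torch.cat([ctx, class_tok_embs], dim=1)

def consistency_loss(learned: torch.Tensor, handcrafted: torch.Tensor) -> torch.Tensor:
    """Keep learned text embeddings (C, D) near fixed hand-crafted ones."""
    a = learned / learned.norm(dim=-1, keepdim=True)
    b = handcrafted / handcrafted.norm(dim=-1, keepdim=True)
    return (1.0 - (a * b).sum(dim=-1)).mean()
```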
Experimental Results:
- Comprehensive Evaluation: CMMP is evaluated on the HICO-DET dataset across various zero-shot settings, including Unseen Composition (UC), Rare First Unseen Composition (RF-UC), Non-rare First Unseen Composition (NF-UC), Unseen Object (UO), and Unseen Verb (UV).
- State-of-the-art Performance: CMMP improves markedly over existing zero-shot HOI detectors, generalizing better to unseen classes while remaining competitive on seen classes; its gains are largest in the challenging RF-UC and NF-UC settings.
- Fully Supervised Benchmarking: The framework also provides competitive results under fully supervised settings on both HICO-DET and V-COCO datasets, showcasing its robustness and adaptability.
Conclusion:
The paper highlights CMMP's potential to enhance zero-shot HOI detection by leveraging the inherent capabilities of large vision-language models through carefully designed multi-modal prompts. Handling visual feature extraction and interaction classification with decoupled prompts enables effective knowledge transfer and feature alignment, improving the detection of unseen interactions. The authors emphasize that the method not only sets new state-of-the-art results but also points to a promising direction for future research on zero-shot learning.