- The paper introduces CMMP, a novel zero-shot HOI detection framework that employs decoupled conditional vision and language prompts to improve generalization.
- The method integrates conditional vision prompts with instance and spatial priors, and language-aware prompt learning with consistency constraints for robust feature extraction and classification.
- Evaluations on the HICO-DET dataset demonstrate that CMMP achieves state-of-the-art performance in various zero-shot settings, significantly enhancing detection of unseen human-object interactions.
The paper "Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection" introduces a novel framework for zero-shot Human-Object Interaction (HOI) detection, aimed at recognizing both seen and unseen interaction categories within images. The proposed approach, termed Conditional Multi-Modal Prompts (CMMP), seeks to improve the generalization capabilities of large foundation models, such as CLIP, which are fine-tuned for the task of HOI detection.
Key Contributions:
- Conditional Multi-Modal Prompts (CMMP): CMMP enhances zero-shot HOI detection by utilizing decoupled vision and language prompts. These prompts serve distinct purposes: vision prompts focus on interactiveness-aware visual feature extraction, while language prompts support generalizable interaction classification.
- Vision Prompts with Prior Knowledge: The approach introduces conditional vision prompts that integrate priors of different granularity (a vision-side sketch follows this list), specifically:
- Input-conditioned Instance Prior: Encourages the image encoder to weigh detected instances equally, whether they belong to seen or potentially unseen HOI concepts.
- Global Spatial Pattern Prior: Provides a representative plausible spatial configuration of human and object interactions, serving as a bridge for transferring knowledge between seen and unseen interactions.
- Language-aware Prompt Learning: A consistency constraint preserves the foundation model's pre-trained knowledge, enabling better generalization to unseen classes: human-designed prompts regularize the learned soft prompts so that CLIP's semantic space is retained (a language-side sketch follows this list).
- Structured Zero-shot HOI Detection Framework: The proposed framework divides the detection process into two key tasks to mitigate error propagation:
- Interactiveness-aware Visual Feature Extraction: Integrates the conditional vision prompts into the image encoder via cross-attention (see the cross-attention sketch after this list).
- Interaction Classification: Utilizes conditional language prompts, constrained by a consistency loss to prevent divergence from CLIP's original semantic space.
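The vision side can be pictured as follows: a simple geometric encoding of a human-object box pair serves as the prior, and a small MLP shifts learnable prompt tokens accordingly. Both the geometric features and the conditioning network are assumptions for illustration; the paper's exact design may differ.

```python
# Sketch: conditional vision prompts driven by an instance/spatial prior.
import torch
import torch.nn as nn

def spatial_prior(h_box: torch.Tensor, o_box: torch.Tensor) -> torch.Tensor:
    """Encode a human/object box pair (x1, y1, x2, y2) as a 4-d prior."""
    hcx, hcy = (h_box[0] + h_box[2]) / 2, (h_box[1] + h_box[3]) / 2
    ocx, ocy = (o_box[0] + o_box[2]) / 2, (o_box[1] + o_box[3]) / 2
    hw = (h_box[2] - h_box[0]).clamp(min=1e-6)
    hh = (h_box[3] - h_box[1]).clamp(min=1e-6)
    ow, oh = o_box[2] - o_box[0], o_box[3] - o_box[1]
    # Relative offset of the object from the human, plus log size ratios.
    return torch.stack([(ocx - hcx) / hw, (ocy - hcy) / hh,
                        (ow / hw).log(), (oh / hh).log()])

class ConditionalVisionPrompts(nn.Module):
    """Learnable prompt tokens shifted by an input-conditioned prior."""
    def __init__(self, num_prompts: int = 8, dim: int = 768, prior_dim: int = 4):
        super().__init__()
        self.base = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.condition = nn.Sequential(
            nn.Linear(prior_dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, prior: torch.Tensor) -> torch.Tensor:
        # prior: (B, prior_dim) -> prompts: (B, num_prompts, dim)
        return self.base.unsqueeze(0) + self.condition(prior).unsqueeze(1)
```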
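How those prompts enter the image encoder can be sketched with a standard cross-attention layer; using `nn.MultiheadAttention` here is an assumption for illustration rather than the paper's exact operator.

```python
# Sketch: injecting vision prompts into patch tokens via cross-attention.
import torch
import torch.nn as nn

class PromptCrossAttention(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, prompts: torch.Tensor) -> torch.Tensor:
        # tokens: (B, T, dim) patch tokens; prompts: (B, P, dim).
        # Patch tokens query the prompts and are updated residually.
        attended, _ = self.attn(query=tokens, key=prompts, value=prompts)
        return self.norm(tokens + attended)
```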
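On the language side, a minimal sketch of learnable context tokens plus the consistency regularizer might look like the following; the cosine-distance form of the constraint is an assumption, as the paper may use a different distance.

```python
# Sketch: soft language prompts regularized toward hand-crafted prompts.
import torch
import torch.nn as nn

class LanguagePrompts(nn.Module):
    """Learnable context tokens prepended to class-name token embeddings."""
    def __init__(self, num_ctx: int = 16, dim: int = 512):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(num_ctx, dim) * 0.02)

    def forward(self, class_tok_embs: torch.Tensor) -> torch.Tensor:
        # class_tok_embs: (C, L, dim) -> (C, num_ctx + L, dim)
        ctx = self.ctx.unsqueeze(0).expand(class_tok_embs.size(0), -1, -1)
        return torch.cat([ctx, class_tok_embs], dim=1)

def consistency_loss(learned: torch.Tensor, handcrafted: torch.Tensor) -> torch.Tensor:
    """Keep learned text embeddings (C, D) near fixed hand-crafted ones."""
    a = learned / learned.norm(dim=-1, keepdim=True)
    b = handcrafted / handcrafted.norm(dim=-1, keepdim=True)
    return (1.0 - (a * b).sum(dim=-1)).mean()
```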
Experimental Results:
- Comprehensive Evaluation: CMMP is evaluated on the HICO-DET dataset across various zero-shot settings, including Unseen Composition (UC), Rare First Unseen Composition (RF-UC), Non-rare First Unseen Composition (NF-UC), Unseen Object (UO), and Unseen Verb (UV).
- State-of-the-art Performance: CMMP improves markedly over existing zero-shot HOI detectors, generalizing better to unseen classes while remaining competitive on seen classes; its gains are largest in the challenging RF-UC and NF-UC settings.
- Fully Supervised Benchmarking: The framework also provides competitive results under fully supervised settings on both HICO-DET and V-COCO datasets, showcasing its robustness and adaptability.
Conclusion:
The paper highlights CMMP's potential to enhance zero-shot HOI detection by leveraging the inherent capabilities of large vision-language models through carefully designed multi-modal prompts. Handling visual feature extraction and interaction classification with decoupled prompts enables effective knowledge transfer and feature alignment, improving the detection of unseen interactions. The authors emphasize that the method not only sets new state-of-the-art results but also points to a promising direction for future research on zero-shot learning.