CLIP’s ability to perceive actions in image–text matching
Determine whether CLIP (Contrastive Language-Image Pre-training) can effectively perceive actions, defined as states or relationships between objects, in the image-text matching task, and establish the extent of its action-level visual perception and alignment capabilities.
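One way to make this question concrete is a minimal probe: score one image against two captions that mention the same objects but differ only in the action, so any score gap must come from action-level perception rather than object recognition. The sketch below assumes the Hugging Face `transformers` library and the public `openai/clip-vit-base-patch32` checkpoint; the image URL (the standard COCO demo image of two cats on a couch) and the contrastive captions are illustrative choices, not taken from the paper.

```python
# Minimal action-sensitivity probe for CLIP image-text matching.
# Assumptions: transformers + openai/clip-vit-base-patch32; captions
# and image are illustrative, not the paper's evaluation protocol.
import requests
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Standard transformers demo image: two cats lying on a couch.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Same objects, different action: only action-level perception can
# separate these two captions.
captions = [
    "two cats lying on a couch",
    "two cats jumping off a couch",
]

inputs = processor(text=captions, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape: (1, num_captions)
probs = logits.softmax(dim=1)

for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```

If CLIP relied purely on object co-occurrence, the two captions would score nearly identically; a consistent preference for the correct action across many such pairs would indicate genuine action-level alignment.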
References
Despite the impressive performance, whether CLIP can effectively perceive actions (i.e., states or relationships between objects) in the image-text matching task remains unresolved.
— LLM-enhanced Action-aware Multi-modal Prompt Tuning for Image-Text Matching
(arXiv:2506.23502, Tian et al., 30 Jun 2025), Section 1 (Introduction)