CLIP’s ability to perceive actions in image–text matching

Determine whether CLIP (Contrastive Language–Image Pre-training) can effectively perceive actions, understood as states of or relationships between objects, in the image–text matching task, and establish the extent of its action-level visual perception and alignment capabilities.

Background

The paper studies image–text matching using CLIP, noting that while CLIP excels at global alignment, it often misses fine-grained visual information such as object attributes and spatial relationships. Recent prompt-based methods improve object-level understanding but typically do not address action perception.

The authors highlight failure cases and statistical analyses indicating CLIP’s inadequacies in accurately perceiving actions. Against this backdrop, they explicitly state that CLIP’s ability to perceive actions in image–text matching remains unresolved, motivating their LLM-enhanced, action-aware prompt-tuning approach.
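To make the kind of probe behind such failure cases concrete, the sketch below scores one image against two captions that differ only in the action verb, using the public openai/clip-vit-base-patch32 checkpoint through the Hugging Face transformers library. The image path and captions are placeholders; this is a minimal illustration of zero-shot CLIP image–text matching, not the paper's evaluation protocol or its prompt-tuning method.

```python
# Minimal action-perception probe for CLIP: score one image against two
# captions that share the same objects but differ in the action verb.
# A model with weak action perception assigns similar scores to both.
# Illustrative sketch only; the image path is a placeholder.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

image = Image.open("example.jpg")  # placeholder: e.g. a dog chasing a ball
captions = [
    "a dog chasing a ball",   # action matches the image
    "a dog ignoring a ball",  # same objects, contradictory action
]

inputs = processor(text=captions, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled image-text cosine similarities;
# softmax converts them into matching probabilities over the captions.
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for caption, p in zip(captions, probs.tolist()):
    print(f"{p:.3f}  {caption}")
```

If the two probabilities come out nearly equal, the match is being driven by the shared objects rather than the action, which is exactly the gap the open question concerns.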

References

Despite the impressive performance, whether CLIP can effectively perceive actions (i.e., states or relationships between objects) in the image-text matching task remains unresolved.

Tian et al., "LLM-enhanced Action-aware Multi-modal Prompt Tuning for Image-Text Matching," arXiv:2506.23502, 30 Jun 2025, Section 1 (Introduction).