CLIPScore: A Reference-free Evaluation Metric for Image Captioning
This paper introduces CLIPScore, a reference-free metric for evaluating image captioning quality built on the CLIP model. Image captioning has traditionally relied on reference-based automatic evaluation, in which machine-generated captions are compared against human-authored reference captions; these comparisons often fail to align well with human judgments. The paper presents empirical evidence that CLIP, a cross-modal model pretrained on 400 million image-caption pairs gathered from the web, can be used for robust automatic evaluation without any reference captions.
Methodology and Experiments
The authors build on the CLIP model, which learns aligned representations for images and text. They introduce CLIPScore (abbreviated CLIP-S), which scores an image-caption pair by embedding the image and the candidate caption with CLIP and taking a rescaled cosine similarity between the two embeddings, with negative similarities clipped to zero. Because it measures image-text compatibility directly, the metric requires no reference captions, yet it correlates strongly with human judgments.
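The sketch below illustrates this computation, assuming the publicly released ViT-B/32 CLIP weights accessed through Hugging Face transformers ("openai/clip-vit-base-patch32"); the `clip_score` helper and the exact preprocessing are illustrative rather than the authors' released code. The rescaling constant w = 2.5 and the clipping of negative similarities at zero follow the paper's definition.

```python
# Minimal CLIP-S sketch: CLIP-S = w * max(cos(image_emb, text_emb), 0), with w = 2.5.
# Assumes the "openai/clip-vit-base-patch32" checkpoint via Hugging Face transformers;
# helper names are illustrative, not the authors' released implementation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str, w: float = 2.5) -> float:
    """Reference-free CLIP-S for a single image-caption pair."""
    # The paper prepends the prompt "A photo depicts" to the candidate caption.
    text = "A photo depicts " + caption
    inputs = processor(text=[text], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    # L2-normalize both embeddings, then take their cosine similarity.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    cos = (image_emb * text_emb).sum(dim=-1).item()
    return w * max(cos, 0.0)
```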
Numerical Results and Comparisons
The empirical analysis spans multiple standard image captioning benchmarks, illustrating that CLIPScore surpasses existing reference-based metrics such as CIDEr and SPICE in terms of correlation with human judgments. For instance, on the Flickr8K-Expert corpus, CLIP-S achieves a Kendall correlation of 51.2, outperforming ViLBERTScore-F's 50.1. Similarly, on the Composite corpus, CLIP-S achieves a correlation of 53.8, compared to ViLBERTScore-F's 52.4.
Complementarity and Extensions
The paper further demonstrates that CLIPScore's image-text alignment complements traditional reference-based metrics that emphasize text-to-text similarity. To leverage this complementarity, the authors propose RefCLIPScore, which combines CLIPScore with the maximum cosine similarity between the candidate and the reference captions via a harmonic mean, yielding even higher correlations. For example, RefCLIPScore achieves a Kendall correlation of 55.4 on the Composite corpus, showing the additional benefit of integrating reference information.
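A corresponding sketch of RefCLIPScore is shown below, reusing `model`, `processor`, and `clip_score` from the previous block. The harmonic-mean combination of CLIP-S with the clipped maximum candidate-reference cosine similarity follows the paper's formulation, while the `ref_clip_score` helper and its text handling are assumptions for illustration.

```python
# Sketch of RefCLIPScore: harmonic mean of reference-free CLIP-S and the maximum
# cosine similarity between the candidate caption and the references (clipped at 0).
# Reuses `model`, `processor`, and `clip_score` from the previous sketch.
def ref_clip_score(image, caption: str, references: list[str], w: float = 2.5) -> float:
    clip_s = clip_score(image, caption, w=w)
    inputs = processor(text=[caption] + references, return_tensors="pt",
                       padding=True, truncation=True)
    with torch.no_grad():
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    cand, refs = text_emb[0], text_emb[1:]
    ref_sim = max((refs @ cand).max().item(), 0.0)  # best-matching reference
    if clip_s == 0.0 or ref_sim == 0.0:             # harmonic mean is 0 if either term is 0
        return 0.0
    return 2.0 * clip_s * ref_sim / (clip_s + ref_sim)
```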
Adversarial Robustness and Unseen Data
The authors validate CLIPScore's robustness through several experiments, including its sensitivity to adversarially corrupted captions such as those in the FOIL dataset. CLIPScore identifies the incorrect captions with high accuracy despite using no references, remaining competitive with reference-based metrics that are given only a single reference. Furthermore, to address concerns about memorization of the pretraining data, the authors test CLIPScore on a set of previously unseen images, where it achieves 86% agreement with human judgments, reinforcing its generalizability.
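As a toy illustration of this pairwise setting (not the authors' evaluation code, and with a hypothetical image path and captions), the corrupted caption counts as detected when the true caption receives a strictly higher score under the `clip_score` sketch above.

```python
# Toy FOIL-style pairwise check (hypothetical data): the corrupted caption is
# "detected" when the true caption scores strictly higher under CLIP-S.
from PIL import Image

image = Image.open("example.jpg")  # placeholder path, not from the FOIL dataset
true_caption = "A dog catches a frisbee in the park."
foil_caption = "A cat catches a frisbee in the park."  # one noun swapped

detected = clip_score(image, true_caption) > clip_score(image, foil_caption)
print("FOIL caption detected:", detected)
```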
Domain-specific Case Studies
The applicability of CLIPScore extends beyond literal image descriptions to more diverse domains. Four case studies explore correlations between CLIPScore and human judgment in non-standard scenarios:
- Alt-Text on Twitter: CLIPScore achieves a Kendall correlation of 48.4 for alt-text quality, demonstrating robustness in a setting where reference-based methods falter because the tweet text used as a reference is unreliable.
- Clip-Art Description (Abstract-50S): Despite lower performance relative to reference-based metrics, CLIPScore significantly outperforms baselines, indicating its capacity to handle non-photographic images.
- Personality Captions: While CLIPScore favors literal descriptions over engaging captions, it performs decently in predicting engagingness when comparing two non-literal descriptions.
- News Captions: CLIPScore underperforms in news image captioning, where context and named entities play a crucial role, reaffirming the strength of traditional reference-based metrics in this domain.
Implications and Future Work
This research has both practical and theoretical implications. Practically, CLIPScore offers a viable alternative when reference captions are unavailable or costly to obtain, simplifying the evaluation of image captioning models. Theoretically, the high correlation with human judgment points to the potential of pretrained vision-language models to evaluate generation tasks effectively.
Future developments may focus on fine-tuning CLIPScore with domain-specific human ratings, exploring its utility in reinforcement learning for caption generation, and further investigating the mitigation of biases inherent in pretrained models. The authors also emphasize caution, noting that CLIPScore, like all model-based metrics, mirrors the biases present in its training data, necessitating careful application and further scrutiny for ethical deployment.
In summary, CLIPScore represents a significant stride towards effective, reference-free evaluation for image captioning, with broad implications for improving automated assessments across varied domains.