CLIPScore: A Reference-free Evaluation Metric for Image Captioning (2104.08718v3)

Published 18 Apr 2021 in cs.CV and cs.CL

Abstract: Image captioning has conventionally relied on reference-based automatic evaluations, where machine captions are compared against captions written by humans. This is in contrast to the reference-free manner in which humans assess caption quality. In this paper, we report the surprising empirical finding that CLIP (Radford et al., 2021), a cross-modal model pretrained on 400M image+caption pairs from the web, can be used for robust automatic evaluation of image captioning without the need for references. Experiments spanning several corpora demonstrate that our new reference-free metric, CLIPScore, achieves the highest correlation with human judgements, outperforming existing reference-based metrics like CIDEr and SPICE. Information gain experiments demonstrate that CLIPScore, with its tight focus on image-text compatibility, is complementary to existing reference-based metrics that emphasize text-text similarities. Thus, we also present a reference-augmented version, RefCLIPScore, which achieves even higher correlation. Beyond literal description tasks, several case studies reveal domains where CLIPScore performs well (clip-art images, alt-text rating), but also where it is relatively weaker in comparison to reference-based metrics, e.g., news captions that require richer contextual knowledge.

CLIPScore: A Reference-free Evaluation Metric for Image Captioning

This paper explores a new approach to evaluating image captioning quality by introducing CLIPScore, a reference-free metric built on the CLIP model. Traditionally, image captioning has depended on reference-based automatic evaluations that compare machine-generated captions against human-authored ones, an approach whose judgments often diverge from human assessments of caption quality. The paper presents empirical evidence that CLIP, a cross-modal model pretrained on 400 million image-caption pairs from the web, can support robust automatic evaluation without reference captions.

Methodology and Experiments

The authors employ the CLIP model, which learns aligned representations for images and text. They introduce CLIPScore (abbreviated CLIP-S), which assesses the compatibility of an image-caption pair by computing the cosine similarity between their CLIP embeddings, clipped at zero and rescaled by a constant factor. The metric correlates strongly with human judgments while focusing solely on image-text compatibility, with no reference captions required.
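
A minimal sketch of this computation, assuming image and caption embeddings have already been extracted with a CLIP encoder such as ViT-B/32; the function name is illustrative, and the rescaling constant w = 2.5 follows the paper's definition CLIP-S(c, v) = w · max(cos(c, v), 0):

```python
import numpy as np

def clip_score(image_emb: np.ndarray, caption_emb: np.ndarray, w: float = 2.5) -> float:
    """CLIP-S(c, v) = w * max(cos(c, v), 0), computed from precomputed CLIP embeddings."""
    # L2-normalize both embeddings so their dot product equals the cosine similarity.
    v = image_emb / np.linalg.norm(image_emb)
    c = caption_emb / np.linalg.norm(caption_emb)
    cos_sim = float(np.dot(v, c))
    # Negative similarities are clipped to zero; w stretches typical scores
    # into a more readable range.
    return w * max(cos_sim, 0.0)
```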

Numerical Results and Comparisons

The empirical analysis spans multiple standard image captioning benchmarks, illustrating that CLIPScore surpasses existing reference-based metrics such as CIDEr and SPICE in terms of correlation with human judgments. For instance, on the Flickr8K-Expert corpus, CLIP-S achieves a Kendall τ_c correlation of 51.2, outperforming ViLBERTScore-F's 50.1. Similarly, on the Composite corpus, CLIP-S achieves a τ_c correlation of 53.8, compared to ViLBERTScore-F's 52.4.
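
The correlation protocol behind these numbers can be reproduced with a standard Kendall τ_c implementation. The sketch below is illustrative and assumes per-caption metric scores and human ratings are available as parallel sequences (SciPy ≥ 1.7 exposes the τ_c variant):

```python
from scipy.stats import kendalltau

def tau_c(metric_scores, human_ratings) -> float:
    """Kendall's tau-c rank correlation between automatic scores and human ratings."""
    tau, _p_value = kendalltau(metric_scores, human_ratings, variant="c")
    return tau
```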

Complementarity and Extensions

The paper further demonstrates that CLIPScore's image-text alignment complements traditional reference-based metrics that emphasize text-to-text similarity. To leverage this complementarity, the authors propose RefCLIPScore, which combines CLIP-S with the maximal candidate-reference cosine similarity via a harmonic mean, yielding even higher correlations. For example, RefCLIPScore achieves a τ_c of 55.4 on the Composite corpus, showing the additional benefit of integrating reference information.
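
Concretely, RefCLIP-S takes the harmonic mean of CLIP-S and the maximal cosine similarity between the candidate caption and the references in CLIP's text-embedding space. A self-contained sketch under that reading (names are illustrative, embeddings precomputed):

```python
import numpy as np

def ref_clip_score(image_emb, caption_emb, reference_embs, w: float = 2.5) -> float:
    """RefCLIP-S: harmonic mean of CLIP-S and the best candidate-reference similarity."""
    def unit(x):
        return x / np.linalg.norm(x)

    v, c = unit(image_emb), unit(caption_emb)
    clip_s = w * max(float(np.dot(v, c)), 0.0)  # CLIP-S(c, v)
    ref_cos = max(max(float(np.dot(c, unit(r))) for r in reference_embs), 0.0)
    if clip_s == 0.0 or ref_cos == 0.0:
        return 0.0
    return 2.0 * clip_s * ref_cos / (clip_s + ref_cos)  # harmonic mean
```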

Adversarial Robustness and Unseen Data

The authors validate CLIPScore's robustness through various experiments, including its sensitivity to adversarially altered captions, such as those from the FOIL dataset. CLIPScore successfully identifies incorrect captions with high accuracy, maintaining competitive performance even with minimal reference data. Furthermore, to address concerns related to memorization from pretraining data, the authors test CLIPScore on a unique dataset of previously unseen images, achieving an 86% agreement with human judgments, thereby reinforcing its generalizability.
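
The FOIL evaluation reduces to a pairwise choice: the metric is credited whenever it scores the true caption above its minimally edited foil. A brief sketch of that accuracy computation, reusing the hypothetical clip_score helper from the earlier sketch:

```python
def foil_pairwise_accuracy(examples) -> float:
    """Fraction of (image_emb, true_caption_emb, foil_caption_emb) triples
    for which CLIP-S prefers the true caption over the foil."""
    examples = list(examples)
    correct = sum(
        clip_score(img, true_cap) > clip_score(img, foil_cap)
        for img, true_cap, foil_cap in examples
    )
    return correct / len(examples) if examples else 0.0
```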

Domain-specific Case Studies

The applicability of CLIPScore extends beyond literal image descriptions to more diverse domains. Four case studies explore correlations between CLIPScore and human judgment in non-standard scenarios:

  1. Alt-Text on Twitter: CLIPScore achieves a τ_c of 48.4 for alt-text quality, demonstrating robustness where reference-based methods falter due to unreliable tweet contexts.
  2. Clip-Art Description (Abstract-50S): Despite lower performance relative to reference-based metrics, CLIPScore significantly outperforms baselines, indicating its capacity to handle non-photographic images.
  3. Personality Captions: While CLIPScore favors literal descriptions over engaging captions, it performs decently in predicting engagingness when comparing two non-literal descriptions.
  4. News Captions: CLIPScore underperforms in news image captioning, where context and named entities play a crucial role, reaffirming the strength of traditional reference-based metrics in this domain.

Implications and Future Work

This research has both practical and theoretical implications. Practically, CLIPScore offers a viable alternative for scenarios where reference captions are unavailable or costly to obtain, simplifying the evaluation of image captioning models. Theoretically, the high correlation with human judgment indicates that pretrained vision-language models can effectively assess generation tasks.

Future developments may focus on fine-tuning CLIPScore with domain-specific human ratings, exploring its utility in reinforcement learning for caption generation, and further investigating the mitigation of biases inherent in pretrained models. The authors also emphasize caution, noting that CLIPScore, like all model-based metrics, mirrors the biases present in its training data, necessitating careful application and further scrutiny for ethical deployment.

In summary, CLIPScore represents a significant stride towards effective, reference-free evaluation for image captioning, with broad implications for improving automated assessments across varied domains.

Authors (5)
  1. Jack Hessel (50 papers)
  2. Ari Holtzman (39 papers)
  3. Maxwell Forbes (14 papers)
  4. Ronan Le Bras (56 papers)
  5. Yejin Choi (287 papers)
Citations (1,100)