- The paper introduces CIDEr, a novel metric that evaluates image description quality based on human consensus using TF-IDF weighted n-grams.
- It demonstrates CIDEr's superiority over traditional metrics such as BLEU and ROUGE through extensive experiments and human annotation benchmarks.
- The work introduces two new datasets and makes a modified variant, CIDEr-D, available on the MS COCO caption evaluation server, enabling systematic and reliable evaluation of image captions.
CIDEr: Consensus-based Image Description Evaluation
In the field of computer vision and natural language processing, automatically generating descriptions for images has been a longstanding challenge. Traditional evaluation metrics such as BLEU and ROUGE, adapted from machine translation and text summarization tasks, have been shown to correlate weakly with human judgment when applied to image descriptions. The paper "CIDEr: Consensus-based Image Description Evaluation" addresses this evaluation challenge by introducing a novel metric called CIDEr (Consensus-based Image Description Evaluation), which better captures human consensus in image descriptions.
Key Contributions
This paper makes the following pivotal contributions:
- Introduction of a Consensus-Based Evaluation Protocol: The protocol measures how well a candidate description agrees with the consensus of human-generated descriptions for the same image. Human annotations are collected to capture this consensus and are then used to benchmark both automatic metrics and machine-generated descriptions.
- Development of the CIDEr Metric: The core of this work is the CIDEr metric, which computes the consensus of a candidate sentence with a set of human-written reference sentences. CIDEr weights each n-gram by its Term Frequency Inverse Document Frequency (TF-IDF) to reflect importance and saliency, and averages cosine similarities between the candidate and the reference sentences to produce a final score (the scoring formula is reproduced after this list).
- Creation of Two New Datasets: The authors introduce the PASCAL-50S and ABSTRACT-50S datasets, each containing 50 human-generated descriptions per image. This substantial increase in reference sentences per image provides a more reliable measure of consensus, which is crucial for accurate evaluation.
- Release of CIDEr-D on the MS COCO Evaluation Server: To facilitate broad adoption and systematic benchmarking, a modified version of CIDEr, named CIDEr-D, has been integrated into the MS COCO caption evaluation server.
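Reproducing the scoring rule from the paper (notation as in the original: $c_i$ is the candidate sentence for image $I_i$, $S_i = \{s_{i1}, \dots, s_{im}\}$ its reference set, and $g^n(\cdot)$ the vector of TF-IDF weights over all n-grams of length $n$):

$$
g_k(s_{ij}) = \frac{h_k(s_{ij})}{\sum_{\omega_l \in \Omega} h_l(s_{ij})}\,\log\!\left(\frac{|I|}{\sum_{I_p \in I} \min\!\left(1, \sum_q h_k(s_{pq})\right)}\right)
$$

$$
\mathrm{CIDEr}_n(c_i, S_i) = \frac{1}{m} \sum_{j} \frac{g^n(c_i) \cdot g^n(s_{ij})}{\lVert g^n(c_i)\rVert\,\lVert g^n(s_{ij})\rVert},
\qquad
\mathrm{CIDEr}(c_i, S_i) = \sum_{n=1}^{N} w_n\,\mathrm{CIDEr}_n(c_i, S_i)
$$

Here $h_k(s_{ij})$ counts occurrences of n-gram $\omega_k$ in $s_{ij}$, $\Omega$ is the n-gram vocabulary, $I$ is the set of all images, and the paper uses uniform weights $w_n = 1/N$ with $N = 4$.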
Detailed Examination of the CIDEr Metric
The CIDEr metric combines three components (a minimal implementation sketch follows this list):
- TF-IDF Weighting: The metric uses TF-IDF to weight each n-gram, so n-grams that appear in many descriptions across the dataset (and are therefore less informative) are down-weighted, while n-grams that are distinctive for a particular image carry more weight.
- Cosine Similarity: CIDEr computes the cosine similarity between the candidate and reference sentences' n-gram representations. This similarity measure inherently considers both precision and recall aspects of the n-grams.
- Multi-n-grams Fusion: The metric combines scores from n-grams of different lengths (up to 4-grams) to capture both syntactic and semantic richness.
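A minimal sketch of how these pieces combine is shown below, assuming lowercased, whitespace-tokenized sentences. Function names (`ngrams`, `tfidf_vector`, `cider_score`) are illustrative rather than the authors' released implementation, which differs in details such as stemming and, for CIDEr-D, a length penalty and count clipping.

```python
from collections import Counter
from math import log, sqrt

def ngrams(tokens, n):
    """Multiset of n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def idf_table(all_refs, n):
    """Document frequency over images; one 'document' = all references of one image."""
    num_images = len(all_refs)
    df = Counter()
    for refs in all_refs:                      # refs: list of tokenized references for one image
        seen = set()
        for ref in refs:
            seen.update(ngrams(ref, n))
        df.update(seen)
    return {g: log(num_images / df[g]) for g in df}, num_images

def tfidf_vector(tokens, n, idf, num_images):
    """TF-IDF weighted n-gram vector: normalized term frequency times corpus IDF."""
    counts = ngrams(tokens, n)
    total = sum(counts.values()) or 1
    # n-grams unseen in the reference corpus get the maximum IDF, log(num_images)
    return {g: (c / total) * idf.get(g, log(num_images)) for g, c in counts.items()}

def cosine(u, v):
    dot = sum(u[g] * v.get(g, 0.0) for g in u)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider_score(candidate, references, all_refs, max_n=4):
    """Average cosine similarity to the references, averaged uniformly over n = 1..max_n."""
    score = 0.0
    for n in range(1, max_n + 1):
        idf, num_images = idf_table(all_refs, n)
        cand_vec = tfidf_vector(candidate, n, idf, num_images)
        sims = [cosine(cand_vec, tfidf_vector(r, n, idf, num_images)) for r in references]
        score += (sum(sims) / len(sims)) / max_n
    return score
```

Because each cosine similarity lies in [0, 1], this sketch returns scores in [0, 1]; published CIDEr-D numbers are typically on a larger scale because the released implementation applies additional scaling.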
Experimental Setup and Evaluation
The paper evaluates the performance of CIDEr against several existing metrics (BLEU, ROUGE, METEOR) on the newly introduced datasets. The experiments show that CIDEr outperforms these traditional metrics in aligning with human consensus:
- Performance with Increased Reference Sentences: Extensive evaluations reveal that metrics like ROUGE and BLEU benefit from more reference sentences but still lag behind CIDEr, which maintains higher consistency and sensitivity to human consensus.
- Pairwise Judgments and Triplet Annotations: Through user studies, the authors demonstrate that triplet-based human annotations, in which annotators choose which of two candidate sentences is more similar to a reference, are more objective and capture "human-likeness" more effectively than pairwise comparisons (a sketch of the resulting agreement computation follows this list).
- Comprehensive Benchmarking: The paper provides benchmarks for five state-of-the-art image description methods using CIDEr, showing that the metric can distinguish fine-grained differences between machine-generated sentences.
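To make the protocol concrete: a metric is scored by how often its ranking of the two candidate sentences in a triplet agrees with the majority human choice. The sketch below assumes a hypothetical `(references, sentence_b, sentence_c, human_pick)` tuple format and any scoring function with the signature `metric(candidate, references)`, such as the `cider_score` sketch above with its corpus argument bound.

```python
def triplet_accuracy(triplets, metric):
    """Fraction of triplets where the metric prefers the human-preferred sentence.

    Each triplet is (references, sentence_b, sentence_c, human_pick), where
    human_pick is 'B' or 'C', the majority choice of the annotators.
    """
    agreements = 0
    for references, sent_b, sent_c, human_pick in triplets:
        metric_pick = 'B' if metric(sent_b, references) > metric(sent_c, references) else 'C'
        agreements += (metric_pick == human_pick)
    return agreements / len(triplets)
```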
Implications and Future Directions
The introduction of CIDEr has notable theoretical and practical implications:
- Theoretical Implications: By focusing on human consensus, CIDEr invites a reevaluation of how metrics are designed for natural language tasks in AI. It highlights the necessity of metrics that align more closely with human judgments, addressing the subjective nature of tasks like image description.
- Practical Implications: Practically, CIDEr offers a reliable metric for evaluating and benchmarking image description systems, enabling more effective comparisons and fostering improvements in the generation algorithms.
- Future Directions: Future research might explore further refinements to CIDEr to handle even more nuanced aspects of human consensus, such as context-dependent descriptions and multi-modal data. Additionally, efforts could focus on expanding the datasets to cover a wider range of scenarios and image complexities.
Conclusion
The introduction of CIDEr represents a significant step forward in the automated evaluation of image descriptions. By capturing human consensus more effectively than traditional metrics, CIDEr provides a robust tool for advancing research in computer vision and natural language processing. The integration of CIDEr-D into the MS COCO evaluation server further underscores its utility and potential for broad adoption in the research community.