SummEval: Re-evaluating Summarization Evaluation

Published 24 Jul 2020 in cs.CL (arXiv: 2007.12626v4)

Abstract: The scarcity of comprehensive up-to-date studies on evaluation metrics for text summarization and the lack of consensus regarding evaluation protocols continue to inhibit progress. We address the existing shortcomings of summarization evaluation methods along five dimensions: 1) we re-evaluate 14 automatic evaluation metrics in a comprehensive and consistent fashion using neural summarization model outputs along with expert and crowd-sourced human annotations, 2) we consistently benchmark 23 recent summarization models using the aforementioned automatic evaluation metrics, 3) we assemble the largest collection of summaries generated by models trained on the CNN/DailyMail news dataset and share it in a unified format, 4) we implement and share a toolkit that provides an extensible and unified API for evaluating summarization models across a broad range of automatic metrics, 5) we assemble and share the largest and most diverse, in terms of model types, collection of human judgments of model-generated summaries on the CNN/Daily Mail dataset annotated by both expert judges and crowd-source workers. We hope that this work will help promote a more complete evaluation protocol for text summarization as well as advance research in developing evaluation metrics that better correlate with human judgments.

Citations (615)

Summary

  • The paper demonstrates that 14 evaluation metrics show variable correlations with human judgments, especially for coherence and relevance.
  • It leverages a large collection of CNN/DailyMail model outputs with expert and crowd-sourced annotations to benchmark 23 neural summarization models.
  • The study introduces an adaptable API and standardized protocol, encouraging future improvements in summarization evaluation practices.


Introduction

The paper "SummEval: Re-evaluating Summarization Evaluation" presents a comprehensive and consistent re-evaluation of automatic evaluation metrics for text summarization. It leverages outputs from various neural summarization models, coupled with expert and crowd-sourced human annotations. The study is motivated by limitations in existing summarization evaluation methods and introduces several improvements across five dimensions, including re-evaluation of 14 automatic metrics and providing a benchmark for 23 recent summarization models.

Evaluation Metrics and Summarization Models

Evaluation Metrics

The study examines 14 evaluation metrics, including ROUGE, BERTScore, MoverScore, S^3, CHRF, and CIDEr. These metrics are evaluated for how well they correlate with human judgments across four quality dimensions: coherence, consistency, fluency, and relevance.
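As an illustration of how reference-based metrics of this kind are computed, the sketch below scores one candidate summary against a reference using the open-source rouge_score and bert_score packages. These packages are stand-ins for the paper's own toolkit, and the example text is made up.

```python
# Minimal sketch: scoring one candidate summary against a reference with
# two of the metric families discussed above (ROUGE and BERTScore).
# Uses the open-source `rouge_score` and `bert_score` packages, not the
# paper's toolkit; install with `pip install rouge-score bert-score`.
from rouge_score import rouge_scorer
from bert_score import score as bertscore

reference = "The quick brown fox jumps over the lazy dog."
candidate = "A quick brown fox jumped over a lazy dog."

# ROUGE: n-gram and longest-common-subsequence overlap with the reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print({name: round(s.fmeasure, 3) for name, s in rouge.items()})

# BERTScore: token-level similarity in contextual embedding space.
P, R, F1 = bertscore([candidate], [reference], lang="en", verbose=False)
print("BERTScore F1:", round(F1.item(), 3))
```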

Summarization Models

The paper evaluates outputs from 23 summarization models using these metrics. The models are categorized into extractive and abstractive methods, including BART, Pegasus, and T5. These models are indicative of recent advancements in neural summarization approaches.
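For context, abstractive models of this family are commonly run through the Hugging Face transformers library. The sketch below generates a summary with a BART checkpoint fine-tuned on CNN/DailyMail; it shows typical library usage, not the paper's exact generation setup, and the input article is a toy example.

```python
# Illustrative only: generating an abstractive summary with a BART model
# fine-tuned on CNN/DailyMail via Hugging Face transformers. The paper's
# own decoding settings may differ.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "The tower is 324 metres tall, about the same height as an 81-storey "
    "building, and the tallest structure in Paris. It was the first "
    "structure to reach a height of 300 metres."
)
result = summarizer(article, max_length=60, min_length=10, do_sample=False)
print(result[0]["summary_text"])
```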

Methodology

The authors assemble the largest collection of summaries generated by models trained on the CNN/DailyMail dataset, complemented by expert and crowd-sourced human judgments. They also release a toolkit that exposes an extensible, unified API for evaluating summarization models across a broad range of automatic metrics, promoting a standardized evaluation protocol.
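The toolkit's exact interface is not described here, but a unified metric API of this kind typically follows the pattern sketched below. The class and method names (Metric, evaluate_batch) are illustrative assumptions, not the toolkit's documented API.

```python
# Hypothetical sketch of a unified metric API in the spirit of the paper's
# toolkit. Class and method names are assumptions for illustration only.
from abc import ABC, abstractmethod
from typing import Dict, List


class Metric(ABC):
    """Common interface every automatic metric implements."""

    @abstractmethod
    def evaluate_batch(self, summaries: List[str],
                       references: List[str]) -> Dict[str, float]:
        """Return corpus-level scores for a batch of summary/reference pairs."""


class LengthRatioMetric(Metric):
    """Toy metric: average summary-to-reference length ratio."""

    def evaluate_batch(self, summaries, references):
        ratios = [len(s.split()) / max(len(r.split()), 1)
                  for s, r in zip(summaries, references)]
        return {"length_ratio": sum(ratios) / len(ratios)}


# With a shared interface, any set of metrics can be run uniformly:
metrics = [LengthRatioMetric()]
outputs = ["a short model summary"]
refs = ["a somewhat longer human-written reference summary"]
for m in metrics:
    print(type(m).__name__, m.evaluate_batch(outputs, refs))
```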

Human Annotation Analysis

The paper highlights significant discrepancies between expert and crowd-sourced annotations. Krippendorff's alpha coefficients indicate low inter-annotator agreement within both groups initially, with subsequent rounds of expert annotation improving agreement substantially. These results underline the challenges in achieving consistent human evaluations, particularly when distinguishing between coherence, consistency, and relevance (Figure 1).

Figure 1: Histogram of standard deviations of expert annotations and annotations from the first round of annotations across the four quality dimensions.
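Agreement statistics of this kind can be computed with the open-source krippendorff package, as in the sketch below. The small reliability matrix is made-up data, not the paper's annotations.

```python
# Minimal sketch: inter-annotator agreement via Krippendorff's alpha using
# the open-source `krippendorff` package (pip install krippendorff).
# The ratings below are made up, not the paper's actual annotations.
import numpy as np
import krippendorff

# Rows = annotators, columns = summaries; values are 1-5 quality ratings,
# with np.nan marking summaries an annotator did not rate.
ratings = np.array([
    [4, 3, 5, 2, np.nan],
    [4, 2, 5, 3, 4],
    [5, 3, 4, 2, 4],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="ordinal")
print(f"Krippendorff's alpha (ordinal): {alpha:.3f}")
```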

Metric Correlation Analysis

Findings reveal moderate to strong correlations with human judgments of consistency and fluency for most metrics, whereas correlations with coherence and relevance are lower. These tendencies suggest inherent limitations in current metrics' ability to capture coherence and relevance (Figure 2).

Figure 2: Pairwise Kendall's Tau correlations for all automatic evaluation metrics.
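Correlations of this kind are typically computed as Kendall's tau between metric scores and mean human ratings over the same set of outputs. The sketch below uses scipy on made-up scores to show the calculation.

```python
# Minimal sketch: correlating an automatic metric with human judgments via
# Kendall's tau (scipy). The scores below are made up, not the paper's data.
from scipy.stats import kendalltau

# One entry per system (or per summary): the automatic metric score and the
# mean human rating for the same outputs.
metric_scores = [0.31, 0.28, 0.35, 0.22, 0.40, 0.33]
human_ratings = [3.8, 3.5, 4.1, 2.9, 4.4, 3.6]

tau, p_value = kendalltau(metric_scores, human_ratings)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")
```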

Model Re-evaluation

Through human and automatic evaluations, the study observes that models like Pegasus, BART, and T5 consistently achieve higher scores, suggesting progress in model quality over time. Furthermore, reference summaries in the CNN/DailyMail dataset received relatively low scores, attributed to issues like extraneous information and lack of coherence. These insights highlight areas for improvement in summarization model benchmarking and dataset quality.

Conclusion

The study delivers a comprehensive re-evaluation framework for summarization metrics and models. It advocates for improved evaluation protocols that address the limitations of current methods and calls for future research on metrics that better capture coherence and relevance. The released resources and findings aim to advance summarization evaluation practices and encourage the research community to refine them further.
