An Evaluation of Summarization Metrics and Models
The paper presents a comprehensive study of the limitations of current text summarization evaluation methodology, examining the effectiveness of 14 automatic evaluation metrics against human annotations. It also benchmarks 23 recent summarization models on the CNN/DailyMail dataset.
Key Contributions
- Metric Evaluation: The paper assesses automatic evaluation metrics, such as ROUGE and BERTScore, against expert and crowd-sourced human annotations. The metrics are evaluated by their correlation with human judgments across four dimensions: coherence, consistency, fluency, and relevance (a scoring sketch follows this list).
- Model Benchmarking: It provides a consistent benchmarking of 23 summarization models, covering both extractive and abstractive methods and highlighting recent pretrained models such as T5 and Pegasus (a generation sketch also follows this list).
- Resource Release: The authors release a large collection of model outputs generated on the CNN/DailyMail dataset, supporting reproducibility and large-scale comparison of model performance.
- Evaluation Toolkit: An extensible evaluation toolkit is introduced, offering a unified API over a wide range of metrics so that researchers can assess summarization models consistently and efficiently.
- Human Judgments: The paper compiles a diverse set of human judgments of model-generated summaries, providing a basis for developing metrics that correlate better with human assessments.
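The following is a minimal sketch of the kind of metric computation the paper studies: scoring a single model summary against its reference with ROUGE and BERTScore. It uses the standalone `rouge_score` and `bert_score` packages rather than the SummEval toolkit itself, and the example texts are invented placeholders.

```python
# Score one candidate summary against its reference with ROUGE and BERTScore.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The committee approved the new budget on Tuesday."   # placeholder
candidate = "On Tuesday the committee passed the new budget."     # placeholder

# ROUGE-1/2/L with stemming, a common configuration on CNN/DailyMail.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
for name, result in rouge.items():
    print(f"{name}: F1={result.fmeasure:.3f}")

# BERTScore compares candidate and reference in contextual embedding space.
P, R, F1 = bert_score([candidate], [reference], lang="en")
print(f"BERTScore F1={F1.item():.3f}")
```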
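A companion sketch of generating an abstractive summary with one of the pretrained models the benchmark highlights, here a BART checkpoint fine-tuned on CNN/DailyMail via the Hugging Face `pipeline` API; the input article is a placeholder.

```python
# Generate an abstractive summary with a pretrained CNN/DailyMail model.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "The city council voted on Monday to expand the bike-lane network, "
    "citing a sharp rise in commuter cycling over the past two years. "
    "Construction is expected to begin next spring."
)

summary = summarizer(article, max_length=60, min_length=15, do_sample=False)
print(summary[0]["summary_text"])
```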
Insights and Results
- Human vs. Automatic: Expert annotations were found to provide more reliable judgments than crowd-sourced annotations, which tended to be uniform across evaluation dimensions.
- Metric Correlations: Higher-order ROUGE variants demonstrated moderate to strong correlations with human judgments, particularly for fluency and consistency, a level few other metrics matched (see the correlation sketch after this list).
- Model Performance: Pretrained models, such as Pegasus and BART, emerged as top performers, highlighting significant improvements in summarization quality with recent architectures.
- Reference Summary Limitations: The paper identifies limitations within the CNN/DailyMail reference summaries, noting inconsistencies and formatting issues that could affect model evaluations.
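A minimal sketch of the correlation analysis behind these findings: Kendall's tau between an automatic metric's scores and expert ratings on one quality dimension. The score lists are invented placeholders; in the paper, correlations are computed over the annotated system outputs.

```python
# Correlate automatic metric scores with expert ratings via Kendall's tau.
from scipy.stats import kendalltau

# Hypothetical per-summary scores from an automatic metric (e.g. ROUGE-2 F1).
metric_scores = [0.21, 0.35, 0.18, 0.42, 0.30]
# Hypothetical expert ratings for the same summaries on one dimension
# (coherence, consistency, fluency, or relevance), on a 1-5 scale.
expert_ratings = [2.7, 4.0, 2.3, 4.5, 3.6]

tau, p_value = kendalltau(metric_scores, expert_ratings)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f})")
```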
Implications and Future Directions
The findings underscore the importance of developing more nuanced and reliable evaluation metrics that better align with human perception of summary quality. The paper suggests that automatic metrics particularly need improvement in assessing coherence and relevance. Furthermore, the SummEval toolkit and dataset are intended to facilitate ongoing research into evaluation techniques and the improvement of summarization models.
Conclusion
This work offers valuable resources and insights to propel the field of text summarization forward. By highlighting discrepancies in current evaluation practices and providing robust tools for model assessment, the paper lays the groundwork for future advancements in generating and evaluating high-quality text summaries.