- The paper re-assesses the effectiveness of automated evaluation metrics by benchmarking them against human judgments.
- It demonstrates that metrics like ROUGE-2 perform inconsistently across extractive and abstractive systems on different datasets.
- The study advocates for adaptive evaluation frameworks and shared tasks to better align metrics with evolving neural summarization models.
Re-evaluating Evaluation in Text Summarization
The paper "Re-evaluating Evaluation in Text Summarization" by Bhandari et al. addresses the longstanding reliance on ROUGE as a standard evaluation metric for text summarization and explores the effectiveness of various automated metrics against human judgments. The authors recognize a gap in the field where evaluation methods have not advanced in tandem with the significant development of text summarization models. This paper aims to reassess these evaluation metrics using outputs from state-of-the-art neural models on recent datasets, specifically challenging the conclusions drawn from older datasets.
In their meta-evaluation, the authors compare multiple summarization systems and datasets, covering both extractive and abstractive approaches. They release a dataset of human judgments collected for the outputs of 25 top-scoring neural summarization systems on the CNN/DailyMail dataset, and they measure how well automated metrics align with these human evaluations at both the system level and the summary level.
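To make the system-level vs. summary-level distinction concrete, the sketch below computes both kinds of correlation for one automated metric against human judgments. It is a minimal illustration of the general procedure, not the authors' released tooling: the nested-dictionary data layout and the choice of Kendall's tau are assumptions made here for clarity.

```python
# A minimal sketch, assuming scores[system][doc_id] -> float for both the
# automated metric and the human judgments; illustrative only.
import numpy as np
from scipy.stats import kendalltau

def system_level_corr(metric_scores, human_scores):
    """Correlate per-system metric averages with per-system human averages."""
    systems = sorted(metric_scores)
    m = [np.mean(list(metric_scores[s].values())) for s in systems]
    h = [np.mean(list(human_scores[s].values())) for s in systems]
    tau, _ = kendalltau(m, h)
    return tau

def summary_level_corr(metric_scores, human_scores):
    """Correlate metric and human scores across systems for each document,
    then average the per-document correlations."""
    systems = sorted(metric_scores)
    doc_ids = list(metric_scores[systems[0]])
    taus = []
    for d in doc_ids:
        m = [metric_scores[s][d] for s in systems]
        h = [human_scores[s][d] for s in systems]
        tau, _ = kendalltau(m, h)
        taus.append(tau)
    return float(np.nanmean(taus))  # documents with all-tied scores yield NaN
```

System-level correlation asks whether a metric ranks whole systems the way humans do; summary-level correlation asks whether it ranks the competing summaries of each individual document the way humans do.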
Key Findings
- Discrepancy in Metric Performance: Metrics such as MoverScore and Jensen-Shannon divergence over bigrams (JS-2), which performed well on older datasets like TAC, behaved quite differently on recent data. ROUGE-2 was the strongest metric for abstractive systems on CNN/DailyMail, yet it underperformed other metrics on the older datasets (a sketch of the JS-2 idea appears after this list).
- Evaluation of Top-k Systems: When the comparison is restricted to the top-performing systems, most metrics show markedly lower correlation with human judgments. Certain metrics, notably ROUGE-2, nonetheless remained comparatively reliable for ranking top abstractive systems on recent datasets.
- Pairwise System Comparison: ROUGE metrics, particularly ROUGE-2, along with JS-2, were more reliable at deciding which of two systems is better, though their effectiveness still varied across datasets (see the pairwise-agreement sketch after this list).
- Summary-level Evaluation: Summary-level correlations sometimes exceeded system-level correlations, challenging the conclusion from earlier studies that metrics are less reliable at the summary level.
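Since JS-2 is less familiar than ROUGE, here is a minimal sketch of the idea: score a summary by the (negated) Jensen-Shannon divergence between its bigram distribution and the reference's. The whitespace tokenization and absence of smoothing are simplifying assumptions, not the exact implementation evaluated in the paper.

```python
# Sketch of a JS-2-style score: Jensen-Shannon divergence between the bigram
# distributions of a system summary and a reference. Tokenization and
# normalization choices here are assumptions for illustration.
from collections import Counter
import math

def bigram_dist(text):
    tokens = text.lower().split()
    counts = Counter(zip(tokens, tokens[1:]))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {bg: c / total for bg, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two sparse distributions."""
    support = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in support}
    def kl(a):
        return sum(a[x] * math.log2(a[x] / m[x]) for x in a if a[x] > 0)
    return 0.5 * kl(p) + 0.5 * kl(q)

def js2_score(summary, reference):
    # Lower divergence means closer to the reference; negate so higher is better.
    return -js_divergence(bigram_dist(summary), bigram_dist(reference))
```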
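For the pairwise-comparison finding, one simple way to quantify reliability is the fraction of system pairs on which a metric and the human judgments agree about which system is better. The sketch below assumes per-system average scores and is purely illustrative.

```python
# Sketch of pairwise system comparison: over all pairs of systems, count how
# often the metric's ordering matches the human ordering. The per-system
# average-score input format is an assumption for illustration.
from itertools import combinations

def pairwise_agreement(metric_avg, human_avg):
    """metric_avg, human_avg: dicts mapping system name -> average score."""
    agree, total = 0, 0
    for a, b in combinations(sorted(metric_avg), 2):
        metric_diff = metric_avg[a] - metric_avg[b]
        human_diff = human_avg[a] - human_avg[b]
        if metric_diff == 0 or human_diff == 0:
            continue  # skip exact ties in this simple sketch
        total += 1
        if (metric_diff > 0) == (human_diff > 0):
            agree += 1
    return agree / total if total else float("nan")
```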
Implications and Future Directions
The paper highlights the need to choose metrics that fit the dataset and system type at hand, since a mismatched metric can produce misleading evaluation outcomes. It also underscores the need to keep evaluation benchmarks updated and diverse as summarization models evolve. The authors propose a shared task, akin to the WMT Metrics Task, so that metrics and systems can co-evolve, and call for meta-evaluations to span multiple datasets reflecting different summarization challenges, making metrics more robust and more broadly applicable.
Overall, the work by Bhandari et al. prompts a reconsideration of how text summarization systems are evaluated, advocating an evaluation framework that keeps pace with advances in neural models and aligns more closely with human judgment. It sets a precedent for more nuanced and comprehensive evaluation methodologies, encouraging future research on contextually adaptive and empirically validated evaluation strategies.