Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics

Published 21 Apr 2022 in cs.CL | (2204.10216v1)

Abstract: How reliably an automatic summarization evaluation metric replicates human judgments of summary quality is quantified by system-level correlations. We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice and propose changes to rectify this disconnect. First, we calculate the system score for an automatic metric using the full test set instead of the subset of summaries judged by humans, which is currently standard practice. We demonstrate how this small change leads to more precise estimates of system-level correlations. Second, we propose to calculate correlations only on pairs of systems that are separated by small differences in automatic scores which are commonly observed in practice. This allows us to demonstrate that our best estimate of the correlation of ROUGE to human judgments is near 0 in realistic scenarios. The results from the analyses point to the need to collect more high-quality human judgments and to improve automatic metrics when differences in system scores are small.

Abstract PDF Upgrade to Chat

Authors (3)

Citations (41)

View on Semantic Scholar

Summary

The paper demonstrates that using the full test set drastically lowers variance in metric evaluations by approximately 99%.
The study shows that limiting evaluations to realistic system pairs reveals near-zero ROUGE-1 correlation with human judgment.
The modifications align evaluation methodology with practical applications, urging more granular and accurate metric assessments.

Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics

This paper addresses the inconsistencies within the current frameworks used to evaluate automatic summarization metrics and proposes modifications to align these frameworks more closely with practical use cases. The study scrutinizes two key areas: the evaluation dataset size and the system pair quality used in system-level correlation assessments.

Introduction

Automatic summarization evaluation metrics, such as ROUGE, BERTScore, and QA, are integral in assessing the quality of generated summaries without extensive human annotation. The standard method involves calculating system-level correlations to determine how well these metrics mirror human judgment. However, common practices fail to reflect the way these metrics are applied for system evaluations in research development, leading to potentially unreliable assessments.

Improving System-Level Correlation Estimations

Utilizing the Full Test Set

The paper suggests calculating system scores using the complete test set of summaries instead of only the human-judged subset.

Rationale: Metrics are generally evaluated using broader test data in real-world applications, thus demanding that evaluations mirror this practice for consistency and reliability.
Figure 1: The bootstrapped 95\% confidence intervals for the BERT-Score of each system in the REAL dataset using judged instances in blue and test instances in orange. Evaluating systems with test instances leads to far better estimates of their true scores.

Reducing Variance in Metric Evaluations

By utilizing more comprehensive datasets, the estimated variance of the automatic metric scores declines sharply (approximately 99% on average), enhancing ranking stability and confidence in the evaluation outcomes.

Figure 2: Bootstrapped estimates of the stabilities of the system rankings for automatic metrics and human annotations on Summ (left) and REAL (right). The tau value quantifies how similar two system rankings would be if they were computed with two random sets of M input documents. When all test instances are used, the automatic metrics' rankings become near constant. The error regions represent $\pm 1$ standard deviation.

Evaluating Metrics with Realistic System Pairs

Adjusting the Quality Gap in Evaluations

Current practices offer a skewed view by including all system pairs, even those with large score differentials, in evaluations. This paper proposes measuring correlations only among systems that have minor performance discrepancies, mimicking realistic usage scenarios.

Findings: When the evaluation is limited to such realistic pairs, very popular metrics, such as ROUGE-1, show negligible correlation with human judgment (approximately zero), contradicting prevalent beliefs regarding its efficacy in tighter benchmarks.
Figure 3: The sys\Delta(\ell,u) correlations on the Summ (top) and REAL (bottom) datasets for $\ell = 0$ and various values of $u$ for ROUGE-1, ROUGE-2, and ROUGE-L.

Implications and Recommendations

This study suggests that future evaluations should calculate metrics over entire datasets to lower estimation variance and more accurately reflect the typical gap in system quality when reporting metrics. Although current metrics achieve moderate correlations over larger ranges, they falter when systems perform closely, pointing to a need for improvements in metric sensitivity.

Practical measures should involve obtaining more accurate and consistent human judgments or enhanced methodology for metric evaluations with high granularity in system performance. Moreover, when new metrics are proposed, reporting more granular correlation assessments would provide a better perspective on their reliability.

Conclusion

The proposed evaluation modifications effectively reduce the variance and improve the precision in estimating system-level correlations, aligning evaluations better with practical applications. These changes unveil significant weaknesses within popular metrics like ROUGE, emphasizing the pressing demand for their enhancement and the need for additional high-quality human evaluations. Future work must focus on developing metrics capable of distinguishing subtle differences in system performance to maximize reliability in real-world scenarios.

Markdown Report Issue