QuestEval: Advancements in Summarization Evaluation
The paper QuestEval: Summarization Asks for Fact-based Evaluation introduces a framework designed to address the limitations of existing automatic evaluation metrics for text summarization, with the primary goal of improving their correlation with human judgments. QuestEval is proposed as a reference-free metric that leverages question generation (QG) and question answering (QA) models to assess the consistency, coherence, fluency, and relevance of summaries, offering a more robust alternative to traditional metrics such as ROUGE and BERTScore.
Context and Motivation
Automatic summarization evaluation remains a challenging task in natural language generation (NLG). Existing metrics such as ROUGE rely heavily on n-gram overlap with a single reference summary, which often leads to low correlation with human judgments. The advent of neural QA models has opened new avenues for evaluation, yet earlier QA-based metrics struggled to consistently outperform ROUGE. The paper positions QuestEval as a breakthrough: it extends previous QG- and QA-based techniques, eliminates the need for reference summaries, and improves alignment with human evaluations across multiple dimensions.
Technical Contributions
The main contributions of QuestEval are summarized as follows:
- Unified Framework: QuestEval integrates precision- and recall-oriented QA metrics into a single framework that evaluates both factual consistency and relevance without requiring human-written reference summaries (see the sketch after this list).
- Question Weighting and Saliency Estimation: QuestEval incorporates a method for learning the saliency of generated questions, enhancing its ability to select crucial information from the source document. This is achieved via a query weighter module, trained to identify important questions using training data derived from existing summarization datasets (also sketched below).
- Empirical Evaluation: The paper validates QuestEval on annotated summaries from the CNN/Daily Mail and XSUM datasets, demonstrating state-of-the-art correlations with human judgments across all evaluation dimensions, with particularly strong results for factual consistency.
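To make the combination of the two directions concrete, the following is a minimal sketch of how a QuestEval-like score could be assembled. It is not the authors' released implementation: the `qg`, `qa`, and `weighter` callables are hypothetical placeholders for a question-generation model, a QA model, and the learned query weighter, and the harmonic-mean aggregation is an assumption about how the precision and recall directions are combined.

```python
from collections import Counter
from typing import Callable, List, Tuple

QAPairs = List[Tuple[str, str]]  # (question, expected_answer)


def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between a predicted and an expected answer (SQuAD-style)."""
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    if not pred_toks or not gold_toks:
        return float(pred_toks == gold_toks)
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * p * r / (p + r)


def precision_score(summary: str, source: str,
                    qg: Callable[[str], QAPairs],
                    qa: Callable[[str, str], str]) -> float:
    """Consistency direction: questions generated from the summary
    should be answerable from the source document."""
    qa_pairs = qg(summary)
    if not qa_pairs:
        return 0.0
    scores = [token_f1(qa(source, q), a) for q, a in qa_pairs]
    return sum(scores) / len(scores)


def recall_score(summary: str, source: str,
                 qg: Callable[[str], QAPairs],
                 qa: Callable[[str, str], str],
                 weighter: Callable[[str, str], float]) -> float:
    """Relevance direction: salient questions generated from the source
    should be answerable from the summary, weighted by importance."""
    qa_pairs = qg(source)
    if not qa_pairs:
        return 0.0
    weights = [weighter(q, source) for q, _ in qa_pairs]
    if sum(weights) == 0:
        return 0.0
    weighted = sum(w * token_f1(qa(summary, q), a)
                   for w, (q, a) in zip(weights, qa_pairs))
    return weighted / sum(weights)


def questeval_like_score(summary: str, source: str, qg, qa, weighter) -> float:
    """Combine the two directions; a harmonic mean is one plausible choice."""
    p = precision_score(summary, source, qg, qa)
    r = recall_score(summary, source, qg, qa, weighter)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)
```

In practice, the paper builds these components from pretrained QG and QA models; the sketch only captures the control flow of the metric.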
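The query weighter itself can be bootstrapped from existing summarization datasets. The sketch below illustrates one plausible way to derive its training labels, assuming, as a simplification of the paper's procedure, that a source-side question is salient when its answer also appears in the human reference summary; the `qg` callable and the string-containment check are illustrative assumptions.

```python
from typing import Callable, List, Tuple


def build_weighter_training_data(
    documents: List[str],
    reference_summaries: List[str],
    qg: Callable[[str], List[Tuple[str, str]]],
) -> List[Tuple[str, str, int]]:
    """Derive (question, document, label) triples for training a query weighter.

    A source-side question is labeled salient (1) when its expected answer
    also appears in the human reference summary, i.e. human annotators kept
    that information. The `qg` callable is a hypothetical placeholder
    returning (question, expected_answer) pairs.
    """
    examples = []
    for doc, ref in zip(documents, reference_summaries):
        ref_lower = ref.lower()
        for question, expected in qg(doc):
            label = int(expected.lower() in ref_lower)
            examples.append((question, doc, label))
    return examples
```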
Experimental Findings
The experiments offer compelling evidence of QuestEval's superiority over existing metrics. On the SummEval dataset, QuestEval exhibits significantly higher correlation with human judgments than ROUGE, METEOR, BLEU, and BERTScore, achieving average correlations of up to 33.5 in the single-reference setting. Because it does not require references, QuestEval also remains stable when no reference summaries are available, which makes it a versatile tool across different summarization datasets.
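For readers who want to reproduce this kind of comparison, the snippet below shows how summary-level correlation between a metric and human ratings is typically computed. The use of Pearson correlation and the example numbers are illustrative assumptions, not values or the exact protocol from the paper.

```python
from scipy.stats import pearsonr


def metric_human_correlation(metric_scores, human_ratings):
    """Summary-level correlation between automatic metric scores and
    human judgments (e.g., consistency ratings from SummEval)."""
    r, p_value = pearsonr(metric_scores, human_ratings)
    return r, p_value


# Illustrative usage with made-up numbers (not taken from the paper):
metric_scores = [0.41, 0.37, 0.52, 0.30, 0.48]
human_ratings = [3.5, 3.0, 4.5, 2.5, 4.0]
print(metric_human_correlation(metric_scores, human_ratings))
```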
Implications and Future Work
QuestEval’s introduction marks a substantial development in summarization evaluation, aligning more closely with the multifaceted nature of human assessment, which encompasses consistency, coherence, fluency, and relevance. On a practical level, QuestEval enables the comparison of summarization systems across different corpora without depending on exhaustive gold-standard references.
From a theoretical perspective, the integration of QA models into evaluation frameworks sets a precedent, advocating for a more nuanced understanding of generation tasks through the lens of factual consistency and information importance. Future work could explore adapting QuestEval to other NLG tasks such as machine translation and text simplification, which suffer from similar limitations in existing evaluation metrics. Additionally, QuestEval could be extended to multilingual settings, examining its applicability beyond English.
In conclusion, QuestEval emerges as a potent tool in the domain of summarization evaluation, advocating for an evidence-based, reference-free approach that shows promise in advancing the accuracy and reliability of automatic summary evaluations.