Evaluating Factual Consistency in Summarization Using QAGS
The paper "Asking and Answering Questions to Evaluate the Factual Consistency of Summaries" presents a novel approach to assessing the factual correctness of summaries generated by contemporary abstractive summarization models. The researchers introduce QAGS (Question Answering and Generation for Summarization), an automatic evaluation protocol that leverages advances in question answering (QA) and question generation (QG) to detect discrepancies between a summary and its source text.
Introduction to the Problem and Proposed Solution
The primary impediment to deploying abstractive summarization systems in practice is their tendency to produce factually inconsistent summaries. Traditional metrics such as ROUGE and BLEU measure n-gram overlap and correlate poorly with factual consistency, so reliable assessment still falls to human evaluators, a process that carries significant time and financial costs.
The QAGS framework improves on existing methods by automatically identifying factual inconsistencies without requiring reference summaries. The underlying rationale is straightforward: if a summary is factually consistent with its source, then asking the same questions of both texts should yield similar answers. The method therefore proceeds in three steps: generate questions from the summary, answer those questions against both the summary and the source, and compare the two sets of answers with a similarity metric.
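To make this concrete, the sketch below outlines the three-step protocol in Python. The QA model name (`deepset/roberta-base-squad2`), the placeholder question generator, and the token-level F1 comparison are illustrative assumptions for this sketch, not the exact components trained in the paper.

```python
# Minimal sketch of the QAGS three-step protocol, under the assumptions above.
from collections import Counter

from transformers import pipeline

# Step 2 component: an off-the-shelf extractive QA model (illustrative choice).
qa = pipeline("question-answering", model="deepset/roberta-base-squad2")


def generate_questions(summary: str, num_questions: int = 5) -> list[str]:
    """Step 1: generate questions conditioned on the summary.

    Placeholder: plug in any summary-conditioned question-generation model here.
    """
    raise NotImplementedError("swap in a question-generation model")


def token_f1(pred: str, gold: str) -> float:
    """SQuAD-style token-level F1 between two answer strings (assumed similarity metric)."""
    pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def qags_score(source: str, summary: str) -> float:
    """Steps 2-3: answer each question on both texts and average the answer similarity."""
    questions = generate_questions(summary)
    scores = []
    for q in questions:
        ans_summary = qa(question=q, context=summary)["answer"]
        ans_source = qa(question=q, context=source)["answer"]
        scores.append(token_f1(ans_summary, ans_source))
    return sum(scores) / len(scores) if scores else 0.0
```

A higher score indicates that the source text supports the same answers the summary gives, which is the intuition behind treating answer agreement as a proxy for factual consistency.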
Experimental Validation and Results
The authors validate QAGS against human judgments of factual consistency on summaries generated from the CNN/DailyMail and XSUM datasets. QAGS correlates far better with these human judgments than standard automatic metrics, reaching a Pearson correlation of 54.52 on CNN/DailyMail versus 17.72 for ROUGE-2, and it also performs strongly on XSUM. This gap underscores QAGS's ability to detect factual errors that n-gram-based metrics miss.
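For reference, correlating a metric's scores with human judgments, as in this evaluation protocol, is straightforward to reproduce; the sketch below uses scipy with made-up placeholder numbers rather than the paper's data.

```python
# Hypothetical illustration of metric-vs-human correlation; the values are placeholders.
from scipy.stats import pearsonr

human_judgments = [0.2, 0.9, 0.6, 1.0, 0.4]     # averaged annotator consistency labels per summary
metric_scores   = [0.25, 0.80, 0.55, 0.95, 0.30]  # e.g. QAGS scores for the same summaries

r, p_value = pearsonr(metric_scores, human_judgments)
print(f"Pearson r = {r:.2f} (p = {p_value:.3f})")
```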
Robustness and Implications
Ablation studies demonstrate QAGS's robustness, showing that it is resilient to variations in QG and QA model quality as well as to domain mismatch between the data those models were trained on and the summarization data being evaluated. Across different underlying models, QAGS maintains superior correlations with human judgments, suggesting it is not overly sensitive to specific model instantiations or hyperparameter settings.
The paper's insights extend beyond the proposed summarization metric to automated evaluation more broadly. By checking whether a summary's salient content is actually supported by the source, QAGS targets factuality directly, independent of sentence structure and surface wording, rather than rewarding lexical overlap, which marks a significant step forward in measurement accuracy for NLP tasks.
Implications for Future Research
Practically, QAGS offers a feasible path toward more reliable summarization systems in domains where factual accuracy is non-negotiable. Theoretically, it opens avenues for leveraging QA capabilities in other language tasks, such as machine translation or content synthesis. Future work might explore its adaptation to additional modalities or its applicability in low-resource settings where domain-specific QA datasets are unavailable.
Conclusion
QAGS represents a significant methodological advance in evaluating the factual consistency of generated text, improving upon traditional metrics and offering new directions for both research and application in conditional text generation. Its introduction is timely amid growing reliance on AI-generated content, and it stands as a promising tool for ensuring the factual integrity of machine-generated outputs. The approach invites further exploration of its broader applications and refinement for diverse text generation tasks.