
Evaluating the Factual Consistency of Abstractive Text Summarization (1910.12840v1)

Published 28 Oct 2019 in cs.CL

Abstract: Currently used metrics for assessing summarization algorithms do not account for whether summaries are factually consistent with source documents. We propose a weakly-supervised, model-based approach for verifying factual consistency and identifying conflicts between source documents and a generated summary. Training data is generated by applying a series of rule-based transformations to the sentences of source documents. The factual consistency model is then trained jointly for three tasks: 1) identify whether sentences remain factually consistent after transformation, 2) extract a span in the source documents to support the consistency prediction, 3) extract a span in the summary sentence that is inconsistent if one exists. Transferring this model to summaries generated by several state-of-the-art models reveals that this highly scalable approach substantially outperforms previous models, including those trained with strong supervision using standard datasets for natural language inference and fact checking. Additionally, human evaluation shows that the auxiliary span extraction tasks provide useful assistance in the process of verifying factual consistency.


Overview

This paper presents a novel approach to evaluating and improving the factual consistency of summaries generated by state-of-the-art neural models. Arguing that existing summarization evaluation protocols fail to capture factual accuracy, the authors propose a model-based framework that operates in a weakly-supervised setting.

Methodology

The approach leverages a BERT-based architecture to evaluate consistency, integrating mechanisms for explanatory feedback. A key component of this work is the generation of weakly-supervised training data through semantically invariant and variant transformations applied to source documents. These transformations encompass:

  • Paraphrasing: Using neural machine translation for back-translation.
  • Entity and Number Swapping: Substituting entities and numbers with alternatives from the source text.
  • Pronoun Swapping: Random substitution of pronouns.
  • Sentence Negation: Altering auxiliary verbs to change sentence polarity.
  • Noise Injection: Random token duplication or removal.

The generated dataset lets the model learn to identify factual inconsistencies in document-sentence pairs. The authors also develop an explainable variant, FactCCX, which highlights supporting and conflicting text spans, aiding human annotators in verifying consistency.
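As an illustration, two of the rule-based transformations above (sentence negation and entity swapping) can be sketched as follows. The function names and the simplified string heuristics here are assumptions for illustration only, not the authors' implementation, which relies on proper NLP tooling for entity recognition and verb handling:

```python
import random

def negate(sentence):
    """Flip sentence polarity by toggling a common auxiliary verb
    (a simplified stand-in for the paper's sentence-negation rule)."""
    pairs = [(" is ", " is not "), (" was ", " was not "),
             (" are ", " are not "), (" can ", " cannot ")]
    for pos, neg in pairs:
        if neg in sentence:
            return sentence.replace(neg, pos, 1), "INCONSISTENT"
        if pos in sentence:
            return sentence.replace(pos, neg, 1), "INCONSISTENT"
    return sentence, "CONSISTENT"  # no applicable rule: leave unchanged

def swap_entity(sentence, source_entities):
    """Replace the first source entity found in the sentence with a
    different entity drawn from the same source document."""
    for ent in source_entities:
        if ent in sentence:
            alternatives = [e for e in source_entities if e != ent]
            if alternatives:
                swapped = sentence.replace(ent, random.choice(alternatives), 1)
                return swapped, "INCONSISTENT"
    return sentence, "CONSISTENT"

claim = "Paris is the capital of France."
negated, label = negate(claim)
# negated == "Paris is not the capital of France.", label == "INCONSISTENT"
```

Semantically invariant transformations (e.g., back-translation paraphrases) would keep the CONSISTENT label, giving the classifier both positive and negative training examples at scale.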

Results

Evaluation on manually annotated test sets reveals significant performance improvements over models trained on traditional NLI and fact-checking datasets like MNLI and FEVER. FactCC achieves a weighted accuracy of 74.15%, markedly surpassing baselines such as BERT+MNLI and BERT+FEVER.
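Assuming "weighted accuracy" here refers to class-balanced accuracy (the mean of per-class accuracies, which prevents the majority CONSISTENT class from dominating the score on an imbalanced test set), the metric can be computed as:

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class accuracies: each class contributes equally
    to the score regardless of how many examples it has."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Imbalanced toy test set: 8 consistent, 2 inconsistent examples.
y_true = ["CONSISTENT"] * 8 + ["INCONSISTENT"] * 2
y_pred = ["CONSISTENT"] * 8 + ["INCONSISTENT", "CONSISTENT"]
# Plain accuracy is 0.9, but balanced accuracy is (8/8 + 1/2) / 2 = 0.75.
```

The exact weighting used in the paper may differ; this sketch only illustrates why a class-weighted metric is preferable when inconsistent summaries are rare.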

Additionally, FactCCX shows that including the explanatory span-extraction modules sacrifices a small amount of classification performance in exchange for useful traceability of the model's decisions. Human studies indicate that model-generated highlights speed up annotation and improve agreement among annotators.

Implications and Future Directions

This research contributes important methodology and insight for addressing a critical limitation of neural summarization: factual reliability. The weakly-supervised data-generation scheme shows that training data can be tailored to a target domain and to the specific error types a model is expected to make.

However, the paper notes limitations in handling errors that require common-sense reasoning or cross-sentence dependencies, suggesting pathways for future work. Incorporating more advanced data augmentation strategies or commonsense knowledge could help the model handle these more complex errors.

This work underscores the critical need for further research in ensuring factual consistency, potentially driving advances in both model architecture and auxiliary training methodologies. The proposed techniques pave the way for more reliable and trustworthy summarization systems, crucial for their deployment in critical real-world applications.

Authors (4)
  1. Bryan McCann (18 papers)
  2. Caiming Xiong (337 papers)
  3. Richard Socher (115 papers)
  4. Wojciech Kryściński (19 papers)
Citations (677)