Evaluating the Factual Consistency of Abstractive Text Summarization
Overview
This paper presents an approach to evaluating and improving the factual consistency of summaries generated by state-of-the-art neural abstractive models. The authors argue that existing summarization evaluation protocols fail to measure factual accuracy and propose a model-based verification framework trained in a weakly-supervised setting.
Methodology
The approach leverages a BERT-based architecture to evaluate consistency, with additional mechanisms for explanatory feedback. A key component of the work is the generation of weakly-supervised training data by applying semantically invariant and semantically variant transformations to source documents. These transformations include (two of them are sketched in code after the list):
- Paraphrasing: Using neural machine translation for back-translation.
- Entity and Number Swapping: Substituting entities and numbers with alternatives from the source text.
- Pronoun Swapping: Random substitution of pronouns.
- Sentence Negation: Altering auxiliary verbs to change sentence polarity.
- Noise Injection: Random token duplication or removal.
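To make the augmentation concrete, here is a minimal sketch of two of these transformations: pronoun swapping as a semantically variant (label-flipping) operation and noise injection as a semantically invariant one. The pronoun groups and probabilities are simplified illustrations, not the paper's exact rules.

```python
import random

# Pronoun groups used for swapping; a pronoun is replaced by another
# pronoun from the same group, so the sentence stays grammatical but
# (usually) becomes factually inconsistent with the source.
PRONOUN_GROUPS = [
    {"he", "she", "they"},
    {"his", "her", "their"},
    {"him", "her", "them"},
]

def pronoun_swap(tokens, rng=random):
    """Semantically variant transformation: swap one pronoun for another."""
    candidates = [i for i, t in enumerate(tokens)
                  if any(t.lower() in g for g in PRONOUN_GROUPS)]
    if not candidates:
        return tokens, False  # no pronoun -> no negative example produced
    i = rng.choice(candidates)
    group = next(g for g in PRONOUN_GROUPS if tokens[i].lower() in g)
    replacement = rng.choice(sorted(group - {tokens[i].lower()}))
    swapped = list(tokens)
    swapped[i] = replacement
    return swapped, True

def noise_injection(tokens, p=0.1, rng=random):
    """Semantically invariant noise: randomly duplicate or drop tokens."""
    noisy = []
    for t in tokens:
        r = rng.random()
        if r < p / 2:
            continue            # drop this token
        noisy.append(t)
        if r > 1 - p / 2:
            noisy.append(t)     # duplicate this token
    return noisy

claim = "She said the plan would cost $5 million".split()
print(pronoun_swap(claim))
print(noise_injection(claim))
```

Label-flipping transformations yield INCONSISTENT training pairs, while label-preserving ones (paraphrasing, noise injection) yield CONSISTENT pairs that are nonetheless lexically distant from the source.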
The generated dataset enables the model to learn to identify factual inconsistencies between a source document and a candidate sentence. The authors also develop an explainable model variant (FactCCX), which highlights supporting and conflicting text spans, aiding human annotators in verifying consistency.
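At inference time the verifier is a standard BERT sentence-pair classifier over (document, claim) inputs. The sketch below uses the Hugging Face transformers API; the checkpoint path and the label index assigned to the CONSISTENT class are assumptions, since the released artifacts may differ.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Hypothetical checkpoint path; substitute a FactCC model fine-tuned
# for binary CONSISTENT/INCONSISTENT classification.
CHECKPOINT = "path/to/factcc-checkpoint"

tokenizer = BertTokenizer.from_pretrained(CHECKPOINT)
model = BertForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
model.eval()

def consistency_score(document: str, claim: str) -> float:
    # The (document, claim) pair is encoded as a single sequence,
    # [CLS] document [SEP] claim [SEP], as in standard BERT pair tasks.
    inputs = tokenizer(document, claim, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1)
    return probs[0, 0].item()  # assumption: index 0 = CONSISTENT

doc = "The company reported revenue of $3.2 billion in Q4."
print(consistency_score(doc, "Revenue reached $3.2 billion in the fourth quarter."))
print(consistency_score(doc, "Revenue fell to $1.2 billion in the fourth quarter."))
```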
Results
Evaluation on manually annotated test sets shows significant improvements over models trained on existing NLI and fact-checking datasets such as MNLI and FEVER. FactCC achieves a weighted accuracy of 74.15%, markedly surpassing baselines such as BERT+MNLI and BERT+FEVER.
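Assuming that "weighted accuracy" here denotes class-balanced accuracy, i.e. the mean of per-class recall, which corrects for the skew toward consistent examples in the test data, the metric can be computed as in this illustrative snippet:

```python
from sklearn.metrics import balanced_accuracy_score

# Illustrative gold labels and model predictions (1 = CONSISTENT,
# 0 = INCONSISTENT); real test sets are heavily skewed toward 1.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 1]

# Balanced accuracy = mean of per-class recall, so the rare
# INCONSISTENT class counts as much as the common CONSISTENT class.
print(balanced_accuracy_score(y_true, y_pred))  # (5/6 + 1/2) / 2 ~= 0.667
```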
Additionally, FactCCX shows that adding the explanatory module sacrifices a small amount of classification performance while making the model's decision process traceable. Human studies indicate that model-generated highlights speed up annotation and improve inter-annotator agreement.
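One plausible way to realize such highlighting is a span-selection head over the encoder's token representations, in the style of extractive question answering: predict start and end positions of the supporting or conflicting span. The sketch below illustrates this general mechanism; it does not reproduce the exact head layout of FactCCX.

```python
import torch
import torch.nn as nn

class SpanHighlightHead(nn.Module):
    """Predicts a start and an end position over encoded tokens,
    marking a span (e.g., supporting evidence in the source document)."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        # One logit per token for the span start, one for the span end.
        self.span_logits = nn.Linear(hidden_size, 2)

    def forward(self, hidden_states: torch.Tensor):
        # hidden_states: (batch, seq_len, hidden_size) from the BERT encoder
        logits = self.span_logits(hidden_states)          # (batch, seq_len, 2)
        start_logits, end_logits = logits.unbind(dim=-1)  # (batch, seq_len) each
        return start_logits, end_logits

# Toy usage with random stand-ins for encoder outputs.
head = SpanHighlightHead()
enc = torch.randn(1, 16, 768)
start, end = head(enc)
print("highlighted span:", (start.argmax(-1).item(), end.argmax(-1).item()))
```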
Implications and Future Directions
This research contributes important methodologies and insights for addressing a critical limitation of neural summarization: factual reliability. The weakly-supervised data-generation approach shows how training data can be tailored to the target domain and to the specific error types a verification model must detect.
However, the paper notes limitations: the model struggles with errors that require common-sense reasoning or that span multiple sentences, suggesting pathways for future work. Incorporating richer data augmentation strategies or commonsense knowledge could help handle these more complex errors.
This work underscores the need for further research on ensuring factual consistency, which could drive advances in both model architectures and auxiliary training methods. The proposed techniques pave the way for more reliable and trustworthy summarization systems, a prerequisite for deployment in high-stakes real-world applications.