On Faithfulness and Factuality in Abstractive Summarization: An Expert Overview
This paper investigates a critical challenge facing neural abstractive document summarization models: ensuring that the summaries they generate are faithful to, and factually consistent with, the source documents. The authors conduct a meticulous, comprehensive analysis of several state-of-the-art abstractive summarization systems, including RNN-, CNN-, and Transformer-based models.
Introduction and Significance of the Study
Abstractive document summarization aims to produce concise summaries that preserve the core information of the source documents. The research identifies a prevalent issue: models often generate hallucinated content that is either unsupported by or factually inconsistent with the input document. The problem is particularly severe for single-sentence summaries, as highlighted by the extreme summarization task of the XSum dataset.
Methodology and Evaluation Metrics
The paper combines the following methodologies and system evaluations:
- Human Evaluation of Hallucinations: The researchers conducted a large-scale human evaluation, annotating summaries generated by several abstractive models to identify hallucinated content. Annotators classified hallucinations as either intrinsic (misrepresenting information present in the document) or extrinsic (introducing information not present in the document).
- Automatic Metrics: Model outputs were assessed with traditional metrics such as ROUGE and BERTScore, complemented by semantic inference metrics based on textual entailment and a question-answering (QA) framework; a sketch of the entailment-based scoring idea follows this list.
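As a rough illustration of the entailment-based scoring idea, the sketch below scores a document/summary pair with an off-the-shelf NLI model. The checkpoint name (roberta-large-mnli) and the example texts are illustrative assumptions, not necessarily the setup used in the paper, which trained its own entailment classifier.

```python
# A minimal sketch of entailment-based faithfulness scoring, assuming an
# off-the-shelf MNLI checkpoint ("roberta-large-mnli" is an illustrative
# choice). The document is treated as the premise and the summary as the
# hypothesis; the entailment probability serves as a faithfulness proxy.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def entailment_score(document: str, summary: str) -> float:
    """Return the model's probability that the document entails the summary."""
    result = nli({"text": document, "text_pair": summary}, top_k=None)
    if result and isinstance(result[0], list):  # some versions nest the output
        result = result[0]
    for label in result:
        if label["label"].upper() == "ENTAILMENT":
            return label["score"]
    return 0.0

# Hypothetical example pair (not drawn from the XSum dataset).
doc = "The city council approved the new transport budget on Tuesday."
summ = "The council approved a new transport budget."
print(f"Entailment probability: {entailment_score(doc, summ):.2f}")
```

Long documents exceed the model's input window, so in practice the premise is typically truncated or scored sentence by sentence before aggregating.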
Key Findings
The human evaluations and analyses yielded several findings:
- Prevalence of Hallucinations: Hallucinations were found in over 70% of single-sentence summaries across models. Notably, while extrinsic hallucinations were the more common type, intrinsic hallucinations also constituted a significant portion, revealing the models' tendency to misrepresent source information due to poor document-level inference.
- Impact of Pretraining: Models initialized with pretrained parameters, such as BertS2S, performed better on both automatic metrics and human evaluations. Pretraining allowed these models to integrate background knowledge more effectively, reducing the incidence of erroneous hallucinations.
- Correlation with Human Judgment: Traditional metrics like ROUGE and BERTScore showed weak correlation with human judgments of faithfulness and factuality. In contrast, textual entailment measures displayed moderate-to-strong correlations, indicating their potential for more accurately evaluating and guiding the development of summarization systems (see the correlation sketch after this list).
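To make the correlation comparison concrete, the sketch below computes Spearman's rank correlation between human faithfulness judgments and two automatic scores. All numbers are invented placeholders, not figures from the paper.

```python
# A minimal sketch of metric-to-human correlation analysis using Spearman's
# rank correlation. The per-summary scores below are made-up placeholders.
from scipy.stats import spearmanr

# Hypothetical scores for the same eight system summaries.
human_faithful  = [1, 0, 1, 1, 0, 0, 1, 0]                      # 1 = judged faithful
rouge_l         = [0.42, 0.38, 0.45, 0.30, 0.41, 0.35, 0.44, 0.39]
entailment_prob = [0.91, 0.12, 0.84, 0.77, 0.25, 0.08, 0.88, 0.33]

for name, scores in [("ROUGE-L", rouge_l), ("entailment", entailment_prob)]:
    rho, p_value = spearmanr(human_faithful, scores)
    print(f"{name}: Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```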
Implications and Future Directions
This research underscores the importance of faithful and factual generation in abstractive summarization models, especially for applications where accuracy and reliability are paramount. The findings suggest several avenues for improving the design and evaluation of these systems:
- Enhanced Model Objectives: The development of training objectives and decoding criteria that explicitly reward faithfulness and factuality, potentially incorporating textual entailment or semantic inference.
- Evaluation Frameworks: Adoption of more sophisticated automatic evaluation frameworks, such as entailment-based and QA-based metrics, which align better with human judgments and can aid model selection and tuning; a QA-based consistency sketch follows this list.
- Focus on External Knowledge Integration: Continued emphasis on models that integrate background knowledge with source content so as to minimize the generation of hallucinated information. Pretraining on large corpora remains a valuable strategy in this respect.
- Dataset Considerations: Addressing dataset-specific artifacts, such as divergence between reference summaries and their source documents, through improved data curation and pre-processing so that the hallucination problem is mitigated at its source.
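As an illustration of the QA-based evaluation idea mentioned above, the sketch below answers the same questions against the source document and the generated summary, then compares the answers with token-level F1. The checkpoint name and the assumption that questions are supplied by hand (rather than by a question-generation model, as in a full QA-based pipeline) are simplifications, not details from the paper.

```python
# A minimal sketch of QA-based consistency checking: answer the same questions
# from the summary and from the source document, then compare the answers.
# The SQuAD-tuned checkpoint is an illustrative choice; questions are assumed
# to be provided (a full pipeline would generate them automatically).
from collections import Counter
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between two answer strings."""
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def qa_consistency(document: str, summary: str, questions: list[str]) -> float:
    """Average agreement between answers drawn from the summary and the document."""
    scores = []
    for q in questions:
        answer_from_doc = qa(question=q, context=document)["answer"]
        answer_from_sum = qa(question=q, context=summary)["answer"]
        scores.append(token_f1(answer_from_sum, answer_from_doc))
    return sum(scores) / len(scores)
```

A low score flags summaries whose answers diverge from the source, which is the behaviour an extrinsic hallucination typically produces.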
Conclusion
This paper provides a rigorous analysis of the limitations of neural abstractive summarization models with respect to faithfulness and factuality, and of how they might be improved. It highlights the critical need to move beyond traditional evaluation methods toward robust, semantically aware metrics that align closely with human judgments. Doing so can foster the generation of more reliable and accurate summaries, enhancing the practical utility of summarization systems across domains.