Evaluating the Factual Consistency of LLMs Through News Summarization
The paper "Evaluating the Factual Consistency of LLMs Through News Summarization" provides an exploration of the propensity of LLMs to maintain factual consistency when generating news summaries. This research is particularly relevant in the context of natural language generation (NLG), where LLMs, while advanced in many respects, are known to exhibit hallucinatory behavior—that is, generating information not present in the source material.
To address this, the authors present the Factual Inconsistency Benchmark (FIB), a new benchmark that measures whether LLMs prefer factually consistent summaries over factually inconsistent ones. For each document, the benchmark compares the score an LLM assigns to a human-verified factually consistent summary against the score it assigns to a factually inconsistent summary produced by a summarization model; a sketch of this comparison follows.
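A minimal sketch of this pairwise comparison protocol, assuming a generic `score(document, summary)` function and a simple triple layout (the names and data format are illustrative, not the paper's code): FIB reports the fraction of pairs on which the model scores the consistent summary higher.

```python
from typing import Callable, Iterable, Tuple


def fib_accuracy(
    examples: Iterable[Tuple[str, str, str]],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of (document, consistent_summary, inconsistent_summary)
    triples for which the model scores the consistent summary higher."""
    wins = total = 0
    for document, consistent, inconsistent in examples:
        wins += score(document, consistent) > score(document, inconsistent)
        total += 1
    return wins / total if total else 0.0
```

Any scoring function can be plugged in here; the paper's findings depend heavily on that choice, as discussed at the end of this summary.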
The authors conducted a comprehensive evaluation of 23 LLMs, ranging from 1 billion to 176 billion parameters and drawn from six model families including BLOOM and OPT. The findings indicate that LLMs generally favor factually consistent summaries, assigning them higher scores than factually inconsistent ones. A notable exception emerged, however: when the factually inconsistent summaries consist of text copied verbatim from the input document, LLMs tend to prefer them over the consistent summaries.
The methodology for creating factually inconsistent summaries is noteworthy: candidate summaries are generated by 22 summarization models and then manually annotated for factual consistency. FIB is built on summaries from the XSum and CNN/DM datasets, providing testbeds for abstractive and extractive summarization, respectively.
The research offers several key insights:
- Factual Consistency Preference: LLMs generally prefer factually consistent summaries; BLOOM, for instance, does so 72.4% of the time.
- Verbatim Pitfalls: Despite this general preference, LLMs rarely favor consistent summaries over inconsistent ones when the latter are copied verbatim from the input document; BLOOM prefers the consistent summary in only 9.6% of these cases.
- Scale and Consistency: Factual consistency tends to improve as model size increases.
- FactCC Efficacy: FactCC-generated factually inconsistent summaries pose a significant challenge to some LLMs, as they are often rated similarly to manually generated inconsistent summaries.
The paper contributes valuable tools and methodologies for assessing the factuality of LLMs, offering a critical view of their performance across models of different sizes and pretraining paradigms. The research paves the way for future work aimed at improving LLMs' handling of factual information, potentially extending these methodologies to other domains such as scientific literature and QA systems.
The benchmark and findings highlight the need for further work to ensure that LLM outputs are not only coherent but also factually accurate, improving the reliability of LLM-driven applications. The work also underscores the importance of careful evaluation design, in particular scoring functions such as pointwise mutual information (PMI) that better gauge a model's preference between summaries.
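As an illustration of PMI-style scoring, the sketch below computes log p(summary | document) - log p(summary) with a Hugging Face causal language model. The prompt template, the choice of GPT-2, and the use of a bare "Summary:" prefix for the unconditional term are assumptions made to keep the example self-contained; they are not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal LM works here; GPT-2 keeps the example small (assumption, not the paper's models).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def log_prob(prefix: str, continuation: str) -> float:
    """Sum of token log-probabilities of `continuation` given `prefix`."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    cont_ids = tokenizer(continuation, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, cont_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probability of each token given all preceding tokens.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = input_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the positions that predict the continuation tokens.
    return token_lp[:, prefix_ids.shape[1] - 1 :].sum().item()


def pmi_score(document: str, summary: str) -> float:
    """PMI-style score: log p(summary | document) - log p(summary)."""
    prompt = f"Document: {document}\nSummary:"  # illustrative template
    return log_prob(prompt, " " + summary) - log_prob("Summary:", " " + summary)
```

In the benchmark's terms, a model "prefers" whichever summary receives the higher score, which is exactly the comparison performed in the accuracy sketch earlier in this summary.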