ChatGPT as a Factual Inconsistency Evaluator for Text Summarization

Published 27 Mar 2023 in cs.CL | (2303.15621v2)

Abstract: The performance of text summarization has been greatly boosted by pre-trained language models. A main concern of existing methods is that most generated summaries are not factually consistent with their source documents. To alleviate the problem, many efforts have focused on developing effective factuality evaluation metrics based on natural language inference, question answering, syntactic dependency, etc. However, these approaches are limited by either their high computational complexity or the uncertainty introduced by multi-component pipelines, resulting in only partial agreement with human judgement. Most recently, large language models (LLMs) have shown excellent performance in not only text generation but also language comprehension. In this paper, we particularly explore ChatGPT's ability to evaluate factual inconsistency under a zero-shot setting by examining it on both coarse-grained and fine-grained evaluation tasks including binary entailment inference, summary ranking, and consistency rating. Experimental results indicate that ChatGPT generally outperforms previous evaluation metrics across the three tasks, indicating its great potential for factual inconsistency evaluation. However, a closer inspection of ChatGPT's output reveals certain limitations including its preference for more lexically similar candidates, false reasoning, and inadequate understanding of instructions.


Authors (3)

Citations (63)


Summary

The paper evaluates ChatGPT as a zero-shot detector of factual inconsistencies on three tasks: entailment inference, summary ranking, and consistency rating.
It demonstrates improved performance with chain-of-thought prompting, although limitations arise from lexical bias and shallow semantic understanding.
The study highlights the need for future research in fine-tuning, enhancing semantic reasoning, and refining prompt strategies for factual evaluation.


Introduction

The paper "ChatGPT as a Factual Inconsistency Evaluator for Text Summarization" (2303.15621) explores the potential of using ChatGPT for evaluating factual inconsistency in text summarization under zero-shot settings. The primary focus is on tasks such as binary entailment inference, summary ranking, and consistency rating. The investigation highlights ChatGPT's performance, pointing out limitations and suggesting areas for future exploration.

Coarse- and Fine-Grained Evaluation Tasks

The authors detail three specific tasks to evaluate factual inconsistency: entailment inference, summary ranking, and consistency rating.

Entailment Inference: This task is framed as a binary classification problem where the goal is to determine if a summary is consistent with the associated document. Two prompts were used: a zero-shot prompt and a zero-shot chain-of-thought (CoT) prompt. The latter includes step-by-step reasoning to enhance consistency detection.
Figure 1: Sensitivity and specificity of ChatGPT with the zero-shot CoT prompt (ZS-CoT).
Summary Ranking: In contrast to binary entailment, this task evaluates whether a model can rank a consistent summary higher than an inconsistent one based on a given document. The zero-shot prompt is similar but asks ChatGPT to choose between two summaries.
Figure 2: ChatGPT's actions when given the same source document and an inconsistent summary but with and without a consistent one. The red underlined text in the article is content highly related to the candidate summaries.
Consistency Rating: This involves rating the consistency of a summary with the source article on a numeric scale. This task assesses ChatGPT's ability to provide fine-grained ratings of factual accuracy.
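The three task formats above can be sketched as prompt builders and reply parsers. This is a minimal illustration, not the paper's setup: the prompt wording, the requested "Answer: ..." final line, the function names, and the default 1 to 10 rating range are all assumptions made for the example, and no model is actually called.

```python
import re
from typing import Optional


def entailment_prompt(article: str, summary: str, chain_of_thought: bool = False) -> str:
    """Binary consistency prompt; the wording is illustrative, not the paper's."""
    if chain_of_thought:
        instruction = ("Explain your reasoning step by step, then finish with "
                       "a final line 'Answer: yes' or 'Answer: no'.")
    else:
        instruction = "Reply with a single line 'Answer: yes' or 'Answer: no'."
    return (f"Article: {article}\nSummary: {summary}\n"
            f"Is the summary factually consistent with the article?\n{instruction}")


def ranking_prompt(article: str, summary_a: str, summary_b: str) -> str:
    """Pairwise prompt: ask which of two summaries is more consistent."""
    return (f"Article: {article}\nSummary A: {summary_a}\nSummary B: {summary_b}\n"
            "Which summary is more factually consistent with the article?\n"
            "Reply with a single line 'Answer: A' or 'Answer: B'.")


def rating_prompt(article: str, summary: str, low: int = 1, high: int = 10) -> str:
    """Rating prompt on a numeric scale (the range is an assumed default)."""
    return (f"Article: {article}\nSummary: {summary}\n"
            f"Rate the factual consistency of the summary from {low} to {high}.\n"
            "Reply with a single line 'Answer: <number>'.")


def _final_answer(reply: str, pattern: str) -> Optional[str]:
    """Return the last 'Answer: <value>' found in the reply, or None."""
    matches = re.findall(rf"answer:\s*({pattern})\b", reply, flags=re.IGNORECASE)
    return matches[-1].lower() if matches else None


def parse_consistency(reply: str) -> Optional[bool]:
    """True for 'yes', False for 'no', None when the reply ignores the format."""
    value = _final_answer(reply, "yes|no")
    return None if value is None else value == "yes"


def parse_ranking(reply: str) -> Optional[str]:
    """Return 'A' or 'B', or None when no choice can be read."""
    value = _final_answer(reply, "[ab]")
    return None if value is None else value.upper()


def parse_rating(reply: str, low: int = 1, high: int = 10) -> Optional[int]:
    """Return the rating if it lies within [low, high], otherwise None."""
    value = _final_answer(reply, r"\d+")
    if value is None:
        return None
    rating = int(value)
    return rating if low <= rating <= high else None
```

Asking for an explicit final line keeps parsing simple for CoT replies, where words such as "no" can also appear inside the reasoning. Returning None for unparseable replies makes cases where the model ignores the instructions visible instead of silently counting them as a label.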

Evaluation and Findings

The evaluation on benchmark datasets reveals the following key findings:

ChatGPT shows promising potential for zero-shot factual inconsistency detection, outperforming prior state-of-the-art metrics across most datasets in the entailment inference task.
The zero-shot CoT prompting significantly improves performance by encouraging a more reasoned decision-making process.
However, ChatGPT tends to treat high lexical similarity between a summary and its source as evidence of consistency. It sometimes lacks deeper semantic understanding and therefore misses subtle factual inconsistencies.
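Sensitivity and specificity, as reported for the CoT prompt in Figure 1, can be computed from binary predictions as in the sketch below. The convention that "consistent" is the positive class is an assumption made here, and balanced accuracy is included only as a common single-number summary of the two rates; neither is confirmed by this summary as the paper's exact protocol.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(gold: Sequence[bool], pred: Sequence[bool]) -> Tuple[float, float]:
    """Return (sensitivity, specificity); True means 'consistent' (assumed positive class)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g and p)          # consistent, predicted consistent
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)      # consistent, predicted inconsistent
    tn = sum(1 for g, p in zip(gold, pred) if not g and not p)  # inconsistent, predicted inconsistent
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)      # inconsistent, predicted consistent
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in gold")
    return tp / (tp + fn), tn / (tn + fp)


def balanced_accuracy(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    """Mean of sensitivity and specificity, robust to class imbalance."""
    sensitivity, specificity = sensitivity_specificity(gold, pred)
    return (sensitivity + specificity) / 2


# Toy labels invented for illustration (not data from the paper).
gold = [True, True, True, False, False]
pred = [True, True, False, True, False]
sens, spec = sensitivity_specificity(gold, pred)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} "
      f"balanced_accuracy={balanced_accuracy(gold, pred):.3f}")
```

Under this convention, an evaluator that accepts lexically similar but wrong summaries would show up as low specificity, because inconsistent summaries are being labeled consistent.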

Error Analysis

The paper also presents several limitations in ChatGPT's performance:

Lexical Bias: ChatGPT tends to misclassify semantically incorrect summaries as correct if they have high lexical similarity to the source text.
Figure 3: An example in which ChatGPT fails to adhere to the given definition of consistency.
False Reasoning: Instances were found where ChatGPT made incorrect logical inferences under CoT-style prompting, especially where explanations were influenced by initial conclusions.
Figure 4: An example in which ChatGPT reasons incorrectly.
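To make the lexical-bias failure mode concrete, the sketch below measures how many summary tokens also occur in the source. The overlap measure, the tokenizer, and the example sentences are all invented for illustration and are not taken from the paper. The point is only that surface overlap and factual consistency can diverge.

```python
import re


def unigram_overlap(summary: str, source: str) -> float:
    """Fraction of summary tokens that also appear in the source (0.0 if empty)."""
    summary_tokens = re.findall(r"[a-z0-9]+", summary.lower())
    source_vocab = set(re.findall(r"[a-z0-9]+", source.lower()))
    if not summary_tokens:
        return 0.0
    shared = sum(1 for token in summary_tokens if token in source_vocab)
    return shared / len(summary_tokens)


# Invented example: the first summary copies source words but swaps the fact,
# while the second paraphrases the source correctly.
source = ("The company reported a profit of 5 million dollars in 2021, "
          "up from a loss in 2020.")
inconsistent = "The company reported a loss of 5 million dollars in 2021."
consistent = "The firm earned 5 million dollars in 2021."

print(unigram_overlap(inconsistent, source))  # 1.0: every token occurs in the source
print(unigram_overlap(consistent, source))    # 0.75: 'firm' and 'earned' are new words
```

Here the factually wrong summary has the higher overlap, which is the kind of candidate a lexically biased evaluator would be expected to prefer.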

Conclusion

The preliminary results indicate that ChatGPT could serve as a viable tool for evaluating factual inconsistency in text summarization, and that chain-of-thought prompts notably enhance its performance. Fine-tuning, better handling of semantic entailment, and refined prompt design remain open directions for addressing the lexical bias and reasoning errors observed. Progress in these areas could make ChatGPT more reliable in practical summarization systems.
