
Is ChatGPT a Good NLG Evaluator? A Preliminary Study (2303.04048v3)

Published 7 Mar 2023 in cs.CL and cs.AI

Abstract: Recently, the emergence of ChatGPT has attracted wide attention from the computational linguistics community. Many prior studies have shown that ChatGPT achieves remarkable performance on various NLP tasks in terms of automatic evaluation metrics. However, the ability of ChatGPT to serve as an evaluation metric is still underexplored. Considering assessing the quality of natural language generation (NLG) models is an arduous task and NLG metrics notoriously show their poor correlation with human judgments, we wonder whether ChatGPT is a good NLG evaluation metric. In this report, we provide a preliminary meta-evaluation on ChatGPT to show its reliability as an NLG metric. In detail, we regard ChatGPT as a human evaluator and give task-specific (e.g., summarization) and aspect-specific (e.g., relevance) instruction to prompt ChatGPT to evaluate the generated results of NLG models. We conduct experiments on five NLG meta-evaluation datasets (including summarization, story generation and data-to-text tasks). Experimental results show that compared with previous automatic metrics, ChatGPT achieves state-of-the-art or competitive correlation with human judgments in most cases. In addition, we find that the effectiveness of the ChatGPT evaluator might be influenced by the creation method of the meta-evaluation datasets. For the meta-evaluation datasets which are created greatly depending on the reference and thus are biased, the ChatGPT evaluator might lose its effectiveness. We hope our preliminary study could prompt the emergence of a general-purposed reliable NLG metric.

Assessing ChatGPT as an NLG Evaluation Metric

The paper, titled "Is ChatGPT a Good NLG Evaluator? A Preliminary Study," explores the efficacy of ChatGPT as an evaluation metric for natural language generation (NLG) models. The primary focus of the paper is to address the challenges of evaluating NLG models, highlighting the notoriously poor correlation of traditional metrics with human judgments. The paper offers a preliminary meta-evaluation to test the potential of ChatGPT in filling this evaluative role.

The authors adopt a methodology where ChatGPT is prompted to act as a human evaluator, with task-specific and aspect-specific instructions provided to judge the outputs of NLG models. They perform experiments across five distinct NLG meta-evaluation datasets encompassing tasks like summarization, story generation, and data-to-text generation. A notable finding is that ChatGPT demonstrates state-of-the-art or competitive correlation with human judgments, particularly excelling in the story generation task. However, the paper also identifies constraints: the effectiveness of ChatGPT as an evaluator is affected by the creation methodology of the meta-evaluation datasets and by inherent biases when datasets rely heavily on references.
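For concreteness, the sketch below shows how such a task- and aspect-specific prompt might be assembled and sent to a chat model, assuming the openai Python client; the template wording, model name, and helper function are illustrative stand-ins rather than the paper's exact setup.

```python
# Minimal sketch of aspect-specific NLG evaluation with a chat model.
# Assumes the `openai` Python client; the prompt wording is illustrative,
# not the exact template used in the paper.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def evaluate_summary(article: str, summary: str, aspect: str = "relevance") -> str:
    """Ask the model to rate one summary on one aspect (one to five stars)."""
    prompt = (
        f"Score the following summary with respect to {aspect} on a scale of "
        "one to five stars, where one star means poor and five stars means "
        "excellent. Return only the number of stars.\n\n"
        f"Article: {article}\n\nSummary: {summary}\n\nStars:"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # stand-in for the ChatGPT model used in the paper
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scoring
    )
    return response.choices[0].message.content.strip()


print(evaluate_summary("Full source article text ...", "Candidate summary text ..."))
```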

Experimental Insights and Results

The evaluation involved leveraging ChatGPT with both reference-free and reference-based prompts. Specific prompt styles, such as direct assessment (DA) and a one-to-five star rating, were explored to gauge their impact on the evaluation outcomes. The paper compares several types of metrics, including traditional n-gram ones (e.g., ROUGE) and embedding-based ones (e.g., BERTScore, MoverScore), while introducing ChatGPT as an LLM-based metric.
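As a point of reference, the following sketch shows how the baseline metrics named above can be computed for a single candidate, assuming the rouge-score and bert-score packages; the example sentences are placeholders, not data from the paper.

```python
# Sketch of computing the baseline metrics mentioned above for one candidate.
# Assumes the `rouge-score` and `bert-score` packages
# (pip install rouge-score bert-score).
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The committee approved the new budget on Tuesday."
candidate = "On Tuesday, the committee passed the new budget."

# n-gram overlap metric (ROUGE)
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print({name: round(s.fmeasure, 3) for name, s in rouge.items()})

# embedding-based metric (BERTScore)
P, R, F1 = bert_score([candidate], [reference], lang="en")
print("BERTScore F1:", round(F1.item(), 3))
```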

  1. Text Summarization: ChatGPT surpassed existing automatic metrics on datasets such as SummEval, corroborating its potential as a superior evaluation metric. However, its performance varied with dataset biases; notably, it underperformed on RealSumm because of the high lexical bias inherent in the pyramid-recall method used to build that dataset.
  2. Story Generation: In this domain, where open-ended and creative outputs are prevalent, ChatGPT demonstrated exemplary performance, significantly outperforming traditional similarity-based metrics. Its scores aligned more closely with human judgment, which highlights its effectiveness in less structured NLG tasks.
  3. Data-to-Text Generation: The paper also examined BAGEL for data-to-text tasks, where ChatGPT achieved competitive correlations with human judgments, although it did not consistently outperform all baselines (the correlation computation itself is sketched after this list).
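To make "correlation with human judgments" concrete, the sketch below runs the meta-evaluation step with scipy; every score in it is an illustrative placeholder, not a number from the paper.

```python
# Sketch of the meta-evaluation step: correlate automatic scores with human
# judgments across a set of generated outputs. All numbers are made up.
from scipy.stats import spearmanr, kendalltau

human_scores = [4.5, 2.0, 3.5, 5.0, 1.5, 4.0]      # e.g., averaged annotator ratings
chatgpt_scores = [5.0, 2.0, 3.0, 5.0, 1.0, 4.0]    # model-assigned star ratings
rouge_scores = [0.42, 0.35, 0.30, 0.45, 0.28, 0.33]

for name, scores in [("ChatGPT", chatgpt_scores), ("ROUGE-L", rouge_scores)]:
    rho, _ = spearmanr(human_scores, scores)
    tau, _ = kendalltau(human_scores, scores)
    print(f"{name}: Spearman={rho:.3f}, Kendall={tau:.3f}")
```

A metric is judged better the higher its Spearman/Kendall correlation with the human ratings; this is the sense in which the paper reports ChatGPT as state-of-the-art or competitive.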

Implications and Future Directions

The findings indicate an encouraging potential for ChatGPT to serve as a generalized NLG evaluation metric, especially in creative and subjective NLG tasks where traditional metrics are less effective. However, results are sensitive to the prompt design, which needs to be carefully crafted for different tasks and aspects. This sensitivity underscores the importance of prompt engineering in maximizing ChatGPT's utility as an evaluator.

The paper implicitly suggests future research avenues, such as optimizing prompt designs and exploring LLM evaluators' capabilities across diverse languages and cross-lingual tasks. Moreover, considering the limitations introduced by dataset biases, there is a call for developing more robust and challenging meta-evaluation datasets that can provide a fair assessment of both NLG models and evaluation metrics like ChatGPT.

In conclusion, while ChatGPT shows promise as an NLG evaluator, further exploration and refinement are necessary to fully establish its utility and reliability across the variety of challenges presented by NLG tasks. The paper sets a foundation for the integration of LLMs into NLG evaluation, paving the way for advancements in assessment methodologies within the computational linguistics community.

Authors (9)
  1. Jiaan Wang (35 papers)
  2. Yunlong Liang (33 papers)
  3. Fandong Meng (174 papers)
  4. Zengkui Sun (7 papers)
  5. Haoxiang Shi (13 papers)
  6. Zhixu Li (43 papers)
  7. Jinan Xu (64 papers)
  8. Jianfeng Qu (17 papers)
  9. Jie Zhou (687 papers)
Citations (366)