Assessing ChatGPT as an NLG Evaluation Metric
The paper, titled "Is ChatGPT a Good NLG Evaluator? A Preliminary Study," explores the efficacy of ChatGPT as an evaluation metric for natural language generation (NLG) models. Its primary focus is the challenge of evaluating NLG models, highlighting the notoriously poor correlation of traditional automatic metrics with human judgments. The paper offers a preliminary meta-evaluation to test whether ChatGPT can fill this evaluative role.
The authors adopt a methodology in which ChatGPT is prompted to act like a human evaluator, with task-specific and aspect-specific instructions provided to judge the outputs of NLG models. They perform experiments across five distinct NLG meta-evaluation datasets covering tasks such as summarization, story generation, and data-to-text generation. A notable finding is that ChatGPT achieves state-of-the-art or competitive correlation with human judgments, particularly excelling in the story generation task. However, the paper also identifies constraints: ChatGPT's effectiveness as an evaluator depends on how the meta-evaluation datasets were created, and it inherits biases when those datasets rely heavily on references.
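As an illustration of this setup, the sketch below shows how one might prompt ChatGPT to score a single output for a given task and quality aspect. The prompt wording, the `chatgpt_evaluate` helper, and the `gpt-3.5-turbo` model name are assumptions made for illustration rather than the paper's exact template; the client usage follows the openai>=1.0 Python SDK.

```python
# Minimal sketch of the "ChatGPT as human evaluator" setup (illustrative only).
# Assumptions: the prompt wording is paraphrased, and "gpt-3.5-turbo" stands in
# for the ChatGPT endpoint used in the paper; requires the openai>=1.0 SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def chatgpt_evaluate(task: str, aspect: str, source: str, output: str) -> str:
    """Ask ChatGPT to rate one generated output for one quality aspect."""
    prompt = (
        f"Score the following {task} output with respect to {aspect} "
        "on a scale of one to five stars, where one star means poor "
        "and five stars means excellent.\n\n"
        f"Source:\n{source}\n\n"
        f"Generated output:\n{output}\n\n"
        "Stars:"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scoring for reproducibility
    )
    return response.choices[0].message.content
```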
Experimental Insights and Results
The evaluation used ChatGPT with both reference-free and reference-based prompts. Specific prompt styles, such as direct assessment (DA) and one-to-five-stars ranking, were explored to gauge their impact on the evaluation outcomes. The paper compares several types of metrics, including traditional n-gram-based metrics (e.g., ROUGE) and embedding-based metrics (e.g., BERTScore, MoverScore), while introducing ChatGPT as an LLM-based metric.
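To make the difference between these prompt styles concrete, the templates below paraphrase what a reference-free DA prompt and a reference-based one-to-five-stars prompt might look like; the wording is hypothetical rather than taken from the paper, and the small parser simply extracts a numeric score from ChatGPT's free-form reply.

```python
# Hypothetical paraphrases of the prompt variants described above (not the
# paper's exact wording). DA asks for a 0-100 score; the star-style prompt
# asks for a 1-5 rating; reference-based variants add a human reference.
import re
from typing import Optional

DA_REFERENCE_FREE = (
    "Score the following {task} output with respect to {aspect} "
    "from 0 (worst) to 100 (best).\n\n"
    "Source:\n{source}\n\nOutput:\n{output}\n\nScore:"
)

STARS_REFERENCE_BASED = (
    "Rate the following {task} output with respect to {aspect} on a scale of "
    "one to five stars, taking the human reference into account.\n\n"
    "Source:\n{source}\n\nReference:\n{reference}\n\nOutput:\n{output}\n\nStars:"
)


def parse_score(reply: str) -> Optional[float]:
    """Extract the first number from ChatGPT's free-form reply, if any."""
    match = re.search(r"\d+(?:\.\d+)?", reply)
    return float(match.group()) if match else None
```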
- Text Summarization: ChatGPT surpassed existing automatic metrics on datasets like SummEval, supporting its potential to serve as a superior evaluation metric. However, its performance varied with dataset biases; notably, it underperformed on RealSumm due to the lexical bias inherent in that dataset's pyramid-recall-based annotation method.
- Story Generation: In this domain, where open-ended and creative outputs are prevalent, ChatGPT performed exceptionally well, significantly outperforming traditional similarity-based metrics. It aligned markedly better with human judgment, underscoring its effectiveness in less structured NLG tasks.
- Data-to-Text Generation: The paper also examined BAGEL for data-to-text tasks, where ChatGPT provided competitive correlations with human judgments, although it did not consistently outperform all baselines.
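The correlations behind these results come from comparing metric scores with human judgments. The sketch below shows one common way to compute sample-level Spearman and Kendall correlations; the data layout and numbers are invented for illustration and do not reproduce the paper's figures.

```python
# Sketch of the meta-evaluation step: correlate metric scores with human
# judgments per source document, then average (sample-level correlation).
# The data layout and the numbers below are made up for illustration.
from statistics import mean
from scipy.stats import kendalltau, spearmanr


def sample_level_correlation(metric_scores: dict, human_scores: dict):
    """Each dict maps a source id to per-system scores in the same order."""
    rho = mean(
        spearmanr(metric_scores[sid], human_scores[sid]).correlation
        for sid in metric_scores
    )
    tau = mean(
        kendalltau(metric_scores[sid], human_scores[sid]).correlation
        for sid in metric_scores
    )
    return rho, tau


# Toy example: two source documents, three systems each (made-up scores).
metric = {"doc1": [4.0, 2.0, 5.0], "doc2": [3.0, 2.5, 1.0]}
human = {"doc1": [70, 40, 90], "doc2": [60, 55, 20]}
print(sample_level_correlation(metric, human))
```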
Implications and Future Directions
The findings indicate encouraging potential for ChatGPT to serve as a general-purpose NLG evaluation metric, especially in creative and subjective NLG tasks where traditional metrics are less effective. However, results are sensitive to prompt design, which must be carefully tailored to each task and evaluation aspect. This sensitivity underscores the importance of prompt engineering in maximizing ChatGPT's utility as an evaluator.
The paper implicitly suggests future research avenues, such as optimizing prompt designs and exploring LLM evaluators' capabilities across diverse languages and cross-lingual tasks. Moreover, considering the limitations introduced by dataset biases, there is a call for developing more robust and challenging meta-evaluation datasets that can provide a fair assessment of both NLG models and evaluation metrics like ChatGPT.
In conclusion, while ChatGPT shows promise as an NLG evaluator, further exploration and refinement are necessary to fully establish its utility and reliability across the variety of challenges presented by NLG tasks. The paper sets a foundation for the integration of LLMs into NLG evaluation, paving the way for advancements in assessment methodologies within the computational linguistics community.