G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

Published 29 Mar 2023 in cs.CL and cs.AI | arXiv:2303.16634v3

Abstract: The quality of texts generated by natural language generation (NLG) systems is hard to measure automatically. Conventional reference-based metrics, such as BLEU and ROUGE, have been shown to have relatively low correlation with human judgments, especially for tasks that require creativity and diversity. Recent studies suggest using LLMs as reference-free metrics for NLG evaluation, which have the benefit of being applicable to new tasks that lack human references. However, these LLM-based evaluators still have lower human correspondence than medium-size neural evaluators. In this work, we present G-Eval, a framework of using LLMs with chain-of-thoughts (CoT) and a form-filling paradigm, to assess the quality of NLG outputs. We experiment with two generation tasks, text summarization and dialogue generation. We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human on summarization task, outperforming all previous methods by a large margin. We also propose preliminary analysis on the behavior of LLM-based evaluators, and highlight the potential issue of LLM-based evaluators having a bias towards the LLM-generated texts. The code is at https://github.com/nlpyang/geval

Citations (826)

Summary

  • The paper introduces G-Eval, which uses GPT-4 and chain-of-thought reasoning to provide a more human-aligned evaluation of NLG outputs.
  • The methodology involves a three-stage pipeline with prompt-based evaluation, CoT generation, and a probability-weighted scoring function.
  • The experimental results show superior performance on benchmarks like SummEval, Topical-Chat, and QAGS, while noting potential bias towards LLM-generated outputs.

Introduction

Evaluating the quality of texts generated by NLG systems is a complex challenge, particularly because conventional metrics such as BLEU and ROUGE often correlate poorly with human judgments. This gap is most apparent in tasks requiring creativity and diversity. Recent approaches propose using LLMs as reference-free metrics for evaluating NLG outputs, but these evaluators frequently show lower correspondence with human assessments than medium-sized neural models. G-Eval is introduced as a framework that employs GPT-4 with CoT reasoning to improve the evaluation of NLG outputs across diverse tasks (Figure 1).

Figure 1: The overall framework of G-Eval. We first input Task Introduction and Evaluation Criteria to the LLM, and ask it to generate a CoT of detailed Evaluation Steps. Then we use the prompt along with the generated CoT to evaluate the NLG outputs in a form-filling paradigm. Finally, we use the probability-weighted summation of the output scores as the final score.

Methodology

G-Eval uses a three-stage pipeline: a prompt-based evaluator that defines the task and criteria, automatic generation of a CoT of evaluation steps, and a probability-weighted scoring function.

Prompt-based Evaluator

The first stage crafts a detailed prompt that defines the evaluation task and its criteria. Task-specific criteria such as coherence or conciseness are written directly into this prompt, giving the model the context it needs for the later stages.
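As a rough illustration, a coherence-evaluation prompt for summarization might look like the sketch below. The criterion wording, the 1-5 scale, and the placeholder names (cot_steps, document, summary) are assumptions for illustration; the exact prompts in the G-Eval repository may differ.

```python
# Illustrative G-Eval-style prompt template for summarization coherence.
# The placeholders are filled with str.format before the prompt is sent
# to the LLM; the wording and scale here are assumptions, not quotes.
COHERENCE_PROMPT = """\
You will be given one summary written for a news article.
Your task is to rate the summary on one metric.

Evaluation Criteria:
Coherence (1-5) - the collective quality of all sentences. The summary
should be well-structured and well-organized, not just a heap of
related information.

Evaluation Steps:
{cot_steps}

Source Article:
{document}

Summary:
{summary}

Evaluation Form (scores ONLY):
- Coherence:"""
```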

Chain-of-Thought (CoT) Generation

Given the task description and criteria, the LLM itself generates a CoT of intermediate evaluation steps. For example, evaluating coherence in text summarization may involve checking the logical flow between summary sentences. This structured reasoning helps the model assess NLG outputs more reliably.
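A minimal sketch of this auto-CoT stage is shown below, assuming the current OpenAI Python client; the model name, prompt wording, and the generate_evaluation_steps helper are illustrative assumptions, not the paper's code.

```python
# Sketch: ask the LLM to expand the task introduction and criteria into a
# numbered chain of evaluation steps (the auto-CoT). Requires an OpenAI API
# key in the environment; model choice and wording are illustrative.
from openai import OpenAI

client = OpenAI()

def generate_evaluation_steps(task_intro: str, criteria: str) -> str:
    prompt = (
        f"{task_intro}\n\n"
        f"Evaluation Criteria:\n{criteria}\n\n"
        "Write a concise numbered list of evaluation steps an annotator "
        "should follow to score an output against these criteria."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content
```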

Scoring Function

The scoring function uses a probability-weighted summation of the discrete scores the LLM can output: the final score is the expectation sum_i p(s_i) × s_i, where p(s_i) is the probability the model assigns to score value s_i. This addresses two issues with taking the model's single output score directly: score distributions tend to be dominated by one value, and integer-only outputs cannot capture fine-grained quality differences. Weighting by token probabilities yields a more nuanced, continuous evaluation metric.
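A minimal sketch of this expectation, assuming the probabilities of the candidate score tokens have already been extracted (e.g. from token log-probabilities or repeated sampling); the weighted_score helper and the example numbers are hypothetical.

```python
# Probability-weighted scoring: the final score is the expectation over the
# discrete candidate scores, score = sum_i p(s_i) * s_i. In practice the
# probabilities would come from the LLM's score-token probabilities.
def weighted_score(score_probs: dict[int, float]) -> float:
    """score_probs maps each candidate score (e.g. 1..5) to its probability."""
    total = sum(score_probs.values())
    if total == 0:
        raise ValueError("no probability mass on any candidate score")
    # Renormalize in case the candidate scores do not cover all of the mass.
    return sum(s * p for s, p in score_probs.items()) / total

# Example: most mass on 4, some on 3 and 5 -> a continuous score near 3.9.
print(weighted_score({1: 0.0, 2: 0.02, 3: 0.18, 4: 0.65, 5: 0.15}))  # 3.93
```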

Experimental Evaluation

The framework was validated on multiple benchmarks: SummEval, Topical-Chat, and QAGS, where it showed stronger correlation between model assessments and human judgments than prior methods. On SummEval, G-Eval-4 improved correlation with human evaluations, surpassing previous state-of-the-art metrics such as GPTScore and BARTScore. Its performance on both creativity-driven and consistency-focused tasks underscores the robustness of the approach (Figure 2).

Figure 2: Averaged G-Eval-4's scores for human-written summaries and GPT-3.5 summaries, divided by human judges' preference.

Analysis and Implications

A notable finding from this study is G-Eval's bias toward LLM-generated outputs, potentially because similar criteria are applied during both text generation and evaluation. This creates a risk of self-reinforcement if the evaluator's scores are used as a feedback signal for LLM training, and it underscores the need for transparency and oversight when deploying LLM-based evaluators.

Furthermore, the use of CoT significantly boosts performance, particularly on coherence. Probability normalization trades some Kendall-Tau correlation for higher Spearman correlation, indicating a finer-grained ranking of generated text quality.
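For reference, sample-level correlations of this kind can be computed with SciPy as in the sketch below; the score lists are placeholders, not data from the paper.

```python
# Correlating an automatic evaluator's scores with human judgments using
# Spearman and Kendall-Tau (the meta-evaluation metrics discussed above).
from scipy.stats import spearmanr, kendalltau

human_scores = [4.0, 2.5, 3.5, 5.0, 1.5, 3.0]   # placeholder human ratings
geval_scores = [3.9, 2.8, 3.4, 4.7, 1.9, 3.2]   # placeholder evaluator scores

rho, _ = spearmanr(human_scores, geval_scores)
tau, _ = kendalltau(human_scores, geval_scores)
print(f"Spearman: {rho:.3f}  Kendall-Tau: {tau:.3f}")
```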

Conclusion

G-Eval advances NLG evaluation by combining LLMs, CoT reasoning, and a probability-weighted scoring mechanism. The approach not only surpasses traditional and recent evaluation metrics but also provides a more human-aligned assessment framework. Future work should explore strategies to mitigate the bias toward LLM-generated text and extend the framework to emerging NLG tasks.
