
Towards a Unified Multi-Dimensional Evaluator for Text Generation

Published 13 Oct 2022 in cs.CL (arXiv:2210.07197v1)

Abstract: Multi-dimensional evaluation is the dominant paradigm for human evaluation in Natural Language Generation (NLG), i.e., evaluating the generated text from multiple explainable dimensions, such as coherence and fluency. However, automatic evaluation in NLG is still dominated by similarity-based metrics, and we lack a reliable framework for a more comprehensive evaluation of advanced models. In this paper, we propose a unified multi-dimensional evaluator UniEval for NLG. We re-frame NLG evaluation as a Boolean Question Answering (QA) task, and by guiding the model with different questions, we can use one evaluator to evaluate from multiple dimensions. Furthermore, thanks to the unified Boolean QA format, we are able to introduce an intermediate learning phase that enables UniEval to incorporate external knowledge from multiple related tasks and gain further improvement. Experiments on three typical NLG tasks show that UniEval correlates substantially better with human judgments than existing metrics. Specifically, compared to the top-performing unified evaluators, UniEval achieves a 23% higher correlation on text summarization, and over 43% on dialogue response generation. Also, UniEval demonstrates a strong zero-shot learning ability for unseen evaluation dimensions and tasks. Source code, data and all pre-trained evaluators are available on our GitHub repository (https://github.com/maszhongming/UniEval).


Summary

  • The paper introduces UniEval, a novel evaluation framework that converts NLG tasks into Boolean QA for comprehensive multi-dimensional analysis.
  • UniEval leverages pseudo data construction and multi-task learning to robustly evaluate coherence, fluency, and relevance in text generation.
  • Experimental results show a 23% improvement in correlation with human judgments in summarization and strong zero-shot evaluation capabilities.


The paper "Towards a Unified Multi-Dimensional Evaluator for Text Generation" (2210.07197) addresses the challenge of evaluating Natural Language Generation (NLG) models using a more comprehensive framework than traditional similarity-based metrics. It introduces UniEval, a unified multi-dimensional evaluator, which re-frames evaluation tasks as Boolean QA problems, enabling a single model to assess multiple dimensions like coherence and fluency.

Introduction

Traditional NLG metrics such as ROUGE and BLEU measure lexical overlap between generated and reference text, and they increasingly miss quality distinctions as models grow more capable. UniEval instead recasts NLG assessment as Boolean QA, enabling multi-dimensional evaluations that align more closely with human judgment.
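
To see why lexical overlap can mislead, consider a toy ROUGE-1-style recall, hand-rolled below rather than taken from any library: a paraphrase that a human would rate highly scores poorly simply because its surface words differ. The example sentences are invented for illustration.

```python
from collections import Counter

def unigram_recall(candidate: str, reference: str) -> float:
    """Fraction of reference unigrams that also appear in the candidate."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped word-count overlap
    return overlap / max(sum(ref.values()), 1)

reference = "the committee approved the budget on friday"
paraphrase = "lawmakers signed off on the spending plan at week's end"
print(unigram_recall(paraphrase, reference))  # low despite preserved meaning
```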

Converting each evaluation dimension into a Boolean QA format lets UniEval pose a dimension-specific yes/no question, e.g., for coherence: "Is this a coherent summary to the document?". This design makes multi-dimensional evaluation possible with a single model and eases the incorporation of external knowledge via intermediate multi-task learning (Figure 1).

Figure 1: The overall framework of UniEval, converting NLG tasks into a unified QA format for evaluation.
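
A minimal sketch of the scoring mechanism, assuming a T5-style seq2seq evaluator: the model is prompted with a yes/no question plus the text to judge, and the score is the probability mass on "Yes" versus "No" at the first decoding step. The checkpoint name and prompt template below reflect the public UniEval repository as best understood here; verify both against the released code before relying on them.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL = "MingZhong/unieval-sum"  # released summarization evaluator (assumed name)
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL)

def boolean_qa_score(prompt: str) -> float:
    """Score a yes/no evaluation question as P(yes) / (P(yes) + P(no))."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    # T5-style models begin decoding from the configured start token.
    start = torch.full((1, 1), model.config.decoder_start_token_id)
    with torch.no_grad():
        logits = model(**inputs, decoder_input_ids=start).logits[0, -1]
    yes_id = tokenizer("Yes", add_special_tokens=False).input_ids[0]
    no_id = tokenizer("No", add_special_tokens=False).input_ids[0]
    p_yes, p_no = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return p_yes.item()

prompt = (
    "question: Is this a coherent summary to the document? </s> "
    "summary: A cat rested on a mat all afternoon. </s> "
    "document: The cat spent the afternoon lying on the mat."
)
print(boolean_qa_score(prompt))
```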

Method

UniEval is trained to evaluate text along multiple dimensions through pseudo data construction and staged supervision: rule-based transformations turn reference outputs into synthetic positive and negative samples for dimensions such as coherence, consistency, fluency, and relevance.
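
To make the construction step concrete, here is a hedged sketch of rule-based pseudo data. The specific perturbations (sentence shuffling for coherence, random word dropping for fluency) are plausible stand-ins for the paper's rules, not a reproduction of them.

```python
import random

def coherence_negative(summary: str) -> str:
    """Break coherence by shuffling the sentence order."""
    sentences = [s.strip() for s in summary.split(".") if s.strip()]
    random.shuffle(sentences)
    return ". ".join(sentences) + "."

def fluency_negative(summary: str, drop_rate: float = 0.15) -> str:
    """Break fluency by randomly dropping words."""
    words = summary.split()
    return " ".join(w for w in words if random.random() > drop_rate)

reference = "The council approved the plan. Construction begins in May."
positive = {"text": reference, "answer": "Yes"}                 # unperturbed sample
negative = {"text": coherence_negative(reference), "answer": "No"}
```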

Beyond the evaluation data itself, UniEval adds an intermediate multi-task learning phase over related QA, linguistic, and self-supervised datasets. Training on datasets such as BoolQ and CoLA injects the external knowledge needed for complex evaluations, making the evaluator more robust across dimensions and helping it transfer to unseen tasks and dimensions.
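
The unified format is what makes this mixing possible: each intermediate task is cast as the same source-question, target-yes/no pair. BoolQ and CoLA are the datasets named in the paper; the prompt templates below are assumptions for illustration.

```python
def boolq_example(passage: str, question: str, label: bool) -> dict:
    """Cast a BoolQ reading-comprehension item into the Boolean QA format."""
    return {
        "src": f"question: {question}? </s> passage: {passage}",
        "tgt": "Yes" if label else "No",
    }

def cola_example(sentence: str, acceptable: bool) -> dict:
    """Linguistic acceptability maps naturally onto a fluency-style question."""
    return {
        "src": f"question: Is this a grammatical sentence? </s> sentence: {sentence}",
        "tgt": "Yes" if acceptable else "No",
    }

mixed_batch = [
    boolq_example("The Amazon is the largest rainforest.", "is the amazon a rainforest", True),
    cola_example("The cat sat mat the on.", False),
]
```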

Experiments and Results

UniEval correlates substantially better with human judgments than state-of-the-art single-dimensional and unified evaluators across multiple NLG tasks. On text summarization, it achieves a 23% higher correlation than the top-performing unified evaluator, BARTScore, across the coherence, relevance, and fluency dimensions.
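
The reported numbers come from correlation analysis between metric scores and human ratings. A minimal illustration of such an analysis, with placeholder values rather than data from the paper:

```python
from scipy.stats import kendalltau, spearmanr

metric_scores = [0.81, 0.42, 0.67, 0.90, 0.55]  # evaluator outputs (placeholder)
human_ratings = [4.5, 2.0, 3.5, 5.0, 3.0]       # annotator averages (placeholder)

rho, _ = spearmanr(metric_scores, human_ratings)
tau, _ = kendalltau(metric_scores, human_ratings)
print(f"Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```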

Furthermore, UniEval shows strong zero-shot ability: simple question modifications let it handle unseen evaluation dimensions. It also transfers to dialogue response generation, where it outperforms existing evaluators by over 43% in correlation, as exemplified by its zero-shot performance on the "understandability" dimension (Figure 2).

Figure 2: Zero-shot performance on the "understandability" dimension in dialogue response generation.
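
Concretely, zero-shot extension amounts to writing a new yes/no question for the unseen dimension and reusing the trained evaluator unchanged. The question strings below are illustrative assumptions, not the paper's exact templates.

```python
# Zero-shot extension: keep the evaluator fixed and only add a question.
dimension_questions = {
    "coherence": "Is this a coherent summary to the document?",
    "fluency": "Is this a fluent paragraph?",
    # Unseen at training time; supported purely by writing a new question:
    "understandability": "Is this an understandable response in the dialogue?",
}

def build_prompt(dimension: str, output_text: str, context: str) -> str:
    """Assemble the Boolean QA prompt for any dimension, seen or unseen."""
    return (
        f"question: {dimension_questions[dimension]} </s> "
        f"output: {output_text} </s> context: {context}"
    )

prompt = build_prompt(
    "understandability",
    "Sure, the museum opens at nine.",
    "What time does the museum open?",
)
```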

Conclusion

UniEval sets a new direction for evaluators that accommodate varied dimensions and tasks seamlessly. By unifying evaluation into a Boolean QA format, it simplifies the evaluation pipeline while drawing external knowledge from diverse datasets to improve evaluation quality. Its strong correlation with human assessments and its zero-shot adaptability make it a promising basis for deploying multi-dimensional evaluators in NLG systems.
