Towards a Unified Multi-Dimensional Evaluator for Text Generation (2210.07197v1)

Published 13 Oct 2022 in cs.CL

Abstract: Multi-dimensional evaluation is the dominant paradigm for human evaluation in Natural Language Generation (NLG), i.e., evaluating the generated text from multiple explainable dimensions, such as coherence and fluency. However, automatic evaluation in NLG is still dominated by similarity-based metrics, and we lack a reliable framework for a more comprehensive evaluation of advanced models. In this paper, we propose a unified multi-dimensional evaluator UniEval for NLG. We re-frame NLG evaluation as a Boolean Question Answering (QA) task, and by guiding the model with different questions, we can use one evaluator to evaluate from multiple dimensions. Furthermore, thanks to the unified Boolean QA format, we are able to introduce an intermediate learning phase that enables UniEval to incorporate external knowledge from multiple related tasks and gain further improvement. Experiments on three typical NLG tasks show that UniEval correlates substantially better with human judgments than existing metrics. Specifically, compared to the top-performing unified evaluators, UniEval achieves a 23% higher correlation on text summarization, and over 43% on dialogue response generation. Also, UniEval demonstrates a strong zero-shot learning ability for unseen evaluation dimensions and tasks. Source code, data and all pre-trained evaluators are available on our GitHub repository (https://github.com/maszhongming/UniEval).

An Expert Overview of "Towards a Unified Multi-Dimensional Evaluator for Text Generation"

The paper "Towards a Unified Multi-Dimensional Evaluator for Text Generation" introduces a novel approach, termed UniEval, which is designed to improve the evaluation of Natural Language Generation (NLG) systems. Traditional evaluation metrics such as ROUGE and BLEU primarily focus on similarity with reference text, potentially overlooking other critical quality dimensions such as coherence, consistency, fluency, and relevance. Recognizing these limitations, the authors propose a unified multi-dimensional evaluation framework based on reframing the task as a Boolean Question Answering problem, thereby enabling evaluation across multiple dimensions with a single model.

Technical Approach

UniEval builds on a pre-trained language model and recasts evaluation as a Boolean QA task. Each quality dimension is expressed as a yes/no question, such as "Is this a coherent summary?", so the same model can score different dimensions; the score for a dimension is derived from the probabilities the model assigns to its answer. An intermediate learning phase on multiple related tasks, made possible by the unified Boolean QA format, lets the model incorporate external knowledge and yields further improvement.
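To make the scoring step concrete, here is a minimal sketch of how a Boolean QA score could be derived from a model's answer probabilities. It assumes the score is the probability of "Yes" normalized over "Yes" and "No", computed from the two answer-token logits. The prompt layout, the fluency question wording, and all function names are hypothetical and are not taken from the UniEval code; no real model is called, and the logits are passed in by hand.

```python
import math

# Only the coherence question appears in the overview; the fluency wording
# is an illustrative assumption.
DIMENSION_QUESTIONS = {
    "coherence": "Is this a coherent summary?",
    "fluency": "Is this a fluent summary?",
}


def build_boolean_qa_input(question: str, output_text: str, source_text: str) -> str:
    """Assemble one evaluator input (hypothetical layout, not the official format)."""
    return f"question: {question} output: {output_text} source: {source_text}"


def boolean_qa_score(yes_logit: float, no_logit: float) -> float:
    """Return P(Yes) / (P(Yes) + P(No)) given the logits of the two answer tokens.

    Normalizing over only two tokens equals a sigmoid of the logit difference.
    The two branches keep math.exp from overflowing for large differences.
    """
    diff = yes_logit - no_logit
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def score_all_dimensions(logits_by_dimension: dict[str, tuple[float, float]]) -> dict[str, float]:
    """Map each dimension to a score in [0, 1] from its (yes_logit, no_logit) pair."""
    return {
        dim: boolean_qa_score(yes, no)
        for dim, (yes, no) in logits_by_dimension.items()
    }


if __name__ == "__main__":
    prompt = build_boolean_qa_input(
        DIMENSION_QUESTIONS["coherence"],
        "The city council approved the new budget.",
        "After a long debate, the council voted to approve the budget on Monday.",
    )
    print(prompt)
    # Made-up logits standing in for a real model's output.
    print(score_all_dimensions({"coherence": (3.0, 1.0), "fluency": (0.5, 0.5)}))
```

In this sketch, switching dimensions changes only the question string; the scoring function stays the same, which is the property that lets one evaluator cover several dimensions.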

Training is unsupervised and relies on pseudo data generated through rule-based transformations. Positive samples come from ground-truth data, while negative samples are derived with dimension-specific transformations, such as sentence replacement for coherence or antonym substitution for consistency.
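The two transformations named above can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' pipeline: sentences are pre-split, the distractor pool and the antonym table are supplied by the caller, and every function name here is hypothetical.

```python
import random


def make_coherence_negative(sentences, distractors, rng):
    """Replace one randomly chosen sentence with an unrelated distractor sentence."""
    corrupted = list(sentences)  # copy so the positive sample stays intact
    idx = rng.randrange(len(corrupted))
    corrupted[idx] = rng.choice(distractors)
    return corrupted


def make_consistency_negative(text, antonyms):
    """Swap the first word found in the antonym table; return None if none match."""
    words = text.split()
    for i, word in enumerate(words):
        replacement = antonyms.get(word.lower())
        if replacement is not None:
            words[i] = replacement
            return " ".join(words)
    return None


def build_coherence_examples(sentences, distractors, rng):
    """Pair the original text (answer "Yes") with its corrupted version (answer "No")."""
    question = "Is this a coherent summary?"
    negative = make_coherence_negative(sentences, distractors, rng)
    return [
        {"question": question, "text": " ".join(sentences), "answer": "Yes"},
        {"question": question, "text": " ".join(negative), "answer": "No"},
    ]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy run is repeatable
    summary = ["Sales rose in March.", "Analysts credit a new product line."]
    pool = ["The match ended in a draw."]
    for example in build_coherence_examples(summary, pool, rng):
        print(example)
    print(make_consistency_negative("Sales rose in March.", {"rose": "fell"}))
```

Each pair shares a question and differs only in the text and the Yes/No answer, which is what allows pseudo data for different dimensions to be mixed into one Boolean QA training set.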

Empirical Findings

Experiments on three NLG tasks, including text summarization and dialogue response generation, show that UniEval correlates substantially better with human judgments than existing metrics. Compared to the top-performing unified evaluators, it achieves a 23% higher correlation on text summarization and over 43% higher on dialogue response generation. UniEval also exhibits strong zero-shot ability, extending to unseen evaluation dimensions and tasks without additional training.
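As a rough illustration of what "correlation with human judgments" measures, the sketch below computes Spearman's rank correlation between metric scores and human ratings in plain Python. The overview does not say which correlation coefficient the paper reports, so Spearman is an assumed choice here. The `relative_gain` helper shows one reading of "23% higher" as a relative improvement, and all numbers are toy values, not results from the paper.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(metric_scores, human_scores):
    """Spearman correlation is the Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(metric_scores), average_ranks(human_scores))


def relative_gain(new_corr, baseline_corr):
    """Relative improvement of one correlation over a positive baseline."""
    return (new_corr - baseline_corr) / baseline_corr


if __name__ == "__main__":
    metric = [0.91, 0.40, 0.75, 0.62]  # toy evaluator scores
    human = [5, 2, 3, 4]               # toy human ratings
    print(spearman(metric, human))
    print(relative_gain(0.615, 0.5))   # toy correlations, about a 23% relative gain
```

A rank-based coefficient only asks whether the metric orders outputs the way annotators do, so it is unaffected by the scale of the metric's scores.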

Theoretical and Practical Implications

UniEval represents both a theoretical and a practical advance in NLG evaluation. Theoretically, it reinforces the importance of multi-dimensional evaluation for capturing the quality of generated text, suggesting that similarity-based metrics alone do not suffice for evaluating advanced generation models. Practically, it offers a unified and extensible framework that can adapt to diverse and evolving NLG tasks and dimensions, potentially reducing the need to develop task-specific evaluators.

Future Prospects in AI Development

UniEval’s approach points to a promising direction for research on evaluating AI-generated content, one that emphasizes adaptability and multi-faceted assessment. Future work could refine the transformation rules and add further dimensions to the unified evaluator. Extending the approach to other languages and to more complex NLG tasks could also broaden its applicability.

The paper contributes a unified framework to the NLG evaluation landscape and makes the case for comprehensive evaluation beyond similarity-based metrics, a direction that could support more trustworthy language generation systems.

Authors (9)

Ming Zhong
Yang Liu
Da Yin
Yuning Mao
Yizhu Jiao
Pengfei Liu
Chenguang Zhu
Heng Ji
Jiawei Han

GitHub

GitHub - maszhongming/UniEval: Repository for EMNLP 2022 Paper: Towards a Unified Multi-Dimensional Evaluator for Text Generation (192 stars)