BARTScore: Evaluating Generated Text as Text Generation

Published 22 Jun 2021 in cs.CL (arXiv:2106.11520v2)

Abstract: A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate whether such generated texts are actually fluent, accurate, or effective. In this work, we conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models. The general idea is that models trained to convert the generated text to/from a reference output or the source text will achieve higher scores when the generated text is better. We operationalize this idea using BART, an encoder-decoder based pre-trained model, and propose a metric BARTScore with a number of variants that can be flexibly applied in an unsupervised fashion to evaluation of text from different perspectives (e.g. informativeness, fluency, or factuality). BARTScore is conceptually simple and empirically effective. It can outperform existing top-scoring metrics in 16 of 22 test settings, covering evaluation of 16 datasets (e.g., machine translation, text summarization) and 7 different perspectives (e.g., informativeness, factuality). Code to calculate BARTScore is available at https://github.com/neulab/BARTScore, and we have released an interactive leaderboard for meta-evaluation at http://explainaboard.nlpedia.ai/leaderboard/task-meval/ on the ExplainaBoard platform, which allows us to interactively understand the strengths, weaknesses, and complementarity of each metric.

Citations (714)

Summary

  • The paper introduces BARTScore, a novel metric that treats text evaluation as a text generation task using the pre-trained BART model.
  • It leverages textual prompts and variant inputs to measure informativeness, coherence, and factuality across 16 datasets.
  • Empirical results show BARTScore outperforms state-of-the-art metrics in 16 out of 22 settings, highlighting its practical potential in NLP.

BARTScore: A Novel Approach for Evaluating Text Generation Quality Using Pre-Trained Models

Overview

In NLP, assessing the quality of text generated for tasks such as machine translation, text summarization, or dialog remains crucial yet challenging. BARTScore tackles this challenge by conceptualizing the evaluation of generated text as a text generation problem in its own right. Built on BART (Bidirectional and Auto-Regressive Transformers), a pre-trained sequence-to-sequence model, the metric scores a generated text by the probability of generating it from, or regenerating from it, the reference output or source text. Its demonstrated flexibility and effectiveness across datasets and evaluation perspectives mark it as a significant contribution to text generation evaluation.

Methodology

BARTScore operates on a simple yet powerful premise: if a generated text can be produced with high probability from its corresponding reference or source text by a pre-trained sequence-to-sequence model, it is considered high-quality. This approach uses BART's pre-trained parameters directly, with no fine-tuning on human judgment data.
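Concretely, the score of a generated text y = (y_1, ..., y_m), conditioned on a text x (the source or a reference, depending on the variant), is its weighted log-likelihood under the seq2seq model with parameters θ:

```latex
\mathrm{BARTScore} = \sum_{t=1}^{m} \omega_t \log p(y_t \mid y_{<t}, x; \theta)
```

where the token weights ω_t are uniform in the default setting, so the score reduces to the average token log-probability of generating y from x.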

BARTScore offers several variants catering to different evaluation perspectives, such as informativeness, coherence, and factuality. These are obtained by varying which texts serve as the input and output of the conditional generation problem, for example scoring the hypothesis given the source, or a reference given the hypothesis. Its utility is further extended through textual prompts that bring the evaluation task closer to the pre-trained model's original training objective. These variants can be applied in an unsupervised fashion across a range of tasks.
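The directional variants described above can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: `sequence_log_prob` is a hypothetical stand-in for the model's log-likelihood log p(target | source), which in practice would query BART (e.g., via a seq2seq library); here a toy character-overlap proxy is used purely so the sketch runs end to end.

```python
import math

def sequence_log_prob(source: str, target: str) -> float:
    """Hypothetical stand-in for the seq2seq model's log p(target | source).
    A real BARTScore would sum BART's token log-probabilities; this toy
    proxy just returns log of the fraction of target characters that also
    appear in the source, so higher overlap means a higher score."""
    if not target:
        return float("-inf")
    src_chars = set(source)
    hits = sum(1 for c in target if c in src_chars)
    # Clamp away from zero so log() is always defined.
    p = max(hits / len(target), 1e-6)
    return math.log(p)

def bartscore_faithfulness(source: str, hypothesis: str) -> float:
    # s -> h: how likely is the hypothesis given the source document?
    return sequence_log_prob(source, hypothesis)

def bartscore_precision(reference: str, hypothesis: str) -> float:
    # r -> h: generate the hypothesis from the reference.
    return sequence_log_prob(reference, hypothesis)

def bartscore_recall(reference: str, hypothesis: str) -> float:
    # h -> r: generate the reference from the hypothesis.
    return sequence_log_prob(hypothesis, reference)

def bartscore_f(reference: str, hypothesis: str) -> float:
    # F variant: average the two directions.
    return 0.5 * (bartscore_precision(reference, hypothesis)
                  + bartscore_recall(reference, hypothesis))
```

Swapping which text conditions and which is generated is the entire mechanism behind the different perspectives; only `sequence_log_prob` changes when a real pre-trained model is plugged in.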

Empirical Evaluation

The effectiveness of BARTScore was tested across 16 datasets covering seven distinct evaluation perspectives. Empirical results indicate that BARTScore outperforms existing top-scoring metrics in 16 of 22 test settings. In German-English machine translation evaluation, for instance, incorporating simple textual prompts markedly improved its correlation with human judgments. These findings underscore BARTScore's ability to evaluate generated text along multiple dimensions.

Theoretical Implications

The introduction and success of BARTScore raise several theoretical considerations regarding the methodology of text generation evaluation. Firstly, the approach of viewing evaluation itself as a generation problem represents a paradigm shift, suggesting that understanding the quality of generated text can be intrinsically linked to the models used to generate such text in the first place. Secondly, the reliance on pre-trained models like BART underscores the emerging consensus around the utility of such models across various NLP tasks, including evaluation metrics. Lastly, the capacity of BARTScore to evaluate text from diverse perspectives without explicit reliance on human judgment data further challenges the existing frameworks of text evaluation, potentially reducing the resources required for comprehensive evaluation.

Practical Implications and Future Directions

From a practical standpoint, BARTScore offers an efficient and versatile tool for developers and researchers involved in text generation tasks. By providing a robust mechanism for automatic evaluation, it could significantly streamline the development process, enabling quicker iterations and refinements of models. Additionally, the success of BARTScore opens avenues for exploration in leveraging other pre-trained models for evaluation purposes, potentially leading to a suite of similarly effective metrics.

Moreover, the application of textual prompts within BARTScore presents an exciting area for future research, particularly in discovering optimal prompts for different text generation tasks and languages. This could lead to further advancements in both the precision and applicability of automatic evaluation metrics.

In conclusion, BARTScore represents a significant stride toward a more nuanced and efficient evaluation of text generation. By harnessing the capabilities of pre-trained models like BART, it offers a flexible and effective tool for assessing the quality of generated text across a multitude of perspectives, setting a new standard for future developments in the field of NLP.
