
BARTScore: Evaluating Generated Text as Text Generation (2106.11520v2)

Published 22 Jun 2021 in cs.CL

Abstract: A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate whether such generated texts are actually fluent, accurate, or effective. In this work, we conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models. The general idea is that models trained to convert the generated text to/from a reference output or the source text will achieve higher scores when the generated text is better. We operationalize this idea using BART, an encoder-decoder based pre-trained model, and propose a metric BARTScore with a number of variants that can be flexibly applied in an unsupervised fashion to evaluation of text from different perspectives (e.g. informativeness, fluency, or factuality). BARTScore is conceptually simple and empirically effective. It can outperform existing top-scoring metrics in 16 of 22 test settings, covering evaluation of 16 datasets (e.g., machine translation, text summarization) and 7 different perspectives (e.g., informativeness, factuality). Code to calculate BARTScore is available at https://github.com/neulab/BARTScore, and we have released an interactive leaderboard for meta-evaluation at http://explainaboard.nlpedia.ai/leaderboard/task-meval/ on the ExplainaBoard platform, which allows us to interactively understand the strengths, weaknesses, and complementarity of each metric.

BARTScore: A Novel Approach for Evaluating Text Generation Quality Using Pre-Trained Models

Overview

In NLP, assessing the quality of text generated by models for tasks such as machine translation, text summarization, or dialog remains a crucial yet challenging problem. A new metric, BARTScore, tackles this challenge by framing the evaluation of generated text as a text generation problem itself. Built on BART (Bidirectional and Auto-Regressive Transformers), a pre-trained sequence-to-sequence model, the metric scores generated text by the probability of generating it from, or of it generating, the reference output or source text. Its flexibility and effectiveness, demonstrated across a wide range of datasets and evaluation perspectives, make it a notable contribution to text generation evaluation.
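
Concretely, for a conditioning text x and a target text y = (y_1, ..., y_m), the score is the weighted log-likelihood that the frozen model assigns to y given x (with uniform token weights in the simplest setting):

    BARTScore = \sum_{t=1}^{m} \omega_t \log p(y_t \mid y_{<t}, x, \theta)

where \theta denotes the pre-trained BART parameters and \omega_t are token-level weights.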

Methodology

BARTScore operates on a simple yet powerful premise: a pre-trained sequence-to-sequence model will assign higher probability to converting a high-quality generated text to or from its reference or source text than to a low-quality one. This approach uses BART as released, relying only on its pre-trained parameters and requiring no fine-tuning on human judgment data.
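
In practice, the core computation is a forced-decoding pass that accumulates the model's log-probabilities over the target tokens. The following is a minimal sketch of that idea, not the authors' released implementation (which is at https://github.com/neulab/BARTScore); it assumes the Hugging Face transformers library, and the facebook/bart-large-cnn checkpoint is used purely for illustration.

```python
# Minimal sketch: score a target text by the average log-probability
# a pre-trained BART assigns to it when conditioned on another text.
# Not the official BARTScore implementation.
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
model.eval()

def bart_log_likelihood(src: str, tgt: str) -> float:
    """Average per-token log p(tgt | src) under the pre-trained model."""
    src_ids = tokenizer(src, return_tensors="pt", truncation=True).input_ids
    tgt_ids = tokenizer(tgt, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        # With labels supplied, the model returns the mean token-level
        # cross-entropy over the target; its negation is the average
        # log-likelihood of the target given the source.
        out = model(input_ids=src_ids, labels=tgt_ids)
    return -out.loss.item()

source = "The quick brown fox jumped over the lazy dog near the riverbank."
summary = "A fox jumped over a dog."
print(bart_log_likelihood(source, summary))  # less negative = more plausible
```

Higher (less negative) values indicate that the model finds the target text more plausible given the conditioning text.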

BARTScore comes in several variants that cater to different evaluation perspectives, such as informativeness, coherence, and factuality. The variants are obtained simply by changing which texts play the roles of input and output in the conditional generation problem, for example scoring the hypothesis given the source, the hypothesis given the reference, or the reference given the hypothesis. BARTScore's utility is further extended through short textual prompts that bring the evaluation task closer to the model's pre-training objective. Because it requires no training on human judgments, the metric can be applied in an unsupervised fashion across tasks, while the underlying BART model can optionally be further fine-tuned on downstream generation data (e.g., summarization or paraphrase data) before scoring.
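
Building on the likelihood sketch above, the directional variants can be illustrated as follows. The faithfulness/precision/recall/F-score naming follows the paper's framing, while the arithmetic average used for the F-score and the function name are illustrative assumptions rather than the exact released implementation.

```python
# Directional variants: each evaluation perspective corresponds to a
# different choice of conditioning text and target text.
# Reuses bart_log_likelihood() from the sketch above.
def bartscore_variants(source: str, hypothesis: str, reference: str) -> dict:
    s_to_h = bart_log_likelihood(source, hypothesis)     # faithfulness: source -> hypothesis
    r_to_h = bart_log_likelihood(reference, hypothesis)  # precision: reference -> hypothesis
    h_to_r = bart_log_likelihood(hypothesis, reference)  # recall: hypothesis -> reference
    return {
        "faithfulness": s_to_h,
        "precision": r_to_h,
        "recall": h_to_r,
        "f_score": (r_to_h + h_to_r) / 2.0,  # simple average of the two directions
    }
```

The prompting idea can be layered on top by concatenating a short phrase to the conditioning or target text before scoring (e.g., a phrase like "in other words,"; this particular phrase is illustrative and not necessarily one of the prompts selected in the paper).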

Empirical Evaluation

The effectiveness of BARTScore was rigorously tested across 16 datasets, considering seven distinct evaluation perspectives. Empirical results indicate that BARTScore outperforms current leading metrics in 16 out of 22 test settings. For instance, in machine translation evaluation on the German-English language pair, incorporating simple textual prompts gave BARTScore a marked improvement in correlation with human judgments. These findings underscore BARTScore's potential to provide a nuanced and comprehensive evaluation of generated text from multiple dimensions.

Theoretical Implications

The introduction and success of BARTScore raise several theoretical considerations regarding the methodology of text generation evaluation. Firstly, the approach of viewing evaluation itself as a generation problem represents a paradigm shift, suggesting that understanding the quality of generated text can be intrinsically linked to the models used to generate such text in the first place. Secondly, the reliance on pre-trained models like BART underscores the emerging consensus around the utility of such models across various NLP tasks, including evaluation metrics. Lastly, the capacity of BARTScore to evaluate text from diverse perspectives without explicit reliance on human judgment data further challenges the existing frameworks of text evaluation, potentially reducing the resources required for comprehensive evaluation.

Practical Implications and Future Directions

From a practical standpoint, BARTScore offers an efficient and versatile tool for developers and researchers involved in text generation tasks. By providing a robust mechanism for automatic evaluation, it could significantly streamline the development process, enabling quicker iterations and refinements of models. Additionally, the success of BARTScore opens avenues for exploration in leveraging other pre-trained models for evaluation purposes, potentially leading to a suite of similarly effective metrics.

Moreover, the application of textual prompts within BARTScore presents an exciting area for future research, particularly in discovering optimal prompts for different text generation tasks and languages. This could lead to further advancements in both the precision and applicability of automatic evaluation metrics.

In conclusion, BARTScore represents a significant stride toward a more nuanced and efficient evaluation of text generation. By harnessing the capabilities of pre-trained models like BART, it offers a flexible and effective tool for assessing the quality of generated text across a multitude of perspectives, setting a new standard for future developments in the field of NLP.

Authors (3)
  1. Weizhe Yuan (25 papers)
  2. Graham Neubig (342 papers)
  3. Pengfei Liu (191 papers)
Citations (714)