BLEURT: Learning Robust Metrics for Text Generation
Overview
The paper "BLEURT: Learning Robust Metrics for Text Generation" introduces BLEURT, a learned evaluation metric based on BERT, designed to better correlate with human judgments in natural language generation (NLG) tasks. Evaluation metrics play a critical role in NLG by providing a proxy for quality assessment that can be computed efficiently and at low cost. Traditional metrics such as BLEU and ROUGE, while popular, have significant limitations, particularly in their inability to accommodate semantic and syntactic variations beyond mere lexical overlap. BLEURT aims to address these gaps by leveraging both pre-training on synthetic data and fine-tuning on human ratings.
Methodology
Pre-Training and Fine-Tuning
The authors adopt a two-stage training process for BLEURT:
- Pre-Training: BLEURT is pre-trained on synthetic data generated from Wikipedia sentences through methods such as mask-filling with BERT, back-translation, and word dropout. This step is crucial for enriching the model's capacity to handle various lexical and semantic variations.
- Fine-Tuning: After pre-training, BLEURT is fine-tuned on a smaller set of human-rated data for the target text generation task, aligning its predictions more closely with human assessments (a minimal sketch of this fine-tuning setup follows this list).
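The sketch below illustrates the kind of regression setup this fine-tuning stage implies: a BERT encoder reads the (reference, candidate) pair jointly, and a linear head on the [CLS] vector predicts a scalar quality score trained with MSE against human ratings. The HuggingFace model names, hyperparameters, and toy data are illustrative assumptions, not the released BLEURT code.

```python
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel

# Hypothetical checkpoint name; BLEURT builds on BERT-style encoders.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
rating_head = nn.Linear(encoder.config.hidden_size, 1)  # predicts a scalar quality score
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(rating_head.parameters()), lr=2e-5)

def predict_score(references, candidates):
    # Encode (reference, candidate) pairs jointly, as BLEURT does.
    batch = tokenizer(references, candidates, padding=True,
                      truncation=True, return_tensors="pt")
    cls = encoder(**batch).last_hidden_state[:, 0]  # [CLS] representation
    return rating_head(cls).squeeze(-1)

# One illustrative gradient step on a tiny batch of human-rated pairs.
refs = ["the cat sat on the mat"]
cands = ["a cat was sitting on the mat"]
human_ratings = torch.tensor([0.8])

loss = nn.functional.mse_loss(predict_score(refs, cands), human_ratings)
loss.backward()
optimizer.step()
```

The same architecture is used in both stages; only the supervision changes, from synthetic targets during pre-training to human ratings during fine-tuning.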
Pre-Training Tasks and Signals
The pre-training phase utilizes multiple supervision signals (a sketch of computing some of these targets for a synthetic pair follows the list):
- Automatic Metrics: BLEU, ROUGE, and BERTscore to measure string overlap and semantic similarity.
- Backtranslation Likelihood: Measures the probability that one sentence is a backtranslation of another, focusing on maintaining semantic content.
- Textual Entailment: Classifies sentence pairs into entailment, contradiction, or neutral categories.
- Backtranslation Flag: Indicates whether the perturbations were generated with backtranslation or mask-filling.
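As a concrete illustration of what these targets look like, the snippet below computes simplified stand-ins for two of them on a synthetic (original, perturbed) pair: unigram precision as a BLEU-like signal, unigram recall as a ROUGE-like signal, plus the backtranslation flag. The example sentences and the stand-in formulas are purely illustrative; the paper uses the actual BLEU, ROUGE, and BERTscore implementations together with separate translation and entailment models.

```python
from collections import Counter

def unigram_overlap(reference: str, candidate: str):
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)  # BLEU-like signal
    recall = overlap / max(sum(ref_counts.values()), 1)      # ROUGE-like signal
    return precision, recall

original = "The festival takes place every summer in Berlin."
perturbed = "The festival is held each summer in Berlin."  # e.g. produced by backtranslation

precision, recall = unigram_overlap(original, perturbed)
targets = {
    "bleu_like": precision,
    "rouge_like": recall,
    "backtranslation_flag": 1.0,  # 1 if the pair came from backtranslation, 0 for mask-filling
}
print(targets)
```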
These signals are combined in a multi-task learning setup, with each task contributing its own loss term, to enhance the robustness and generalizability of BLEURT.
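A minimal sketch of such a multi-task objective is given below, assuming regression losses for the numeric targets and a 3-way classification loss for entailment, summed with per-task weights. The task names and uniform weights are assumptions for illustration, not the paper's exact configuration.

```python
import torch
from torch import nn

def pretraining_loss(predictions: dict, targets: dict) -> torch.Tensor:
    # Uniform weights are a placeholder; in practice the task weighting is tuned.
    task_weights = {"bleu": 1.0, "rouge": 1.0, "bertscore": 1.0,
                    "backtrans_likelihood": 1.0, "entailment": 1.0, "backtrans_flag": 1.0}
    total = torch.zeros(())
    for task, weight in task_weights.items():
        if task == "entailment":
            # 3-way classification: entailment / contradiction / neutral.
            task_loss = nn.functional.cross_entropy(predictions[task], targets[task])
        else:
            # Scalar regression targets (overlap scores, likelihoods, flags).
            task_loss = nn.functional.mse_loss(predictions[task], targets[task])
        total = total + weight * task_loss
    return total
```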
Experimental Evaluation
Performance on WMT Metrics Shared Task
BLEURT was evaluated on datasets from the WMT Metrics Shared Task from 2017 to 2019 and benchmarked against other state-of-the-art metrics, including sentence-level BLEU, METEOR, and BERTscore. BLEURT consistently outperformed these baselines, demonstrating strong correlations with human judgments (a small sketch of how such correlations are computed follows the list). The detailed results are:
- WMT 2017: BLEURT achieved the highest Kendall Tau and Pearson correlation across several language pairs.
- WMT 2018 and 2019: BLEURT showed similar gains in these years, even with the noisier ratings and the direct-assessment-based evaluation used by the organizers.
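The agreement figures reported in these evaluations are correlations between metric scores and human judgments over the same segments. The snippet below shows how such correlations can be computed; the scores are made-up numbers purely for illustration.

```python
from scipy.stats import kendalltau, pearsonr

# Hypothetical metric and human scores for five candidate segments.
metric_scores = [0.42, 0.77, 0.31, 0.88, 0.55]
human_scores = [0.40, 0.80, 0.35, 0.90, 0.50]

tau, _ = kendalltau(metric_scores, human_scores)
r, _ = pearsonr(metric_scores, human_scores)
print(f"Kendall Tau: {tau:.3f}, Pearson r: {r:.3f}")
```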
Robustness to Quality Drift
The authors conducted robustness tests to simulate quality drifts in training and test distributions. BLEURT's performance remained stable relative to traditional metrics, even as the discrepancy between training and test distributions increased. This underscores the efficacy of pre-training in enhancing the metric's robustness.
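One way such a drift can be simulated, sketched below under the assumption of records carrying standardized human ratings, is to split the data by rating so that training skews toward lower-quality candidates and testing toward higher-quality ones; the data and threshold here are hypothetical.

```python
def quality_drift_split(records, threshold):
    """records: list of (reference, candidate, human_rating) tuples."""
    train = [r for r in records if r[2] < threshold]   # lower-rated pairs for training
    test = [r for r in records if r[2] >= threshold]   # higher-rated pairs for testing
    return train, test

# Toy ratings; moving the threshold makes the train/test rating
# distributions more or less skewed, i.e. a larger or smaller drift.
records = [("ref a", "cand a", -0.5), ("ref b", "cand b", 0.3), ("ref c", "cand c", 0.9)]
train_set, test_set = quality_drift_split(records, threshold=0.0)
```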
WebNLG Dataset Performance
The ability of BLEURT to generalize to different NLG tasks was tested using the WebNLG 2017 Challenge dataset. BLEURT exhibited rapid adaptability with limited training data, outperforming conventional metrics like BLEU and METEOR, especially when pre-training steps were included.
Ablation Studies
The ablation studies provided insights into the contributions of the individual pre-training tasks. High-quality signals such as BERTscore and entailment consistently improved performance, while signals less correlated with human judgment (e.g., BLEU) were less beneficial.
Implications and Future Work
The implications of BLEURT are twofold:
- Practical: BLEURT offers a more accurate and robust automatic metric for evaluating text generation systems, potentially accelerating development cycles by reducing reliance on expensive and time-consuming human evaluations.
- Theoretical: The model’s success in leveraging synthetic data for pre-training highlights the potential for scalable, generalizable pre-training methods in other NLG contexts.
Future work may focus on extending BLEURT's capabilities to multilingual evaluation scenarios and exploring hybrid evaluation frameworks that incorporate both automatic metrics and human judgments.
Conclusion
BLEURT represents a significant advancement in automatic evaluation metrics for text generation, combining the strengths of BERT with innovative pre-training techniques. Its ability to generalize across domains and maintain robustness under quality drifts positions it as a valuable tool for advancing NLG research and applications.