BLEURT: Learning Robust Metrics for Text Generation
Overview
The paper "BLEURT: Learning Robust Metrics for Text Generation" introduces BLEURT, a learned evaluation metric based on BERT, designed to better correlate with human judgments in natural language generation (NLG) tasks. Evaluation metrics play a critical role in NLG by providing a proxy for quality assessment that can be computed efficiently and at low cost. Traditional metrics such as BLEU and ROUGE, while popular, have significant limitations, particularly in their inability to accommodate semantic and syntactic variations beyond mere lexical overlap. BLEURT aims to address these gaps by leveraging both pre-training on synthetic data and fine-tuning on human ratings.
Methodology
Pre-Training and Fine-Tuning
The authors adopt a two-stage training process for BLEURT:
- Pre-Training: BLEURT is pre-trained on synthetic data generated from Wikipedia sentences through methods such as mask-filling with BERT, back-translation, and word dropout. This step is crucial for enriching the model's capacity to handle various lexical and semantic variations.
- Fine-Tuning: After pre-training, BLEURT is fine-tuned on a smaller set of human-rated data for the target text generation task, aligning its predictions more closely with human assessments (a minimal sketch of this fine-tuning setup follows this list).
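The sketch below illustrates the kind of regression setup this fine-tuning stage implies: a BERT encoder reads the (reference, candidate) pair jointly, and a linear head on the [CLS] vector predicts a scalar quality score trained with MSE against human ratings. The HuggingFace model names, hyperparameters, and toy data are illustrative assumptions, not the released BLEURT code.

```python
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel

# Hypothetical checkpoint name; BLEURT builds on BERT-style encoders.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
rating_head = nn.Linear(encoder.config.hidden_size, 1)  # predicts a scalar quality score
optimizer = torch.optim.AdamW(
    list(encoder.parameters()) + list(rating_head.parameters()), lr=2e-5)

def predict_score(references, candidates):
    # Encode (reference, candidate) pairs jointly, as BLEURT does.
    batch = tokenizer(references, candidates, padding=True,
                      truncation=True, return_tensors="pt")
    cls = encoder(**batch).last_hidden_state[:, 0]  # [CLS] representation
    return rating_head(cls).squeeze(-1)

# One illustrative gradient step on a tiny batch of human-rated pairs.
refs = ["the cat sat on the mat"]
cands = ["a cat was sitting on the mat"]
human_ratings = torch.tensor([0.8])

loss = nn.functional.mse_loss(predict_score(refs, cands), human_ratings)
loss.backward()
optimizer.step()
```

The same architecture is used in both stages; only the supervision changes, from synthetic targets during pre-training to human ratings during fine-tuning.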
Pre-Training Tasks and Signals
The pre-training phase utilizes multiple supervision signals (a sketch of computing some of these targets for a synthetic pair follows the list):
- Automatic Metrics: BLEU, ROUGE, and BERTscore to measure string overlap and semantic similarity.
- Backtranslation Likelihood: Measures the probability that one sentence is a backtranslation of another, focusing on maintaining semantic content.
- Textual Entailment: Classifies sentence pairs into entailment, contradiction, or neutral categories.
- Backtranslation Flag: Indicates whether the perturbations were generated with backtranslation or mask-filling.
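As a concrete illustration of what these targets look like, the snippet below computes simplified stand-ins for two of them on a synthetic (original, perturbed) pair: unigram precision as a BLEU-like signal, unigram recall as a ROUGE-like signal, plus the backtranslation flag. The example sentences and the stand-in formulas are purely illustrative; the paper uses the actual BLEU, ROUGE, and BERTscore implementations together with separate translation and entailment models.

```python
from collections import Counter

def unigram_overlap(reference: str, candidate: str):
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)  # BLEU-like signal
    recall = overlap / max(sum(ref_counts.values()), 1)      # ROUGE-like signal
    return precision, recall

original = "The festival takes place every summer in Berlin."
perturbed = "The festival is held each summer in Berlin."  # e.g. produced by backtranslation

precision, recall = unigram_overlap(original, perturbed)
targets = {
    "bleu_like": precision,
    "rouge_like": recall,
    "backtranslation_flag": 1.0,  # 1 if the pair came from backtranslation, 0 for mask-filling
}
print(targets)
```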
These signals are combined in a multi-task learning setup, with each task contributing its own loss term, to enhance the robustness and generalizability of BLEURT.
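A minimal sketch of such a multi-task objective is given below, assuming regression losses for the numeric targets and a 3-way classification loss for entailment, summed with per-task weights. The task names and uniform weights are assumptions for illustration, not the paper's exact configuration.

```python
import torch
from torch import nn

def pretraining_loss(predictions: dict, targets: dict) -> torch.Tensor:
    # Uniform weights are a placeholder; in practice the task weighting is tuned.
    task_weights = {"bleu": 1.0, "rouge": 1.0, "bertscore": 1.0,
                    "backtrans_likelihood": 1.0, "entailment": 1.0, "backtrans_flag": 1.0}
    total = torch.zeros(())
    for task, weight in task_weights.items():
        if task == "entailment":
            # 3-way classification: entailment / contradiction / neutral.
            task_loss = nn.functional.cross_entropy(predictions[task], targets[task])
        else:
            # Scalar regression targets (overlap scores, likelihoods, flags).
            task_loss = nn.functional.mse_loss(predictions[task], targets[task])
        total = total + weight * task_loss
    return total
```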
Experimental Evaluation
Performance on WMT Metrics Shared Task
BLEURT was evaluated on datasets from the WMT Metrics Shared Task from 2017 to 2019 and benchmarked against other state-of-the-art metrics, including sentence-level BLEU, METEOR, and BERTscore. BLEURT consistently outperformed these baselines, demonstrating strong correlations with human judgments (a small sketch of how such correlations are computed follows the list). The detailed results are:
- WMT 2017: BLEURT achieved the highest Kendall Tau and Pearson correlation across several language pairs.
- WMT 2018 and 2019: BLEURT showed similar gains in these years, even with the noisier ratings and the direct-assessment-based evaluation used by the organizers.
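The agreement figures reported in these evaluations are correlations between metric scores and human judgments over the same segments. The snippet below shows how such correlations can be computed; the scores are made-up numbers purely for illustration.

```python
from scipy.stats import kendalltau, pearsonr

# Hypothetical metric and human scores for five candidate segments.
metric_scores = [0.42, 0.77, 0.31, 0.88, 0.55]
human_scores = [0.40, 0.80, 0.35, 0.90, 0.50]

tau, _ = kendalltau(metric_scores, human_scores)
r, _ = pearsonr(metric_scores, human_scores)
print(f"Kendall Tau: {tau:.3f}, Pearson r: {r:.3f}")
```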
Robustness to Quality Drift
The authors conducted robustness tests to simulate quality drifts in training and test distributions. BLEURT's performance remained stable relative to traditional metrics, even as the discrepancy between training and test distributions increased. This underscores the efficacy of pre-training in enhancing the metric's robustness.
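One way such a drift can be simulated, sketched below under the assumption of records carrying standardized human ratings, is to split the data by rating so that training skews toward lower-quality candidates and testing toward higher-quality ones; the data and threshold here are hypothetical.

```python
def quality_drift_split(records, threshold):
    """records: list of (reference, candidate, human_rating) tuples."""
    train = [r for r in records if r[2] < threshold]   # lower-rated pairs for training
    test = [r for r in records if r[2] >= threshold]   # higher-rated pairs for testing
    return train, test

# Toy ratings; moving the threshold makes the train/test rating
# distributions more or less skewed, i.e. a larger or smaller drift.
records = [("ref a", "cand a", -0.5), ("ref b", "cand b", 0.3), ("ref c", "cand c", 0.9)]
train_set, test_set = quality_drift_split(records, threshold=0.0)
```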
WebNLG Dataset Performance
The ability of BLEURT to generalize to different NLG tasks was tested using the WebNLG 2017 Challenge dataset. BLEURT exhibited rapid adaptability with limited training data, outperforming conventional metrics like BLEU and METEOR, especially when pre-training steps were included.
Ablation Studies
The ablation studies provided insights into the contributions of the individual pre-training tasks. High-quality signals such as BERTscore and entailment consistently improved performance, while signals less correlated with human judgment (e.g., BLEU) were less beneficial.
Implications and Future Work
The implications of BLEURT are twofold:
- Practical: BLEURT offers a more accurate and robust automatic metric for evaluating text generation systems, potentially accelerating development cycles by reducing reliance on expensive and time-consuming human evaluations.
- Theoretical: The model’s success in leveraging synthetic data for pre-training highlights the potential for scalable, generalizable pre-training methods in other NLG contexts.
Future work may focus on extending BLEURT's capabilities to multilingual evaluation scenarios and exploring hybrid evaluation frameworks that incorporate both automatic metrics and human judgments.
Conclusion
BLEURT represents a significant advancement in automatic evaluation metrics for text generation, combining the strengths of BERT with innovative pre-training techniques. Its ability to generalize across domains and maintain robustness under quality drifts positions it as a valuable tool for advancing NLG research and applications.