Insights into Automatic Machine Translation Evaluation: Reference Quality Matters
In the paper "BLEU might be Guilty but References are not Innocent," the authors critically examine how machine translation (MT) quality is evaluated, focusing on the interplay between automatic metrics and the reference translations they rely on. Their central hypothesis is that conventionally collected references bias automatic metrics such as BLEU toward outputs that mimic translationese rather than natural, fluent translations. The paper therefore investigates alternative ways of preparing references, aiming to increase their diversity and reduce translationese artifacts.
Highlights and Methodology
The authors argue that the quality of reference translations is central to reliable automatic evaluation. They show that references produced with conventional human translation workflows exhibit low diversity and strong translationese characteristics, which biases automatic metrics and unfairly penalizes systems that produce fluent, adequate, but less literal output. To address this, they propose a paraphrasing approach in which linguists rewrite existing reference translations as diverse paraphrases that preserve the original meaning.
The methodology involves collecting several types of references for the WMT 2019 English-German news translation task. Alongside the original WMT reference, the authors commissioned an additional independent human translation and asked linguists to paraphrase the references as much as possible while preserving their meaning. This setup allowed them to measure how each reference type affects metric reliability, in particular the correlation between automatic metric scores and human assessments.
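As a rough illustration of that kind of analysis, the sketch below computes system-level Pearson correlation between metric scores and human ratings under two reference conditions. The numbers are invented placeholders for illustration only, not data from the paper.

```python
# Minimal sketch (assumed setup, not the authors' code): compare how well an
# automatic metric tracks human judgments when scored against different
# reference sets, using system-level Pearson correlation.
from scipy.stats import pearsonr

# Hypothetical scores for five MT systems; real experiments would use the
# WMT 2019 en-de submissions and the corresponding human ratings.
bleu_standard_ref = [42.1, 40.3, 39.8, 38.5, 35.2]    # BLEU vs. standard reference
bleu_paraphrased_ref = [14.2, 13.1, 12.4, 11.0, 9.8]  # BLEU vs. paraphrased reference
human_scores = [0.71, 0.69, 0.62, 0.58, 0.44]         # human judgments (hypothetical)

for name, scores in [("standard", bleu_standard_ref), ("paraphrased", bleu_paraphrased_ref)]:
    r, p = pearsonr(scores, human_scores)
    print(f"{name:>11} reference: Pearson r = {r:.3f} (p = {p:.3f})")
```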
Numerical Results and Observations
The paper presents empirical evidence that paraphrased references yield higher correlation with human judgments across several automatic metrics, including BLEU, BERTScore, and YiSi. The gains are most pronounced when comparing the strongest system submissions, suggesting that paraphrased references give a more accurate picture in competitive evaluation scenarios.
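To make the setup concrete, here is a minimal sketch of scoring one system against a standard versus a paraphrased reference set with sacrebleu. The file names are hypothetical placeholders, not artifacts released with the paper.

```python
# Sketch: compute corpus BLEU for one system output against two different
# reference sets. File names below are assumed for illustration.
import sacrebleu

def read_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]

hypotheses = read_lines("system_output.de")            # one MT system's output
standard_refs = read_lines("newstest2019.ref.de")      # conventional reference
paraphrased_refs = read_lines("newstest2019.para.de")  # paraphrased reference

bleu_std = sacrebleu.corpus_bleu(hypotheses, [standard_refs])
bleu_para = sacrebleu.corpus_bleu(hypotheses, [paraphrased_refs])
print(f"BLEU vs. standard reference:    {bleu_std.score:.1f}")
print(f"BLEU vs. paraphrased reference: {bleu_para.score:.1f}")
```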
The authors also revisited multi-reference BLEU, comparing traditional multi-reference sets with new combinations built from the higher-quality alternatives. Contrary to common assumptions, multi-reference BLEU did not correlate better with human evaluation than single-reference BLEU when the references shared similar translationese characteristics.
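For reference, this is one way a multi-reference BLEU score can be computed with sacrebleu, passing each human reference as its own stream; the sentences are toy examples, not WMT data.

```python
# Sketch of single- vs. multi-reference BLEU: with multiple reference streams,
# an n-gram in the hypothesis can match any of the references.
import sacrebleu

hypotheses = ["The cat sat on the mat.", "He went home early."]
ref_a = ["The cat sat on the mat.", "He went home early."]          # first human reference
ref_b = ["A cat was sitting on the mat.", "He headed home early."]  # second human reference

single_ref = sacrebleu.corpus_bleu(hypotheses, [ref_a])
multi_ref = sacrebleu.corpus_bleu(hypotheses, [ref_a, ref_b])
print(f"single-reference BLEU: {single_ref.score:.1f}")
print(f"multi-reference BLEU:  {multi_ref.score:.1f}")
```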
Implications and Future Directions
The findings hold significant implications for the development and evaluation of MT systems. First, they suggest that automatic metrics scored against translationese-style references can systematically undervalue techniques known to produce more natural output, such as back-translation and automatic post-editing (APE) augmentation. The research therefore underscores the need to rethink evaluation setups so that they rely on more diverse, less translationese reference sets.
Given these insights, future work could focus on refining paraphrasing methodologies to further increase the fidelity of automatic evaluation. And while paraphrased references improve correlation with human judgments, the added difficulty they introduce for human raters warrants further investigation into how rating protocols might be adapted to keep assessments reliable.
Ultimately, the paper advocates the adoption of more diverse reference translations, and of paraphrased references in particular, for assessing MT outputs. By releasing their paraphrased reference sets, the authors invite the broader research community to revisit earlier reliability claims and to recalibrate evaluation practices so that naturally fluent output is rewarded rather than penalized.