BLEU might be Guilty but References are not Innocent (2004.06063v2)

Published 13 Apr 2020 in cs.CL, cs.AI, and cs.LG

Abstract: The quality of automatic metrics for machine translation has been increasingly called into question, especially for high-quality systems. This paper demonstrates that, while choice of metric is important, the nature of the references is also critical. We study different methods to collect references and compare their value in automated evaluation by reporting correlation with human evaluation for a variety of systems and metrics. Motivated by the finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a paraphrasing task for linguists to perform on existing reference translations, which counteracts this bias. Our method yields higher correlation with human judgment not only for the submissions of WMT 2019 English to German, but also for Back-translation and APE augmented MT output, which have been shown to have low correlation with automatic metrics using standard references. We demonstrate that our methodology improves correlation with all modern evaluation metrics we look at, including embedding-based methods. To complete this picture, we reveal that multi-reference BLEU does not improve the correlation for high quality output, and present an alternative multi-reference formulation that is more effective.

Insights into Automatic Machine Translation Evaluation: Reference Quality Matters

In the paper "BLEU might be Guilty but References are not Innocent," the authors critically examine the evaluation methodologies utilized in assessing machine translation (MT) quality, specifically focusing on the interplay between automatic metrics and the reference translations they rely on. The central hypothesis they explore is that traditional reference collections may inadvertently bias automatic metrics like BLEU, particularly favoring outputs that mimic the style of translationese. Consequently, the paper investigates alternative reference preparation techniques, with a focus on enhancing their diversity and decreasing translationese artifacts.

Highlights and Methodology

The authors argue that the quality of reference translations is paramount for reliable automatic evaluation. Their paper demonstrates that references collected with conventional human translation workflows exhibit poor diversity, which can bias automatic evaluations and unfairly penalize systems that produce fluent and adequate outputs. To address this issue, they propose a paraphrasing approach in which linguists produce diverse paraphrased versions of the existing reference translations.

The methodology involves collecting multiple types of references for the WMT 2019 English-German translation task. Alongside the standard references, the authors collected additional human translations and instructed linguists to produce paraphrased references. This setup allowed them to investigate how paraphrasing affects metric reliability, particularly its correlation with human assessments.

Numerical Results and Observations

The paper presents empirical evidence that paraphrased references yield higher correlation with human judgments across several automatic evaluation metrics, including BLEU, BERTScore, and YiSi. The improvement is particularly pronounced for high-quality system submissions, suggesting that paraphrased references provide a more accurate evaluation in competitive scenarios.
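
To make this correlation analysis concrete, the sketch below shows one common way to relate system-level metric scores to human judgments: score each system's output against a reference set with sacrebleu and compute Kendall's tau against human ratings. It is a minimal illustration, not the authors' code; the system outputs, reference sets, and human scores are hypothetical placeholders.

```python
# Minimal, illustrative sketch (not the authors' code): correlate system-level
# BLEU with human judgments under two different reference sets.
# All data below are hypothetical placeholders.
from sacrebleu.metrics import BLEU
from scipy.stats import kendalltau

# Hypothetical outputs from three MT systems for the same two source sentences.
system_outputs = {
    "system_A": ["Der Hund bellt laut .", "Es regnet heute ."],
    "system_B": ["Der Hund bellt .", "Heute regnet es ."],
    "system_C": ["Der laute Hund bellt .", "Heute es regnet ."],
}
standard_refs = ["Der Hund bellt laut .", "Es regnet heute ."]     # conventional reference
paraphrased_refs = ["Laut bellt der Hund .", "Heute regnet es ."]  # paraphrased reference
human_scores = {"system_A": 0.71, "system_B": 0.83, "system_C": 0.55}  # hypothetical ratings

bleu = BLEU()

def corpus_bleu(hypotheses, references):
    # sacrebleu expects a list of reference streams; here a single stream.
    return bleu.corpus_score(hypotheses, [references]).score

for label, refs in [("standard", standard_refs), ("paraphrased", paraphrased_refs)]:
    metric = [corpus_bleu(hyps, refs) for hyps in system_outputs.values()]
    human = [human_scores[name] for name in system_outputs]
    tau, _ = kendalltau(metric, human)
    print(f"{label} references: Kendall tau vs. human = {tau:.2f}")
```

In practice such studies compare many systems, so the toy example above only illustrates the mechanics of the comparison.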

Additionally, the authors tested multi-reference BLEU, comparing the conventional use of multiple reference sets with a novel composition built from high-quality selections among the alternative references. Contrary to common assumptions, conventional multi-reference BLEU did not show better correlation with human evaluation than single-reference BLEU when the references shared similar translationese characteristics.
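
For reference, the conventional multi-reference computation discussed here can be reproduced with sacrebleu by passing each additional reference as its own reference stream, as in the hypothetical sketch below. This shows the standard formulation the paper questions, not the authors' alternative composition.

```python
# Minimal sketch of conventional multi-reference BLEU with sacrebleu.
# Each reference set is passed as a separate stream, aligned sentence-by-sentence
# with the hypotheses. All strings below are hypothetical placeholders.
from sacrebleu.metrics import BLEU

hypotheses = ["Der Hund bellt laut .", "Heute regnet es ."]     # MT output
refs_standard = ["Der Hund bellt laut .", "Es regnet heute ."]  # original reference
refs_extra = ["Laut bellt der Hund .", "Heute regnet es ."]     # additional human translation

bleu = BLEU()
single = bleu.corpus_score(hypotheses, [refs_standard]).score
multi = bleu.corpus_score(hypotheses, [refs_standard, refs_extra]).score

print(f"single-reference BLEU: {single:.1f}")
print(f"multi-reference BLEU:  {multi:.1f}")
```

Because multi-reference BLEU counts a hypothesis n-gram as correct if it appears in any of the references, adding further references that share the same translationese style contributes little new information, which is consistent with the paper's finding that the conventional formulation does not help for high-quality output.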

Implications and Future Directions

The findings of this paper hold significant implications for the development and evaluation of MT systems. First, the results suggest that automatic metrics may systematically undervalue techniques known to produce more natural output, such as back-translation and automatic post-editing (APE) augmentation, because of biases toward translationese-style references. The research therefore underscores the need to rework evaluation frameworks to accommodate more diverse and nuanced reference sets.

Given these insights, future work could focus on refining paraphrasing methodologies to further increase the fidelity of automatic evaluation. Additionally, while paraphrasing improves correlation with human judgments, the complexity it introduces for human raters warrants further investigation into how rating methodologies might be adapted or expanded to keep assessments reliable.

Ultimately, the paper advocates for wider adoption of diversified reference translations, particularly paraphrased references, when assessing MT outputs. The release of the paraphrased reference datasets invites the broader research community to reassess previous reliability claims and to recalibrate evaluation processes so that they reward more naturally fluent outputs.

Authors (3)
  1. Markus Freitag (49 papers)
  2. David Grangier (55 papers)
  3. Isaac Caswell (19 papers)
Citations (141)