A Call for Clarity in Reporting BLEU Scores (1804.08771v2)

Published 23 Apr 2018 in cs.CL

Abstract: The field of machine translation faces an under-recognized problem because of inconsistency in the reporting of scores from its dominant metric. Although people refer to "the" BLEU score, BLEU is in fact a parameterized metric whose values can vary wildly with changes to these parameters. These parameters are often not reported or are hard to find, and consequently, BLEU scores between papers cannot be directly compared. I quantify this variation, finding differences as high as 1.8 between commonly used configurations. The main culprit is different tokenization and normalization schemes applied to the reference. Pointing to the success of the parsing community, I suggest machine translation researchers settle upon the BLEU scheme used by the annual Conference on Machine Translation (WMT), which does not allow for user-supplied reference processing, and provide a new tool, SacreBLEU, to facilitate this.

Citations (2,746)

Summary

  • The paper highlights how parameter settings and preprocessing variations can cause up to 1.8 BLEU point discrepancies in machine translation evaluations.
  • It introduces SacreBLEU, a Python tool that standardizes BLEU scoring by automating reference handling and ensuring metric-internal processing.
  • Uniform BLEU reporting enhances result reproducibility and comparability across studies, making evaluations more reliable in MT research.

A Call for Clarity in Reporting BLEU Scores

In the paper "A Call for Clarity in Reporting BLEU Scores," Matt Post addresses a critical yet often overlooked inconsistency in machine translation (MT) research: the methodology used to report BLEU scores. BLEU, which stands for Bilingual Evaluation Understudy, has become the de facto standard for evaluating MT systems due to its ease of computation and relative language independence. Yet, as Post underscores, BLEU is not a monolithic metric but a parameterized one, and different parameter choices yield different reported scores.
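For concreteness, the standard BLEU definition (Papineni et al., 2002) makes these parameters visible: with modified n-gram precisions $p_n$, maximum order $N$, weights $w_n$, candidate length $c$, and reference length $r$,

$$\mathrm{BLEU} = \mathrm{BP}\cdot\exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad \mathrm{BP} = \begin{cases} 1 & \text{if } c > r, \\ e^{1 - r/c} & \text{if } c \le r. \end{cases}$$

How $r$ is determined when several references are available, whether and how the $p_n$ are smoothed, and, above all, how the text is tokenized before n-grams are counted are all left to the evaluator, and each choice moves the final number.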

Core Issues in BLEU Reporting

The crux of Post's argument is that the variability in BLEU scores arises from differences in parameter settings and preprocessing schemes. The paper outlines several key issues:

  1. Parameter Dependence: BLEU scores depend on parameters such as the number of references, the maximum n-gram order, and the smoothing method. Issues arise most notably in multi-reference settings, which affect the brevity penalty computation and thereby the final score.
  2. Preprocessing Variability: Preprocessing steps such as tokenization and normalization significantly affect BLEU scores. Different tokenization schemes alone can lead to discrepancies as high as 1.8 points, making cross-paper comparisons unreliable (see the sketch after this list).
  3. Opaque Reporting: Many papers fail to report their BLEU configuration in sufficient detail, thereby making the replication of results and direct comparisons between studies exceedingly difficult.
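The size of this preprocessing effect is easy to reproduce. Below is a minimal sketch (toy sentences of my own, not the paper's data) that scores the same hypothesis against the same reference while varying only the tokenizer SacreBLEU is told to apply; the settings typically yield different numbers.

```python
# Minimal sketch: same segment pair, different SacreBLEU tokenizers.
# Requires: pip install sacrebleu
import sacrebleu

hyps = ["The large-scale model's outputs were, in fact, surprisingly fluent."]
refs = [["The large-scale model's output was, in fact, surprisingly fluent."]]

for tok in ("none", "13a", "intl"):
    bleu = sacrebleu.corpus_bleu(hyps, refs, tokenize=tok)
    print(f"tokenize={tok:>4}  BLEU = {bleu.score:.2f}")
```

The exact values are immaterial; the point is that they differ even though the underlying translation is identical.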

Quantification of BLEU Variability

To substantiate these issues, Post presents empirical data. Table 1 in the paper illustrates BLEU score variation across WMT'17 language pairs (e.g., English to German, Russian, and Finnish), showing a spread of up to 1.8 BLEU points depending on the tokenization and normalization scheme applied to the references. This variation is often larger than the gains reported for new methods, underscoring the need for uniform reporting standards.

Proposed Solutions

Post suggests following the standard used by the annual Conference on Machine Translation (WMT): BLEU is computed on detokenized system output against unmodified references, with tokenization and normalization handled internally by the metric. This standardizes preprocessing and makes scores directly comparable across studies. To facilitate this, Post introduces a Python package named SacreBLEU, designed to automate reference handling and ensure consistent BLEU calculation.

The features of SacreBLEU include:

  • Automatic download and storage of reference test sets for common datasets such as WMT and IWSLT.
  • Application of metric-internal preprocessing to ensure consistent scoring.
  • Generation of a version string that details the settings used for easy reporting and verification.

SacreBLEU aims to remove the user from the preprocessing loop, thereby mitigating the risk of introducing inconsistencies.
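As a rough sketch of how this looks in practice (assuming a recent SacreBLEU 2.x release, whose metric classes expose corpus_score and get_signature; the sentences are illustrative), one passes detokenized output and references and reports the returned signature alongside the score:

```python
# Sketch of the recommended workflow: hand SacreBLEU raw (detokenized) text,
# let it apply its own tokenization, and report the signature with the score.
from sacrebleu.metrics import BLEU

hyps = ["The cat sat on the mat.", "He read the report yesterday."]
refs = [["The cat is sitting on the mat.", "He read the report yesterday."]]

bleu = BLEU()                       # defaults: mixed case, '13a' tokenizer, exp smoothing
result = bleu.corpus_score(hyps, refs)

print(result.score)                 # the BLEU value
print(bleu.get_signature())         # settings string, e.g. nrefs/case/tok/smooth/version fields
```

The signature string is precisely what the paper asks authors to publish: two scores are only comparable when their signatures match.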

Implications and Future Directions

The implications of adopting a standardized BLEU calculation method are substantial for both practical and theoretical aspects of MT research. Practically, consistent BLEU reporting will enhance the reproducibility of results and the comparability of different MT systems. Theoretically, it brings clarity and uniformity to a field where reported improvements are often marginal and can be easily overshadowed by inconsistencies in evaluation.

Future developments in this domain might include extending the standardization effort to other evaluation metrics gaining traction in the community, such as chrF or METEOR. Moreover, as MT models continue to evolve toward more complex neural architectures, standardized evaluation practices will become even more critical.

Conclusion

Matt Post’s paper brings much-needed attention to the inconsistencies in BLEU score reporting. By highlighting the significant variation introduced by different parameter settings and preprocessing schemes, Post makes a compelling case for the adoption of standardized practices, specifically those used by WMT. The introduction of SacreBLEU serves as a practical tool to implement these standards, facilitating better comparability and reproducibility in MT research. Uniformity in BLEU reporting will undoubtedly contribute to clearer scientific communication and more robust advancements in the field of machine translation.
