
To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation (2107.10821v2)

Published 22 Jul 2021 in cs.CL

Abstract: Automatic metrics are commonly used as the exclusive tool for declaring the superiority of one machine translation system's quality over another. The community choice of automatic metric guides research directions and industrial developments by deciding which models are deemed better. Evaluating metrics correlations with sets of human judgements has been limited by the size of these sets. In this paper, we corroborate how reliable metrics are in contrast to human judgements on -- to the best of our knowledge -- the largest collection of judgements reported in the literature. Arguably, pairwise rankings of two systems are the most common evaluation tasks in research or deployment scenarios. Taking human judgement as a gold standard, we investigate which metrics have the highest accuracy in predicting translation quality rankings for such system pairs. Furthermore, we evaluate the performance of various metrics across different language pairs and domains. Lastly, we show that the sole use of BLEU impeded the development of improved models leading to bad deployment decisions. We release the collection of 2.3M sentence-level human judgements for 4380 systems for further analysis and replication of our work.

Evaluation of Automatic Metrics for Machine Translation

This paper presents a comprehensive evaluation of automatic metrics used in the assessment of machine translation (MT) systems, emphasizing their efficacy compared to human judgments. The authors identify an overarching reliance on automatic metrics, often in place of more time-consuming and costly human evaluations, which can lead to incorrect determinations of system quality and suboptimal development directions. By leveraging what is identified as the largest collection of human judgment data in the field—comprising 2.3 million sentence-level evaluations across 4380 systems—the authors provide detailed insights into the alignment between various metrics and human assessments.

Key Findings

  1. Comparison with Human Judgments: The authors demonstrate that automatic metrics often approximate human judgment poorly. Common metrics can be skewed by factors such as translationese in the test data, or they fail to weight the severity of translation errors, either of which can lead to misguided development decisions.
  2. Metric Evaluation Methodology: The paper proposes a methodology for pairwise system-level evaluation of metrics: taking human judgment as the gold standard, it measures how accurately each metric predicts which system in a pair humans rank higher (a minimal sketch of this pairwise-accuracy computation follows this list).
  3. Limitation of BLEU: The paper shows that over-reliance on BLEU has stymied advances in MT model development; its insensitivity to genuine quality improvements masks better systems and contributes to incorrect deployment decisions.
  4. Recommendations: The paper suggests best practices for utilizing automatic metrics:
    • Use pretrained models like COMET as primary metrics.
    • Employ string-based metrics like ChrF for unsupported languages and secondary verification.
    • Avoid BLEU given its inferior performance.
    • Conduct paired significance tests to avoid conclusions driven by random variation (a sketch combining ChrF scoring with a paired bootstrap test also follows this list).
    • Release system outputs on public test sets for reproducibility and further analysis.
  5. Metric Performance: Pretrained metrics generally surpass string-based metrics in accuracy across language pairs and domains. COMET emerges as the most reliable metric for pairwise comparisons, while ChrF is recommended when more advanced metrics are infeasible.
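
To make the pairwise methodology in point 2 concrete, the sketch below computes the accuracy with which a metric reproduces human pairwise rankings of systems. It is a minimal illustration assuming one system-level human score and one metric score per system; the variable names and toy numbers are hypothetical, not taken from the released data.

```python
from itertools import combinations

def pairwise_accuracy(human_scores, metric_scores):
    """Fraction of system pairs on which the metric's ranking agrees
    with the human ranking (ties are counted as disagreements)."""
    assert set(human_scores) == set(metric_scores)
    agree = total = 0
    for sys_a, sys_b in combinations(human_scores, 2):
        human_delta = human_scores[sys_a] - human_scores[sys_b]
        metric_delta = metric_scores[sys_a] - metric_scores[sys_b]
        total += 1
        if human_delta * metric_delta > 0:  # same sign => same ordering
            agree += 1
    return agree / total if total else 0.0

# Hypothetical system-level scores for four MT systems.
human = {"sysA": 0.31, "sysB": 0.12, "sysC": 0.45, "sysD": 0.08}
metric = {"sysA": 78.2, "sysB": 74.9, "sysC": 80.1, "sysD": 76.3}

print(f"pairwise accuracy: {pairwise_accuracy(human, metric):.2f}")  # 0.83
```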

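The second sketch illustrates two of the recommendations above: scoring with a string-based metric (chrF via the sacrebleu package) and a hand-rolled paired bootstrap resampling test. The toy sentences are invented for illustration, and sacrebleu also ships its own significance-testing utilities; this loop only shows the idea.

```python
import random

import sacrebleu  # pip install sacrebleu

def paired_bootstrap_chrf(hyps_a, hyps_b, refs, n_samples=1000, seed=12345):
    """Paired bootstrap resampling on chrF: fraction of resampled test
    sets on which system A scores higher than system B."""
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        sub_a = [hyps_a[i] for i in idx]
        sub_b = [hyps_b[i] for i in idx]
        sub_r = [refs[i] for i in idx]
        score_a = sacrebleu.corpus_chrf(sub_a, [sub_r]).score
        score_b = sacrebleu.corpus_chrf(sub_b, [sub_r]).score
        if score_a > score_b:
            wins_a += 1
    return wins_a / n_samples

# Hypothetical outputs of two MT systems plus their references.
refs = ["The cat sat on the mat.", "It is raining today.", "She closed the door."]
sys_a = ["The cat sat on the mat.", "It rains today.", "She shut the door."]
sys_b = ["A cat is on the mat.", "Today it is raining.", "She closed a door."]

print("chrF A:", round(sacrebleu.corpus_chrf(sys_a, [refs]).score, 2))
print("chrF B:", round(sacrebleu.corpus_chrf(sys_b, [refs]).score, 2))
print("P(A beats B under resampling):", paired_bootstrap_chrf(sys_a, sys_b, refs))
```
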
Implications and Future Directions

The findings have practical implications that underline the need for a shift in standard MT evaluation practice. The clear inadequacy of BLEU as a standalone metric suggests that the community should adopt more reliable, if more complex, metrics such as COMET, rather than let historical inertia or ease of use hold evaluation back.

The release of the extensive human judgment dataset provides an invaluable resource for replication studies and the development of future metrics. Such openness in data sharing fosters collaborative efforts to refine MT systems and their evaluation, which is critical for the field's long-term progression.

On a theoretical level, the research points to a broader understanding of how automatic metrics can be optimized to reflect human judgment more accurately. The effectiveness of COMET-src as a reference-free metric opens new avenues for evaluating translation outputs in resource-constrained contexts.
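
As a rough illustration of reference-free scoring with a pretrained metric, the snippet below uses the unbabel-comet package's quality-estimation interface. The checkpoint name and the exact predict API are assumptions about that library rather than details taken from the paper, so treat this as a sketch.

```python
# pip install unbabel-comet  (assumed package name)
from comet import download_model, load_from_checkpoint

# A reference-free (source-based) COMET quality-estimation checkpoint;
# the model name here is an assumption, not prescribed by the paper.
model_path = download_model("Unbabel/wmt22-cometkiwi-da")
model = load_from_checkpoint(model_path)

# Reference-free scoring needs only the source and the MT output.
data = [
    {"src": "Der Hund schläft auf dem Sofa.", "mt": "The dog sleeps on the sofa."},
    {"src": "Es regnet heute.", "mt": "It is raining today."},
]
output = model.predict(data, batch_size=8, gpus=0)
print(output.scores)        # per-segment quality scores
print(output.system_score)  # corpus-level average
```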

In conclusion, this paper provides robust evidence for re-evaluating the metrics traditionally used in MT research and proposes a clear path forward for enhancing metric reliability, resulting in more accurate assessments of machine translation systems. The continued evolution and adoption of refined metrics will undoubtedly play a pivotal role in the field's advancement.

Authors (6)
  1. Tom Kocmi (29 papers)
  2. Christian Federmann (9 papers)
  3. Roman Grundkiewicz (16 papers)
  4. Marcin Junczys-Dowmunt (29 papers)
  5. Hitokazu Matsushita (3 papers)
  6. Arul Menezes (15 papers)
Citations (190)