Quality and Quantity of Machine Translation References for Automatic Metrics (2401.01283v5)

Published 2 Jan 2024 in cs.CL

Abstract: Automatic machine translation metrics typically rely on human translations to determine the quality of system translations. Common wisdom in the field dictates that the human references should be of very high quality. However, there are no cost-benefit analyses that could be used to guide practitioners who plan to collect references for machine translation evaluation. We find that higher-quality references lead to better metric correlations with humans at the segment-level. Having up to 7 references per segment and taking their average (or maximum) helps all metrics. Interestingly, the references from vendors of different qualities can be mixed together and improve metric success. Higher quality references, however, cost more to create and we frame this as an optimization problem: given a specific budget, what references should be collected to maximize metric success. These findings can be used by evaluators of shared tasks when references need to be created under a certain budget.

Analyzing the Quality and Quantity of Machine Translation References for Automatic Metrics

The paper "Quality and Quantity of Machine Translation References for Automatic Metrics" critically examines the influence of reference quality and quantity on the efficacy of automatic machine translation (MT) evaluation metrics. The research is centralized around a key query: How do varying levels of reference quality and the inclusion of additional reference translations impact the performance of automated MT metrics, and what is the optimal balance when faced with a budget constraint?

The research acknowledges that automatic MT evaluation largely depends on high-quality human reference translations. Human evaluation is widely regarded as the gold standard, but it is neither scalable nor cheaply reproducible; automatic metrics have therefore emerged as a practical alternative, and their success is measured by how well their scores correlate with human assessments of machine-generated translations.
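The notion of "metric success" used throughout the paper is, in essence, segment-level correlation between metric scores and human judgments. The snippet below is a minimal sketch of that computation; the score lists are toy values, and the choice of Kendall's tau and Pearson's r as correlation statistics is an assumption made here for illustration.

```python
# Minimal sketch: quantify "metric success" as the correlation between
# per-segment automatic metric scores and human judgments.
# The numbers below are illustrative toy values, not data from the paper.
from scipy.stats import kendalltau, pearsonr

metric_scores = [0.71, 0.55, 0.83, 0.40, 0.66]  # e.g. chrF or COMET per segment
human_scores = [78, 52, 90, 35, 70]             # e.g. direct-assessment scores

tau, _ = kendalltau(metric_scores, human_scores)
r, _ = pearsonr(metric_scores, human_scores)
print(f"Kendall tau: {tau:.3f}, Pearson r: {r:.3f}")
```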

The researchers conducted experiments on an English-to-Czech translation task, incorporating references of differing quality, from standard translation vendors to carefully produced, academia-grade translations. The data was re-annotated to form a multi-reference dataset, enabling a thorough examination of both factors.

Findings

  1. Impact of Reference Quality: The paper establishes that reference quality considerably influences metric performance. Poor-quality references, as anticipated, degrade the correlation between automatic scores and human assessments. The research also cites cases where the highest-quality human translations (termed "optimal reference translations") do not always yield the highest correlations, potentially because translation shifts challenge surface-level matching in metrics like BLEU.
  2. Advantage of Multiple References: Incorporating multiple references, particularly when per-reference scores are averaged or their maximum is taken, significantly enhances metric correlations (see the first sketch after this list). Improvements plateau at around seven references, beyond which the benefit diminishes. This finding aligns with previous research indicating that the number of references can substitute for, or complement, test set size in enhancing metric reliability.
  3. Budget Allocation for References: Importantly, the paper frames reference collection as an optimization problem: given a fixed budget, select the combination of reference qualities and quantities that maximizes metric success (a toy version of this framing is sketched in the second snippet below). This pragmatic framing helps navigate the trade-off between reference quality and collection cost.
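The following sketch illustrates the average/max aggregation over multiple references described in finding 2, using sentence-level chrF from sacrebleu as a stand-in single-reference metric. The hypothesis and reference sentences are invented, and the exact aggregation details in the paper may differ.

```python
# Sketch: score a hypothesis against each reference separately, then
# aggregate by averaging or taking the maximum, as in finding 2.
from statistics import mean

import sacrebleu

hypothesis = "The cat sat on the mat."
references = [
    "The cat was sitting on the mat.",
    "A cat sat on the mat.",
    "The cat sat upon the mat.",
]

# One chrF score per reference.
per_ref = [sacrebleu.sentence_chrf(hypothesis, [ref]).score for ref in references]

avg_score = mean(per_ref)   # "average" aggregation
max_score = max(per_ref)    # "maximum" aggregation
print(f"per-reference: {per_ref}")
print(f"avg: {avg_score:.2f}, max: {max_score:.2f}")
```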

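The budget-allocation idea in finding 3 can be pictured as a small combinatorial search: choose which reference sets to commission so that total cost stays within budget and expected metric correlation is maximized. The vendor names, costs, per-vendor correlation gains, and the brute-force search below are all hypothetical illustrations, not the paper's actual procedure or data.

```python
# Toy sketch of budget-constrained reference selection (finding 3).
# Candidates are (name, cost, estimated gain in metric correlation);
# all values are made up for illustration.
from itertools import combinations

candidates = [
    ("vendor_A", 10, 0.020),
    ("vendor_B", 15, 0.025),
    ("vendor_C", 25, 0.032),
    ("optimal_ref", 40, 0.035),
]
budget = 50

best_subset, best_gain = (), 0.0
for k in range(1, len(candidates) + 1):
    for subset in combinations(candidates, k):
        cost = sum(c for _, c, _ in subset)
        gain = sum(g for _, _, g in subset)  # crude additivity assumption
        if cost <= budget and gain > best_gain:
            best_subset, best_gain = subset, gain

print("collect:", [name for name, _, _ in best_subset])
print("expected correlation gain:", round(best_gain, 3))
```
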
Implications

Practically, this paper informs MT practitioners on the cost-effective construction of reference corpora by showing that a mixture of reference qualities can still improve metric performance. It underscores the dual significance of reference quality and quantity in automated MT evaluation, advocating for a selection strategy informed by available resources and project demands.

Theoretically, the findings invite further research into robust metric design that can accommodate subtle translation shifts and align with semantic rather than surface-level similarity. How well current metrics handle high-quality human translations that deviate from translationese remains an open area for exploration, especially as MT models increasingly emulate human-like fluency.

Speculations on Future Developments

Looking forward, ongoing advancements in MT systems, particularly in handling nuanced semantic shifts, could drive the evolution of metrics that better exploit sophisticated reference translations. Additionally, given the growing ability to generate synthetic references, future work could explore using neural MT systems to produce auxiliary references, potentially enhancing reference diversity at low cost.

In conclusion, this research contributes a detailed analysis and methodology for optimizing reference use in MT evaluation, setting a foundational approach for future community standards and helping guide resource allocation in the development of MT evaluation benchmarks.

Authors (2)
  1. Vilém Zouhar (41 papers)
  2. Ondřej Bojar (91 papers)
Citations (6)