Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Machine Translation
This paper, authored by Antonio Toral et al., critically reassesses a prior claim by Microsoft that neural machine translation (NMT) systems have achieved human parity in translating news from Chinese to English. The authors re-examine the claim by incorporating factors absent from the original evaluation: whether the source text was originally written in the source language, the evaluators' translation proficiency, and the availability of inter-sentential context. Their analysis shows that the human-parity claim does not hold when assessed under these stricter conditions.
The paper begins by identifying potential limitations in the Microsoft evaluation. In particular, part of the test set consisted of translated text, which exhibits the hallmark features of translationese, namely simplification, explicitation, and normalization, and can therefore be easier to translate. This analysis suggests that using translationese as source input may favor machine translation systems, so the authors propose a refined methodology that evaluates only text originally written in the source language.
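To make the translationese notion concrete, the following minimal sketch (not from the paper) uses type-token ratio, a rough proxy for lexical simplification: translated text tends to reuse a smaller vocabulary than original text. The sample passages are invented, and reliable comparisons of this kind require corpus-scale text.

```python
# Illustrative only: type-token ratio (TTR) as a rough proxy for the
# "simplification" feature of translationese. Lower lexical variety is one
# weak signal that a text may be translated rather than original.
import string

def type_token_ratio(text: str) -> float:
    """Ratio of unique words (types) to total words (tokens)."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = cleaned.split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Invented sample passages for demonstration; in practice TTR is
# computed over whole corpora, not single sentences.
original = ("Negotiators hammered out a compromise, sidestepping earlier "
            "disputes and securing tentative backing from both delegations.")
translated = ("The negotiators made a compromise. They avoided the earlier "
              "disputes and got support from the two delegations.")

print(f"original:   TTR = {type_token_ratio(original):.2f}")
print(f"translated: TTR = {type_token_ratio(translated):.2f}")
```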
Furthermore, the paper differentiates between evaluations conducted by professional translators and those by non-experts. This distinction matters: professional translators showed higher inter-annotator agreement and were better at distinguishing human from machine-generated translations. The paper posits that non-experts, such as bilingual crowd workers, may lack awareness of translation nuances, leading to less discriminating assessments.
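As an illustration of how such agreement might be quantified, the sketch below computes Cohen's kappa over hypothetical pairwise judgments ("HT" = human translation better, "MT" = machine translation better, "tie"). The labels and data are assumptions for demonstration, not the paper's.

```python
# A minimal sketch of quantifying inter-annotator agreement with Cohen's
# kappa. Each list element is one evaluator's verdict on an item; the
# data below are hypothetical.
from sklearn.metrics import cohen_kappa_score

professional_a = ["HT", "HT", "MT", "HT", "tie", "HT", "MT", "HT"]
professional_b = ["HT", "HT", "MT", "HT", "HT",  "HT", "MT", "HT"]

crowd_a = ["HT", "MT", "MT", "tie", "tie", "HT", "HT", "MT"]
crowd_b = ["MT", "HT", "MT", "HT",  "tie", "MT", "HT", "HT"]

# Kappa corrects raw agreement for agreement expected by chance:
# values near 1 indicate strong agreement, values near 0 chance level.
print("professionals:", cohen_kappa_score(professional_a, professional_b))
print("crowd workers:", cohen_kappa_score(crowd_a, crowd_b))
```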
Inter-sentential context constitutes another dimension of this reassessment. Evaluating sentences in isolation can overlook document-level coherence and cohesion, elements that human translators naturally handle but that remain challenging for NMT systems. The authors argue for a document-level evaluation approach, which provides a more holistic assessment of translation quality.
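The following hypothetical sketch shows one way evaluation items could be assembled so that judges see preceding sentences from the same document; the data structure and windowing scheme are illustrative assumptions, not the paper's actual evaluation setup.

```python
# Hypothetical sketch: building evaluation items that expose
# inter-sentential context instead of showing sentences in isolation.
from dataclasses import dataclass, field

@dataclass
class EvalItem:
    source: str                            # sentence to be judged
    candidate: str                         # translation under evaluation
    context: list = field(default_factory=list)  # preceding source sentences

def build_items(doc_sentences, translations, window=2):
    """Pair each sentence with up to `window` preceding sentences."""
    items = []
    for i, (src, cand) in enumerate(zip(doc_sentences, translations)):
        items.append(EvalItem(source=src,
                              candidate=cand,
                              context=doc_sentences[max(0, i - window):i]))
    return items

items = build_items(
    ["Sentence 1 of a news story.", "Sentence 2.", "Sentence 3."],
    ["Translation 1.", "Translation 2.", "Translation 3."],
)
print(items[2].context)  # ['Sentence 1 of a news story.', 'Sentence 2.']
```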
The statistical analyses conducted in this paper are carefully detailed. Results demonstrate that, when only text originally written in the source language is considered, human translations outperform those generated by the NMT system. Additionally, evaluators were consistently more critical when they could view sentences within their document context, underscoring the value of document-level evaluation.
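The paper details its own statistical methodology; purely as a generic illustration of how paired quality judgments might be tested, the sketch below applies a Wilcoxon signed-rank test to invented per-item scores for human and machine translations of the same source items.

```python
# Generic illustration, not the paper's exact analysis: testing whether
# paired quality scores for human translations (HT) and machine
# translations (MT) of the same items differ significantly.
from scipy.stats import wilcoxon

# Invented per-item adequacy scores for the same ten source items.
ht_scores = [87, 92, 78, 85, 90, 83, 88, 91, 79, 86]
mt_scores = [82, 90, 71, 84, 85, 80, 83, 89, 74, 81]

# Wilcoxon signed-rank test on the per-item score differences;
# a small p-value suggests a systematic quality gap between HT and MT.
stat, p_value = wilcoxon(ht_scores, mt_scores)
print(f"W = {stat}, p = {p_value:.4f}")
```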
In the implications section, the authors offer a set of recommendations aimed at improving future human evaluations of machine translation systems. These include ensuring that test sets contain only text originally written in the source language, employing professional translators as evaluators, incorporating inter-sentential context, and using high-quality human translations as references. These recommendations aim to counteract the overhyping of NMT systems and underscore the need for rigorous evaluation methodologies.
This reassessment offers valuable insight into the capabilities and limitations of neural machine translation systems. While the paper refutes the human-parity claim under these stricter conditions and underscores the difficulty of matching human translation quality, it also acknowledges the substantial progress made in the field. Its findings advocate a more methodical evaluation framework that can more precisely gauge milestones in translation quality.
Overall, the paper serves as a critical reminder that, despite significant advances in machine translation, the nuanced complexity of human language, contextual understanding, and the craft of translation remain formidable challenges that currently preclude human parity in neural machine translation.