Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Machine Translation (1808.10432v1)

Published 30 Aug 2018 in cs.CL

Abstract: We reassess a recent study (Hassan et al., 2018) that claimed that machine translation (MT) has reached human parity for the translation of news from Chinese into English, using pairwise ranking and considering three variables that were not taken into account in that previous study: the language in which the source side of the test set was originally written, the translation proficiency of the evaluators, and the provision of inter-sentential context. If we consider only original source text (i.e. not translated from another language, or translationese), then we find evidence showing that human parity has not been achieved. We compare the judgments of professional translators against those of non-experts and discover that those of the experts result in higher inter-annotator agreement and better discrimination between human and machine translations. In addition, we analyse the human translations of the test set and identify important translation issues. Finally, based on these findings, we provide a set of recommendations for future human evaluations of MT.

Attaining the Unattainable? Reassessing Claims of Human Parity in Neural Machine Translation

This paper, authored by Antonio Toral and colleagues, critically reassesses a prior Microsoft study (Hassan et al., 2018) claiming that neural machine translation (NMT) systems have achieved human parity in translating news from Chinese to English. The authors re-examine the claim by incorporating three factors absent from the original evaluation: the language in which the source text was originally written, the evaluators' translation proficiency, and the availability of inter-sentential context. Under these stricter conditions, the claim of human parity does not hold.

The paper begins by identifying a limitation of the test set used in the Microsoft study: part of its source side was not original Chinese but text translated from another language. Such translationese exhibits simplification, explicitation, and normalization, properties that make it easier to translate and can therefore inflate machine translation scores. The authors accordingly refine the methodology, restricting their evaluation to source material originally written in Chinese.
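
To make this filtering step concrete, the sketch below keeps only test items whose source side was originally written in Chinese. The item structure and the `orig_lang` field are hypothetical illustrations, not the paper's actual data format.

```python
# Hypothetical WMT-style test items; such sets record the original
# language of each document, modelled here as an "orig_lang" field.
test_set = [
    {"id": 1, "src": "...", "orig_lang": "zh"},  # original Chinese
    {"id": 2, "src": "...", "orig_lang": "en"},  # translated into Chinese
    {"id": 3, "src": "...", "orig_lang": "zh"},
]

# Keep only original-Chinese sources, excluding translationese.
original_only = [item for item in test_set if item["orig_lang"] == "zh"]
print(f"{len(original_only)} of {len(test_set)} items kept")
```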

Furthermore, the paper differentiates between evaluations conducted by professional translators and those by non-experts. The distinction proves crucial: professional translators showed higher inter-annotator agreement and discriminated better between human and machine-generated translations. The authors posit that non-experts, such as bilingual crowd workers, may lack sufficient awareness of translation nuances, leading to less discriminative assessments.
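
To illustrate what the agreement comparison measures, here is a minimal sketch of Cohen's kappa, a standard inter-annotator agreement statistic in MT evaluation, computed over pairwise preference judgments. The annotators and labels below are invented for illustration and are not the paper's data.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (freq_a[c] / n) * (freq_b[c] / n)
        for c in set(freq_a) | set(freq_b)
    )
    return (observed - expected) / (1 - expected)

# Invented pairwise judgments: which translation each annotator
# preferred per item ("HT" = human translation, "MT" = machine).
expert_1 = ["HT", "HT", "MT", "HT", "HT", "MT", "HT", "HT"]
expert_2 = ["HT", "HT", "MT", "HT", "MT", "MT", "HT", "HT"]

print(f"kappa = {cohen_kappa(expert_1, expert_2):.2f}")  # ~0.71
```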

Inter-sentential context constitutes another dimension of the reassessment. Evaluating sentences in isolation overlooks document-level coherence and cohesion, elements that human translators handle naturally but that remain challenging for NMT systems. The authors therefore argue for a document-level evaluation approach, which provides a more holistic assessment of translation quality.
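
As a sketch of what document-level evaluation entails, the snippet below pairs each sentence with its preceding sentences so an evaluator can judge it in context. This is an illustrative setup under assumed data, not the paper's actual evaluation interface.

```python
def make_eval_items(document, window=2):
    """Attach up to `window` preceding sentences as context for each item."""
    return [
        {"context": document[max(0, i - window):i], "sentence": s}
        for i, s in enumerate(document)
    ]

# A pronoun like "She" is only resolvable with the preceding sentence,
# which is exactly what isolated-sentence evaluation throws away.
doc = [
    "The minister resigned on Tuesday.",
    "She cited health reasons.",
    "Her successor has not been named.",
]
for item in make_eval_items(doc):
    print(item)
```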

The statistical analyses are reported in detail. When only original source text is considered, human translations are judged significantly better than those generated by the NMT system. Evaluators were also consistently more critical when shown sentences within their document context, underscoring the importance of context in translation evaluation.
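
As an illustration of how such a preference comparison can be tested for significance, here is a minimal exact two-sided sign test over pairwise judgment counts. The counts are invented, and the paper's own statistical procedure may differ.

```python
from math import comb

def sign_test_p(wins_ht, wins_mt):
    """Exact two-sided sign test under H0: both systems equally preferred."""
    n = wins_ht + wins_mt
    k = max(wins_ht, wins_mt)
    # P(X >= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Invented counts; ties are excluded, as is usual for a sign test.
print(f"p = {sign_test_p(wins_ht=140, wins_mt=100):.4f}")  # p < 0.05
```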

In the implications section, the authors offer a set of recommendations for future human evaluations of machine translation. These include using test sets whose source side is original text rather than translationese, employing professional translators as evaluators, providing inter-sentential context, and using high-quality human translations as reference data. These recommendations aim to counteract the overhyping of NMT systems and underscore the need for rigorous evaluation methodologies.

This reassessment offers valuable insight into the capabilities and limits of neural machine translation. While the paper refutes the human-parity claim under these stricter conditions and underscores the difficulty of matching human translation quality, it also acknowledges the substantial progress made in the field. Its findings motivate more methodical evaluation frameworks that can gauge milestones in translation quality with greater precision.

Overall, the paper serves as a reminder that, despite significant advances in machine translation, the nuanced complexity of human language, contextual understanding, and the craft of translation remain formidable obstacles to genuine human parity.

Authors (4)
  1. Antonio Toral (35 papers)
  2. Sheila Castilho (6 papers)
  3. Ke Hu (57 papers)
  4. Andy Way (46 papers)
Citations (181)