Evaluation of Grammaticality in Character-level Neural Machine Translation
In the paper "How Grammatical is Character-level Neural Machine Translation? Assessing MT Quality with Contrastive Translation Pairs," Rico Sennrich presents an approach to evaluating how well neural machine translation (NMT) systems model specific linguistic phenomena, a question that aggregate metrics such as BLEU and coarse-grained error analyses are ill-suited to answer. Exploiting the fact that neural models assign a probability to any given translation, the author introduces a method based on contrastive translation pairs that automatically detects specific translation errors and thereby assesses grammaticality.
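Concretely, because an encoder-decoder model factorizes the probability of a translation token by token, both members of a pair can be scored exactly, without any decoding. The chain-rule factorization below is standard NMT notation, not notation reproduced from the paper:

```latex
% Probability that a model with parameters \theta assigns to a target
% sentence y = (y_1, ..., y_T) given a source sentence x; the
% higher-scoring member of each contrastive pair counts as the
% model's "choice".
P(y \mid x; \theta) = \prod_{t=1}^{T} P(y_t \mid y_{<t}, x; \theta)
```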
Key Methodology
The paper introduces LingEval97, a dataset of 97,000 contrastive translation pairs derived from the WMT English-to-German translation task. Each pair consists of a correct reference translation and a contrastive variant into which a specific error has been introduced, enabling automatic evaluation of phenomena such as long-distance agreement, transliteration, and the faithful translation of polarity. The evaluation metric is the frequency with which a model assigns a higher probability to the reference than to the contrastive translation, giving direct insight into how the system handles each phenomenon.
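A minimal sketch of the resulting evaluation loop in Python. The `log_prob` callable, the tuple layout of `pairs`, and the German example sentences are illustrative assumptions, not artifacts of the paper or of LingEval97:

```python
def contrastive_accuracy(log_prob, pairs):
    """Fraction of pairs where the model scores the correct reference
    above the contrastive (erroneous) variant.

    log_prob: hypothetical callable returning the model's conditional
        log-probability log P(target | source).
    pairs: iterable of (source, reference, contrastive) strings.
    """
    wins = 0
    total = 0
    for source, reference, contrastive in pairs:
        if log_prob(source, reference) > log_prob(source, contrastive):
            wins += 1
        total += 1
    return wins / total

# Constructed pair illustrating German subject-verb agreement;
# not an actual item from LingEval97.
pairs = [
    ("I believe that he sleeps.",
     "Ich glaube, dass er schläft.",    # correct reference
     "Ich glaube, dass er schlafen."),  # contrastive: wrong verb number
]

# Stand-in scorer for demonstration only; a real evaluation would call
# into the trained NMT model's scoring interface.
dummy = lambda src, tgt: -len(tgt)
print(contrastive_accuracy(dummy, pairs))
```

Because the method only requires scoring, not search, the evaluation is cheap and fully deterministic.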
Experimental Evaluation
Three NMT systems were evaluated: BPE-to-BPE, BPE-to-char, and char-to-char, each with hyperparameters tuned for its architecture. Notably, the paper finds that character-level models outperform byte-pair encoding (BPE) models on transliteration. However, they perform worse on morphosyntactic agreement and on translating discontiguous verb-particle constructions, with the gap widening as the distance between the agreeing elements increases.
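The distance effect reported for agreement can be made concrete with a small extension of the sketch above: bucket the pairs by the token distance between the agreeing elements and compute per-bucket accuracy. The annotation format here is again an assumption for illustration:

```python
from collections import defaultdict

def accuracy_by_distance(log_prob, annotated_pairs):
    """Contrastive accuracy bucketed by the token distance between the
    agreeing elements, mirroring the paper's subject-verb agreement
    analysis.

    annotated_pairs: iterable of (source, reference, contrastive,
        distance) tuples, where distance is an int.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for source, reference, contrastive, distance in annotated_pairs:
        total[distance] += 1
        if log_prob(source, reference) > log_prob(source, contrastive):
            correct[distance] += 1
    return {d: correct[d] / total[d] for d in sorted(total)}
```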
The findings highlight the trade-off inherent in character-level decoders: while they offer advantages in handling novel words, including transliterations, they are less reliable at maintaining grammatical consistency over long sequences. This limitation is most evident in subject-verb agreement when the agreeing elements are far apart.
Implications and Future Directions
The paper proposes a significant methodological advance in the evaluation of NMT systems, enriching the understanding of linguistic inadequacies in translation models. These insights have valuable implications for further research: the use of contrastive translation pairs could foster the development of hybrid models that overcome the observed trade-off by integrating the strengths of both character- and subword-level processing.
Moreover, by facilitating a nuanced assessment of translation errors, the proposed method could stimulate advancements in both model architecture and training paradigms, paving the way for more linguistically robust translation systems. Future research could explore alternative architectures like dilated convolutional networks or hybrid word-character models, assessing their performance on these challenging linguistic phenomena.
Conclusion
Rico Sennrich’s work introduces an evaluation framework that sheds light on specific linguistic capabilities of NMT systems. By targeting granular translation errors through contrastive pairs, the paper advocates a shift toward more focused error diagnostics in machine translation evaluation, contributing to both theoretical understanding and practical progress in the field. The methodology promises to be a valuable tool for improving translation accuracy and grammatical coherence across diverse language pairs.