A Challenge Set Approach to Evaluating Machine Translation
In "A Challenge Set Approach to Evaluating Machine Translation," Pierre Isabelle, Colin Cherry, and George Foster introduce a methodology for assessing machine translation (MT) systems through a challenge set: a small collection of sentences built around linguistic phenomena known to be difficult to translate. Focusing on English-to-French translation, the paper probes the strengths of neural machine translation (NMT) and identifies weaknesses that remain unresolved.
Evaluation Methodology
The authors propose a challenge set of carefully constructed sentences, each embodying a specific divergence phenomenon, to evaluate an MT system's ability to resolve difficult linguistic issues. Unlike traditional evaluations that rely largely on BLEU and other surface-level metrics averaged over large test sets, a challenge set yields a fine-grained assessment: each sentence is judged on whether the targeted grammatical, lexical, or syntactic phenomenon was handled correctly, rather than on overall translation accuracy.
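The paper's evaluation is manual: each challenge sentence is paired with a yes/no question about the targeted phenomenon, which human judges answer for each system's output. A minimal sketch of how such items and judgments might be represented and aggregated (hypothetical names, not the authors' code):

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class ChallengeItem:
    source: str    # English sentence exhibiting the phenomenon
    category: str  # e.g. "morpho-syntactic"
    question: str  # yes/no question a human judge answers

def success_rates(judgments):
    """Per (system, category) fraction of items translated acceptably.

    `judgments` is an iterable of (item, system_name, passed) triples
    collected from human annotators.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for item, system, ok in judgments:
        key = (system, item.category)
        total[key] += 1
        passed[key] += ok
    return {key: passed[key] / total[key] for key in total}

# Illustrative usage with a single item and two mock judgments.
item = ChallengeItem(
    source="The repeated calls from his mother annoyed him.",
    category="morpho-syntactic",
    question="Does the French verb agree with its singular subject?",
)
rates = success_rates([(item, "NMT", True), (item, "PBMT", False)])
# rates[("NMT", "morpho-syntactic")] == 1.0
```

Because each item isolates one phenomenon, the per-category rates localize a system's failures in a way a single corpus-level BLEU score cannot.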
Linguistic Divergences
The evaluation is broadly divided into three types of divergences:
- Morpho-Syntactic Divergences: These include phenomena such as subject-verb agreement and the subjunctive mood, where specific grammatical information must be recovered to translate correctly, especially into a morphologically richer language such as French.
- Lexico-Syntactic Divergences: These involve translations that require restructuring sentence syntax, like argument switching or handling manner-of-movement verbs.
- Purely Syntactic Divergences: These involve syntactic patterns unique to specific languages, such as pronoun placement or stranded prepositions, which demand significant manipulation during translation.
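To make the taxonomy concrete, the three divergence types might be encoded as tagged source/reference pairs. The examples below are textbook illustrations of each phenomenon, not necessarily items from the paper's own set:

```python
# One illustrative challenge item per divergence type; the French
# reference shows what a correct system must produce.
CHALLENGE_ITEMS = [
    {
        "category": "morpho-syntactic",
        "source": "The cats that live next door sleep all day.",
        "reference": "Les chats qui habitent à côté dorment toute la journée.",
        "phenomenon": "plural subject-verb agreement across a relative clause",
    },
    {
        "category": "lexico-syntactic",
        "source": "He misses his country.",
        "reference": "Son pays lui manque.",
        "phenomenon": "argument switching: the English object becomes the French subject",
    },
    {
        "category": "purely syntactic",
        "source": "Who did you talk to?",
        "reference": "À qui as-tu parlé ?",
        "phenomenon": "a preposition stranded in English must precede the WH-word in French",
    },
]

categories = {item["category"] for item in CHALLENGE_ITEMS}
# categories == {"morpho-syntactic", "lexico-syntactic", "purely syntactic"}
```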
Results and Analysis
The study evaluated four MT systems: two phrase-based systems (PBMT-1 and PBMT-2) and two NMT systems (Nematus and Google's GNMT). Results show that the NMT systems clearly surpass the phrase-based ones, particularly on morpho-syntactic phenomena: NMT correctly handles many more of the subject-verb agreement cases and related intricacies. GNMT, for instance, coped well with complex structures involving WH-movement and pronoun clitics.
Nonetheless, the study reveals persistent challenges for NMT, such as idiomatic expressions, which are often rendered literally, and phenomena that demand broader semantic generalization. Even GNMT, with its extensive training data and advanced architecture, falls short on these cases. The analysis underscores NMT's reliance on vast datasets and points to areas that require further research.
Implications and Future Directions
This paper contributes important insights into the current capabilities and limitations of MT systems, showing that even a relatively close language pair like English-French requires sophisticated handling for high-quality translation. Future research could explore more automatic ways to construct and score challenge sets, or design MT systems that better capture the linguistic phenomena these sets target. Refining NMT models to learn more systematic linguistic rules and to handle semantic detail will be crucial for advancing MT technology.
Conclusion
Through detailed examination and evaluation, the paper pinpoints linguistic phenomena that current MT technology still fails to resolve, and suggests directions for improving NMT. The challenge set approach offers a complementary tool to traditional metrics, enabling a more granular assessment of MT systems' capabilities than aggregate translation-quality scores.