Evaluating Discourse Phenomena in Neural Machine Translation (1711.00513v3)

Published 1 Nov 2017 in cs.CL

Abstract: For machine translation to tackle discourse phenomena, models must have access to extra-sentential linguistic context. There has been recent interest in modelling context in neural machine translation (NMT), but models have been principally evaluated with standard automatic metrics, poorly adapted to evaluating discourse phenomena. In this article, we present hand-crafted, discourse test sets, designed to test the models' ability to exploit previous source and target sentences. We investigate the performance of recently proposed multi-encoder NMT models trained on subtitles for English to French. We also explore a novel way of exploiting context from the previous sentence. Despite gains using BLEU, multi-encoder models give limited improvement in the handling of discourse phenomena: 50% accuracy on our coreference test set and 53.5% for coherence/cohesion (compared to a non-contextual baseline of 50%). A simple strategy of decoding the concatenation of the previous and current sentence leads to good performance, and our novel strategy of multi-encoding and decoding of two sentences leads to the best performance (72.5% for coreference and 57% for coherence/cohesion), highlighting the importance of target-side context.

Authors (4)

Rachel Bawden
Rico Sennrich
Alexandra Birch
Barry Haddow

Citations (255)

Summary

The paper examines how well Neural Machine Translation (NMT) models handle discourse phenomena. It focuses on context-dependent translation problems, namely anaphoric pronoun translation (coreference) and lexical coherence/cohesion. Its primary aim is to evaluate whether NMT models can correctly exploit extra-sentential context, which is needed to translate such discourse-level elements accurately.

Overview

Neural machine translation typically translates sentences in isolation, ignoring the wider discourse context that often determines the correct translation. The paper points out that improvements on discourse phenomena are poorly captured by standard metrics such as BLEU, which reward n-gram overlap with a reference translation: a pronoun with the wrong gender changes only a few n-grams, so its effect on a corpus-level score is small. Multi-encoder NMT models, recently proposed as a way to feed such linguistic context into translation, are the main candidates examined.
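To make the BLEU point concrete, here is a minimal sketch, assuming whitespace tokenization, a single reference and no smoothing. It is a simplified re-implementation for illustration, not a standard toolkit such as sacreBLEU. The sentences are made up: with the context "The bee is busy.", the sentence "It is making honey." should use the feminine pronoun "elle", because "l'abeille" is feminine.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(hyp, ref, n):
    """Hypothesis n-gram matches, each capped at its count in the reference."""
    hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched / total if total else 0.0


def sentence_bleu(hyp, ref, max_n=4):
    """Unsmoothed sentence-level BLEU: one reference, whitespace tokens (simplified)."""
    precisions = [clipped_precision(hyp, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # any zero precision makes the geometric mean zero
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    brevity = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * geo_mean


# Made-up example: the antecedent "l'abeille" is feminine, so "elle" is correct.
reference = "elle fait du miel .".split()
wrong_pronoun = "il fait du miel .".split()
print(round(sentence_bleu(reference, reference), 3))      # 1.0
print(round(sentence_bleu(wrong_pronoun, reference), 3))  # 0.669
```

Swapping one pronoun, which breaks agreement with the antecedent, still leaves a sentence-level score of about 0.67. At corpus level, where n-gram counts are pooled over thousands of sentences, such an error moves the score far less.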

To measure how well different NMT models handle these phenomena, the researchers built hand-crafted test sets for coreference and for coherence/cohesion in English-to-French translation. The test sets are contrastive: each example pairs a source sentence and its context with a correct and an incorrect translation, and a model is credited when it scores the correct translation higher. This allows a focused examination of the models' ability to use context from preceding sentences on both the source and target side.
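The contrastive scoring can be sketched as follows. This is a minimal illustration under stated assumptions: `score` is a hypothetical stand-in for a model's score (for instance the log-probability of a translation given the source and its context), and the examples and toy scorers are invented, not taken from the paper's test sets. In this toy pair, the same current sentence appears with two contexts that call for different pronouns, so a scorer that ignores context can get at most one of the two right.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ContrastiveExample:
    context: str    # previous source sentence
    source: str     # current source sentence
    correct: str    # translation consistent with the context
    incorrect: str  # same translation with the context-dependent word changed


# (context, source, candidate translation) -> model score, e.g. a log-probability
Scorer = Callable[[str, str, str], float]


def contrastive_accuracy(examples: list[ContrastiveExample], score: Scorer) -> float:
    """Share of examples where the correct translation scores strictly higher."""
    wins = sum(
        score(ex.context, ex.source, ex.correct) > score(ex.context, ex.source, ex.incorrect)
        for ex in examples
    )
    return wins / len(examples)


# Made-up pair: same current sentence, antecedents with different French genders.
examples = [
    ContrastiveExample("Take the car.", "It is red.", "Elle est rouge.", "Il est rouge."),
    ContrastiveExample("Take the bike.", "It is red.", "Il est rouge.", "Elle est rouge."),
]


def toy_agnostic(context, source, translation):
    """Hypothetical scorer that ignores the context: always prefers the masculine pronoun."""
    return 1.0 if translation.startswith("Il") else 0.0


def toy_contextual(context, source, translation):
    """Crude hypothetical heuristic: feminine pronoun if the context mentions a car."""
    wants_feminine = "car" in context
    return 1.0 if translation.startswith("Elle") == wants_feminine else 0.0


print(contrastive_accuracy(examples, toy_agnostic))    # 0.5
print(contrastive_accuracy(examples, toy_contextual))  # 1.0
```

Because ties count as errors here (the comparison is strict), a context-blind scorer that gives both candidates the same score would get 0% rather than 50% on this pair.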

Methodology

Several NMT architectures were evaluated, including single and multi-encoder models. The baseline model, devoid of contextual input, served as a control, while contextual models incorporated additional inputs, such as prior source or target sentences. Key strategies examined included:

Single-Encoder with Input Concatenation (2-to-2 and 2-to-1 models): These models concatenate the previous sentence with the current one in the input. The 2-to-2 model is trained to output the translations of both the previous and the current sentence, while the 2-to-1 model outputs only the translation of the current sentence.
Multi-Encoder Architectures: A second encoder reads the previous sentence, and the outputs of the two encoders are combined by one of several strategies: concatenation of the attention context vectors, hierarchical attention, or attention gating.
Novel Strategy: A multi-encoder model that combines the two encoders with hierarchical attention and, like the 2-to-2 model, decodes the concatenation of the previous and current target sentences, so that target-side context is available to the decoder. This strategy gave the largest gains on the discourse test sets.
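The two concatenation set-ups can be illustrated with a small data-preparation sketch. The separator token `<CONCAT>` and the helper names are hypothetical, and the paper's actual preprocessing may mark sentence boundaries differently. The sketch assumes that, for the 2-to-2 model, only the part of the output after the separator is kept as the translation of the current sentence.

```python
# Hypothetical separator token marking the boundary between the two sentences.
SEP = "<CONCAT>"


def make_2to1(prev_src, cur_src, cur_tgt):
    """2-to-1 training pair: previous + current source in, current target out."""
    return f"{prev_src} {SEP} {cur_src}", cur_tgt


def make_2to2(prev_src, cur_src, prev_tgt, cur_tgt):
    """2-to-2 training pair: both source sentences in, both target sentences out."""
    return f"{prev_src} {SEP} {cur_src}", f"{prev_tgt} {SEP} {cur_tgt}"


def extract_current(output):
    """Keep only the text after the last separator of a 2-to-2 output."""
    # If the model did not emit a separator, fall back to the whole output.
    return output.rsplit(SEP, 1)[-1].strip()


src, tgt = make_2to2("Take the car.", "It is red.", "Prends la voiture.", "Elle est rouge.")
print(src)                   # Take the car. <CONCAT> It is red.
print(extract_current(tgt))  # Elle est rouge.
```

Because the 2-to-2 decoder generates the previous translation first, the choice of "la voiture" is already in its output history when it produces the pronoun of the current sentence.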

Results and Discussion

The results are mixed. Despite BLEU gains, multi-encoder models gave limited improvements on the discourse test sets: 50% accuracy on coreference and 53.5% on coherence/cohesion, against 50% for the non-contextual baseline. Simply decoding the concatenation of the previous and current sentences (2-to-2) performed well, and the paper's novel strategy, hierarchical attention over two encoders combined with decoding both sentences, performed best, reaching 72.5% on coreference and 57% on coherence/cohesion. These results underline the value of target-side context available to the decoder.
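The hierarchical combination of two encoders can be sketched in plain Python. This is a simplified, untrained illustration in the general style of hierarchical attention for multi-source models: it uses dot-product scoring with no learned projections, assumes the decoder state and the encoder annotations share one dimension, and uses toy vectors. A first attention level summarises each encoder separately; a second level weighs the two summaries against the decoder state. Decoding both target sentences, the other ingredient of the novel strategy, is not modelled here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys):
    """Dot-product attention: weights over `keys` and their weighted sum."""
    weights = softmax([dot(query, k) for k in keys])
    context = [sum(w * k[i] for w, k in zip(weights, keys)) for i in range(len(query))]
    return weights, context


def hierarchical_context(decoder_state, prev_annotations, cur_annotations):
    """Two-level attention over two encoders (simplified, no learned projections)."""
    # Level 1: attend separately within each encoder's hidden states.
    _, c_prev = attend(decoder_state, prev_annotations)
    _, c_cur = attend(decoder_state, cur_annotations)
    # Level 2: attend over the two per-encoder context vectors.
    beta, combined = attend(decoder_state, [c_prev, c_cur])
    return beta, combined


# Toy 2-dimensional vectors (made up for illustration).
state = [1.0, 0.0]
prev_enc = [[2.0, 0.0], [0.0, 1.0]]  # previous-sentence encoder states
cur_enc = [[0.0, 1.0], [0.0, 2.0]]   # current-sentence encoder states
beta, combined = hierarchical_context(state, prev_enc, cur_enc)
print([round(b, 3) for b in beta], [round(x, 3) for x in combined])
```

With this toy decoder state, the second-level weight on the previous-sentence encoder is the larger one; in a trained model these weights change at every decoding step, letting the decoder decide when context matters.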

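The authors attribute this to target-side context. A model that first decodes the previous sentence carries information about that translation (for example, the grammatical gender chosen for an antecedent) in the decoder's recurrent state when it generates the current sentence, whereas encoding only the previous source sentence never exposes the decoder to its own prior output. This suggests that giving the decoder access to target-side context is a promising research direction.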

Implications and Future Directions

The paper argues for evaluating the use of discourse-level context in NMT directly, since BLEU alone does not reveal whether context is being exploited. Its targeted test sets and architectural comparisons are relevant both to research on context-aware NMT and to practical translation systems. Future work might explore more sophisticated ways of integrating context, such as stream decoding, to further improve discourse-level translation accuracy.

Such work could lead to translation systems that handle context-sensitive phenomena, such as pronoun gender agreement and lexical consistency across sentences, more reliably.