Unsupervised Neural Machine Translation (1710.11041v2)

Published 30 Oct 2017 in cs.CL, cs.AI, and cs.LG

Abstract: In spite of the recent success of neural machine translation (NMT) in standard benchmarks, the lack of large parallel corpora poses a major practical problem for many language pairs. There have been several proposals to alleviate this issue with, for instance, triangulation and semi-supervised learning techniques, but they still require a strong cross-lingual signal. In this work, we completely remove the need of parallel data and propose a novel method to train an NMT system in a completely unsupervised manner, relying on nothing but monolingual corpora. Our model builds upon the recent work on unsupervised embedding mappings, and consists of a slightly modified attentional encoder-decoder model that can be trained on monolingual corpora alone using a combination of denoising and backtranslation. Despite the simplicity of the approach, our system obtains 15.56 and 10.21 BLEU points in WMT 2014 French-to-English and German-to-English translation. The model can also profit from small parallel corpora, and attains 21.81 and 15.24 points when combined with 100,000 parallel sentences, respectively. Our implementation is released as an open source project.

Unsupervised Neural Machine Translation

The paper "Unsupervised Neural Machine Translation" by Mikel Artetxe, Gorka Labaka, Eneko Agirre, and Kyunghyun Cho addresses a critical challenge in the field of NMT: the dependency on large parallel corpora. While NMT has shown significant advancements over SMT, the requirement for large-scale parallel datasets remains a major limitation, especially for low-resource languages.

Summary of the Approach

The authors propose a novel unsupervised NMT method that eliminates the need for parallel corpora by exploiting monolingual data through two key techniques: denoising and backtranslation. The system builds on unsupervised cross-lingual embedding mappings and uses a modified attentional encoder-decoder framework, with an encoder shared across both languages and fixed cross-lingual embeddings, enabling bilingual training without any explicit parallel data.
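
As a rough illustration of this setup, the sketch below shows one way the shared encoder with frozen cross-lingual embeddings and per-language decoders could be wired up in PyTorch. The class name, hidden sizes, and the omission of attention are assumptions made for brevity; this is not the authors' released implementation.

```python
import torch.nn as nn

class SharedEncoderNMT(nn.Module):
    """Minimal sketch of the architecture described in the paper: one encoder
    shared by both languages over fixed (frozen) cross-lingual embeddings, and
    one decoder per language. Attention and beam search are omitted."""

    def __init__(self, l1_embeddings, l2_embeddings, hidden_size=600):
        super().__init__()
        # Pre-trained cross-lingual embeddings stay fixed during training.
        self.l1_emb = nn.Embedding.from_pretrained(l1_embeddings, freeze=True)
        self.l2_emb = nn.Embedding.from_pretrained(l2_embeddings, freeze=True)
        emb_dim = l1_embeddings.size(1)

        # A single bidirectional GRU encoder is shared across both languages.
        self.encoder = nn.GRU(emb_dim, hidden_size, num_layers=2,
                              bidirectional=True, batch_first=True)

        # Each language keeps its own decoder (the "dual structure").
        self.l1_decoder = nn.GRU(emb_dim, 2 * hidden_size, batch_first=True)
        self.l2_decoder = nn.GRU(emb_dim, 2 * hidden_size, batch_first=True)

    def encode(self, token_ids, lang):
        """Encode a batch of token ids from either language with the shared encoder."""
        emb = self.l1_emb if lang == "l1" else self.l2_emb
        outputs, _ = self.encoder(emb(token_ids))
        return outputs  # intended to be language-independent representations
```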

Key Components

  1. Shared Encoder with Fixed Cross-Lingual Embeddings: The system uses a universal encoder for both languages, based on cross-lingual word embeddings trained independently and mapped to a shared space. These embeddings remain fixed during training.
  2. Dual Structure: The model handles both translation directions simultaneously, leveraging the dual nature of translation tasks (e.g., French↔English).
  3. Denoising: To prevent the system from learning a degenerate copying behavior, the authors introduce noise by randomly swapping adjacent words in the input sentence, forcing the encoder to learn meaningful, language-independent representations (a rough sketch of this corruption appears after this list).
  4. On-the-Fly Backtranslation: The system creates pseudo-parallel sentence pairs by translating sentences from one language into the other with the current state of the model. These synthetic pairs are then used as training data, so the model refines itself iteratively on increasingly realistic translations as training progresses (see the training-step sketch after this list).
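
To make the last two components concrete, here is a hedged sketch of how one denoising-plus-backtranslation training step could look. The swap probability, the list-of-sentences batching, and the `model.seq2seq_loss` / `model.translate` interface are placeholders introduced purely for illustration; they are not the paper's released code.

```python
import random
import torch

def add_noise(tokens, swap_prob=0.5):
    """Denoising corruption: randomly swap adjacent words so the encoder
    cannot get away with simply copying its input (swap_prob is an assumed value)."""
    noisy = list(tokens)
    i = 0
    while i < len(noisy) - 1:
        if random.random() < swap_prob:
            noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
            i += 2  # skip the swapped pair so a word moves at most once
        else:
            i += 1
    return noisy

def training_step(model, sents_l1, sents_l2, optimizer):
    """One combined update over both objectives (schematic placeholder interface)."""
    optimizer.zero_grad()

    # (1) Denoising: reconstruct each sentence from its noised version,
    #     using the shared encoder and the decoder of the same language.
    loss = model.seq2seq_loss([add_noise(s) for s in sents_l1],
                              sents_l1, src="l1", tgt="l1")
    loss = loss + model.seq2seq_loss([add_noise(s) for s in sents_l2],
                                     sents_l2, src="l2", tgt="l2")

    # (2) On-the-fly backtranslation: translate with the current model
    #     (no gradients through generation), then learn to recover the
    #     original sentence from the synthetic translation.
    with torch.no_grad():
        pseudo_l2 = model.translate(sents_l1, src="l1", tgt="l2")
        pseudo_l1 = model.translate(sents_l2, src="l2", tgt="l1")
    loss = loss + model.seq2seq_loss(pseudo_l2, sents_l1, src="l2", tgt="l1")
    loss = loss + model.seq2seq_loss(pseudo_l1, sents_l2, src="l1", tgt="l2")

    loss.backward()
    optimizer.step()
    return loss.item()
```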

Results

The system achieved notable results, attaining BLEU scores of 15.56 for French→English and 10.21 for German→English on the WMT 2014 benchmarks using only monolingual data. These results significantly outperform a baseline system that relies on word-by-word substitution, demonstrating the model's capacity to capture non-trivial translation relations and produce fluent translations.

When further combined with a small parallel corpus of 100,000 sentences, the model's performance improved to 21.81 and 15.24 BLEU points for French→English and German→English, respectively. This surpasses a comparable NMT system trained on that parallel corpus alone, illustrating the system's effectiveness in leveraging limited parallel data.

Implications and Future Directions

The implications of this research are profound for both practical and theoretical domains in NMT. Practically, the ability to train effective NMT systems without parallel corpora opens new possibilities for translating low-resource languages and creating more equitable AI applications. Theoretically, it showcases the potential of leveraging monolingual corpora through innovative training techniques, such as denoising and backtranslation, to learn complex cross-lingual mappings.

Future research could focus on several aspects:

  • Relaxing Constraints: Progressive relaxation of the fixed cross-lingual embeddings and shared encoder constraints during training could be explored to enhance performance.
  • Incorporating Character-Level Information: Addressing rare word translation and named entities systematically by integrating character-level details might mitigate some observed adequacy issues.
  • Alternative Denoising Functions: Investigating other neighborhood functions for denoising could provide insights, especially in contexts with high typological divergence between languages.

In conclusion, the proposed unsupervised NMT method marks a significant step towards more accessible and efficient machine translation by leveraging monolingual data and innovative training paradigms. Despite the promising results, the paper recognizes that there is substantial room for further optimization and refinement, paving the way for future advancements in the field.
