
Unsupervised Machine Translation Using Monolingual Corpora Only (1711.00043v2)

Published 31 Oct 2017 in cs.CL and cs.AI

Abstract: Machine translation has recently achieved impressive performance thanks to recent advances in deep learning and the availability of large-scale parallel corpora. There have been numerous attempts to extend these successes to low-resource language pairs, yet requiring tens of thousands of parallel sentences. In this work, we take this research direction to the extreme and investigate whether it is possible to learn to translate even without any parallel data. We propose a model that takes sentences from monolingual corpora in two different languages and maps them into the same latent space. By learning to reconstruct in both languages from this shared feature space, the model effectively learns to translate without using any labeled data. We demonstrate our model on two widely used datasets and two language pairs, reporting BLEU scores of 32.8 and 15.1 on the Multi30k and WMT English-French datasets, without using even a single parallel sentence at training time.

Unsupervised Machine Translation Using Monolingual Corpora Only

The paper "Unsupervised Machine Translation Using Monolingual Corpora Only" by Guillaume Lample, Alexis Conneau, Ludovic Denoyer, and Marc'Aurelio Ranzato presents an innovative approach to machine translation (MT) that operates without any parallel sentence pairs during the training process. This research addresses significant challenges in low-resource languages by leveraging available monolingual corpora to construct efficient translation models.

The proposed method trains a translation model that maps sentences from the two languages into a common latent space, enabling translation without parallel data. Beyond demonstrating that translation can be learned without direct source-target mappings, the work establishes a performance benchmark for fully unsupervised systems.

Model Architecture and Training

The architecture consists of a single encoder and a single decoder shared across both languages, differentiated only by language-specific lookup tables (word embeddings). The model builds on sequence-to-sequence architectures with attention mechanisms to improve encoding and decoding.
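As a rough illustration of this weight sharing, the sketch below builds one recurrent encoder and one decoder whose only language-specific parameters are the embedding tables and output projections. This is a minimal PyTorch sketch under stated assumptions, not the authors' implementation: the class and parameter names are invented, the mean-pooled context is a simple stand-in for the attention mechanism, and the hyperparameters are arbitrary.

```python
import torch
import torch.nn as nn

class SharedSeq2Seq(nn.Module):
    """Sketch of the shared encoder/decoder: one set of recurrent weights
    serves both languages; only the lookup tables (and output layers) differ."""

    def __init__(self, vocab_sizes, emb_dim=300, hid_dim=300):
        super().__init__()
        # One embedding table per language (the only language-specific input parameters).
        self.embeddings = nn.ModuleDict({
            lang: nn.Embedding(size, emb_dim) for lang, size in vocab_sizes.items()
        })
        # Encoder and decoder weights are shared across languages.
        self.encoder = nn.LSTM(emb_dim, hid_dim, batch_first=True, bidirectional=True)
        self.decoder = nn.LSTM(emb_dim + 2 * hid_dim, hid_dim, batch_first=True)
        self.out = nn.ModuleDict({
            lang: nn.Linear(hid_dim, size) for lang, size in vocab_sizes.items()
        })

    def encode(self, tokens, lang):
        emb = self.embeddings[lang](tokens)
        states, _ = self.encoder(emb)          # latent representations z
        return states

    def decode(self, z, tokens, lang):
        emb = self.embeddings[lang](tokens)
        # Mean-pooled context fed at every decoder step: a crude stand-in
        # for the attention mechanism used in the paper.
        context = z.mean(dim=1, keepdim=True).expand(-1, emb.size(1), -1)
        hidden, _ = self.decoder(torch.cat([emb, context], dim=-1))
        return self.out[lang](hidden)          # logits over the target vocabulary
```

For instance, `SharedSeq2Seq({"en": 20000, "fr": 20000})` instantiates a single encoder/decoder pair that serves both English and French, switching only the lookup tables per language.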

Core to the method is the training process, which combines three primary objectives (a simplified sketch follows the list):

  1. Denoising Auto-Encoding: Training the model to reconstruct a sentence from a noisy version of itself. This step ensures the model captures linguistic features specific to each language.
  2. Cross-Domain Learning: Leveraging translations generated by the current iteration of the model to further refine and improve the translation quality. This process involves reconstructing a sentence in the source language from a noisy translation in the target language, thereby iteratively enhancing the model's dual translation capabilities.
  3. Adversarial Training: Aligning the latent representations of sentences from both languages using a discriminator. This adversarial component ensures that the encoder's output space for both languages remains closely aligned, facilitating more accurate decoding.
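The sketch below shows how these three terms could be combined into a single loss for one language direction, reusing the `SharedSeq2Seq` interface from the earlier snippet. It is a hedged approximation rather than the authors' code: `add_noise` mimics the paper's word-drop-and-shuffle corruption, `back_translate` stands in for the frozen previous-iteration model used for cross-domain reconstruction, `discriminator` is any module mapping latent vectors to two language logits, and the loss weight is illustrative. Teacher-forcing shifts, padding masks, and the discriminator's own training step are omitted for brevity.

```python
import random
import torch
import torch.nn.functional as F

def add_noise(tokens, drop_prob=0.1, window=3):
    """Corruption in the spirit of the paper: drop words with probability
    drop_prob and lightly shuffle the remainder within a small window."""
    kept = [t for t in tokens if random.random() > drop_prob] or tokens[:1]
    keys = [i + random.uniform(0, window) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept))]

def corrupt_batch(batch, pad_id=0):
    """Apply add_noise row-wise to a padded LongTensor, keeping the shape."""
    out = torch.full_like(batch, pad_id)
    for i, row in enumerate(batch.tolist()):
        noisy = add_noise([t for t in row if t != pad_id])
        out[i, :len(noisy)] = torch.tensor(noisy)
    return out

def reconstruction_loss(logits, target):
    """Token-level cross-entropy against the clean sentence."""
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), target.reshape(-1))

def unsupervised_loss(model, discriminator, batch, lang, other,
                      back_translate, lambda_adv=1.0):
    """Simplified loss for one language; in practice the same terms are
    computed for both languages and summed."""
    # 1. Denoising auto-encoding: reconstruct the sentence from a noisy copy.
    z = model.encode(corrupt_batch(batch), lang)
    loss_auto = reconstruction_loss(model.decode(z, batch, lang), batch)

    # 2. Cross-domain: translate with the previous iteration of the model,
    #    corrupt the result, and reconstruct the original sentence from it.
    y = back_translate(batch, src=lang, tgt=other)
    z_cross = model.encode(corrupt_batch(y), other)
    loss_cross = reconstruction_loss(model.decode(z_cross, batch, lang), batch)

    # 3. Adversarial: the encoder is updated to fool a discriminator that
    #    predicts which language a latent vector came from.
    lang_ids = {"en": 0, "fr": 1}          # illustrative language labels
    pred = discriminator(z.mean(dim=1))
    fooled = torch.full((z.size(0),), lang_ids[other], dtype=torch.long)
    loss_adv = F.cross_entropy(pred, fooled)

    return loss_auto + loss_cross + lambda_adv * loss_adv
```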

Experimental Evaluation

The method was evaluated on two notable datasets: WMT (2014 and 2016) and Multi30k-Task1, covering English-French and English-German translations. Results after three iterations showed impressive BLEU scores:

  • English-French: 32.76 on Multi30k-Task1; 15.05 on WMT'14.
  • English-German: 26.26 on Multi30k-Task1; 13.33 on WMT'16.

These results are particularly noteworthy as they approach the quality achieved by supervised MT systems trained on up to 100,000 parallel sentences. This signifies a substantial accomplishment given the unsupervised nature of the training process.

Baseline Comparisons

Several baselines were considered in the paper:

  • Word-by-Word Translation (WBW): Translates each word in isolation with a bilingual dictionary inferred from monolingual data (a minimal sketch appears after this list).
  • Word Reordering (WR): Enhances WBW by locally reordering the translated words according to a language model score.
  • Oracle Word Reordering (OWR): Provides an upper bound by assuming a perfect reordering of the word-by-word translation is available.
  • Supervision-Based Models: Conventional supervised training with access to parallel corpora.
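For intuition, the WBW baseline can be approximated with a simple dictionary lookup, as in the short sketch below; the toy dictionary is invented for illustration, whereas the actual baseline relies on a dictionary inferred from monolingual word embeddings.

```python
def word_by_word(sentence, dictionary):
    """Word-by-word baseline: translate each token via a bilingual dictionary,
    keeping tokens that have no entry (e.g. names and numbers) unchanged."""
    return " ".join(dictionary.get(tok, tok) for tok in sentence.split())

# Hypothetical toy dictionary; the paper infers one without parallel data.
en_fr = {"the": "le", "cat": "chat", "sleeps": "dort"}
print(word_by_word("the cat sleeps", en_fr))  # -> "le chat dort"
```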

While these baselines provided useful reference points, none matched the performance of the proposed method, underscoring the potential of fully unsupervised translation systems.

Implications and Future Directions

This research carries several important implications. Practically, it paves the way for building translation models for languages with scarce parallel corpora. Theoretically, it reinforces the idea that aligning latent representations across languages, combined with adversarial training, is a viable route to learning cross-lingual mappings without supervision.

Future developments might involve:

  • Extending the framework to more varied and lower-resource language pairs.
  • Integrating more sophisticated noise models and data augmentation techniques to further enhance denoising auto-encoder objectives.
  • Exploring the integration of Byte Pair Encoding (BPE) to handle issues related to out-of-vocabulary words and improve translation quality.

This paper thus not only sets a precedent in the domain of machine translation but also opens multiple avenues for further enhancing and scaling unsupervised learning methodologies in natural language processing.

Authors (4)
  1. Guillaume Lample (31 papers)
  2. Alexis Conneau (33 papers)
  3. Ludovic Denoyer (51 papers)
  4. Marc'Aurelio Ranzato (53 papers)
Citations (1,065)