Unsupervised Statistical Machine Translation (1809.01272v1)

Published 4 Sep 2018 in cs.CL, cs.AI, and cs.LG

Abstract: While modern machine translation has relied on large parallel corpora, a recent line of work has managed to train Neural Machine Translation (NMT) systems from monolingual corpora only (Artetxe et al., 2018c; Lample et al., 2018). Despite the potential of this approach for low-resource settings, existing systems are far behind their supervised counterparts, limiting their practical interest. In this paper, we propose an alternative approach based on phrase-based Statistical Machine Translation (SMT) that significantly closes the gap with supervised systems. Our method profits from the modular architecture of SMT: we first induce a phrase table from monolingual corpora through cross-lingual embedding mappings, combine it with an n-gram language model, and fine-tune hyperparameters through an unsupervised MERT variant. In addition, iterative backtranslation improves results further, yielding, for instance, 14.08 and 26.22 BLEU points in WMT 2014 English-German and English-French, respectively, an improvement of more than 7-10 BLEU points over previous unsupervised systems, and closing the gap with supervised SMT (Moses trained on Europarl) down to 2-5 BLEU points. Our implementation is available at https://github.com/artetxem/monoses

Unsupervised Statistical Machine Translation

The paper "Unsupervised Statistical Machine Translation" by Artetxe et al. addresses the critical challenge of creating machine translation systems using solely monolingual corpora. Traditional approaches to machine translation have conventionally relied on extensive parallel datasets. However, this reliance poses a significant limitation in low-resource settings, where such corpora may not be readily available. The authors propose an innovative solution by leveraging the rigidity and modular architecture of Statistical Machine Translation (SMT) to form an effective unsupervised translation model.

The core contribution of this paper is a phrase-based SMT system that can be trained using only monolingual data. The methodology extends the skip-gram model to learn n-gram embeddings, which are aligned into a shared cross-lingual space through a self-learning approach and then used to induce a phrase table. The resulting phrase table is incorporated into a traditional SMT framework, together with an n-gram language model and a distortion model. The authors further introduce an unsupervised variant of Minimum Error Rate Training (MERT) for tuning the model weights, significantly enhancing translation quality.
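To make the induction step concrete, below is a minimal sketch of how phrase translation scores can be derived from cross-lingual n-gram embeddings: candidate translations are ranked by cosine similarity in the shared space and normalized with a temperature-controlled softmax. The function name, the temperature value, and the top_k cutoff are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def induce_phrase_table(src_vecs, tgt_vecs, temperature=0.1, top_k=100):
    # Hypothetical sketch of phrase-table induction from cross-lingual
    # n-gram embeddings (not the authors' exact implementation).
    # src_vecs: (S, d) source n-gram embeddings, already mapped into the
    # shared cross-lingual space; tgt_vecs: (T, d) target n-gram embeddings.
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)

    table = []
    for s in src:
        sims = tgt @ s                         # cosine similarity to every target n-gram
        cand = np.argsort(-sims)[:top_k]       # keep only the most similar candidates
        logits = sims[cand] / temperature
        probs = np.exp(logits - logits.max())  # numerically stable softmax
        probs /= probs.sum()
        table.append(list(zip(cand.tolist(), probs.tolist())))
    return table
```

In the full system, scores of this kind populate the translation probability features of the phrase table, which the SMT decoder then combines with the language and distortion models.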

A notable achievement of this system is its performance: it reaches 14.08 and 26.22 BLEU points on the WMT 2014 English-German and English-French translation tasks, respectively. These results improve on prior unsupervised systems by more than 7-10 BLEU points and close the gap with supervised SMT (Moses trained on Europarl) to within 2-5 points. This highlights the potential of phrase-based SMT in unsupervised settings, providing a viable alternative to neural machine translation (NMT) approaches that similarly aim to exploit monolingual data.

Iterative backtranslation plays a pivotal role in the system’s refinement, providing further performance improvements. By repeatedly generating synthetic parallel corpora, the system enhances its translation capabilities, approximating the benefits of genuine parallel training data. The authors’ systematic experimentation on established benchmarks provides robust support for their claims, underscoring the potential of this unsupervised methodology.
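As a rough illustration of this refinement loop, the sketch below alternates between the two translation directions, retraining each system on synthetic parallel data produced by the other. The names train_smt and translate are hypothetical stand-ins for the training and decoding steps, and the number of rounds is an arbitrary choice rather than the paper's setting.

```python
def iterative_backtranslation(src2tgt, tgt2src, mono_src, mono_tgt,
                              train_smt, translate, rounds=3):
    # src2tgt / tgt2src start as the systems induced from monolingual data
    # alone; train_smt(pairs) retrains a phrase-based system on a synthetic
    # parallel corpus, and translate(system, sentence) decodes with it.
    # All names here are illustrative, not the authors' API.
    for _ in range(rounds):
        # Backtranslate target monolingual text into the source language,
        # pairing each synthetic source sentence with its genuine target,
        # then retrain the forward model on this synthetic corpus.
        synth_src = [translate(tgt2src, t) for t in mono_tgt]
        src2tgt = train_smt(list(zip(synth_src, mono_tgt)))

        # Repeat in the opposite direction with the updated forward model.
        synth_tgt = [translate(src2tgt, s) for s in mono_src]
        tgt2src = train_smt(list(zip(synth_tgt, mono_src)))

    return src2tgt, tgt2src
```

Because each round trains on synthetic data produced by a stronger model than the round before, translation quality can improve monotonically until the two directions stabilize.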

The implications of this work are substantial: it addresses the needs of the many language pairs for which parallel corpora are scarce or non-existent. It also sets a promising direction for future developments in machine translation, particularly in extending the framework to semi-supervised scenarios where small amounts of parallel data might be available. Moreover, the prospect of using unsupervised SMT systems as a preliminary step to create synthetic parallel data for training more sophisticated NMT systems opens new avenues for hybrid systems that capitalize on the strengths of both approaches.

In summary, Artetxe et al. present a compelling case for phrase-based SMT in unsupervised machine translation. Their method substantially outperforms recent unsupervised NMT approaches, positioning it as a strong contender in the advancement of translation technologies for under-resourced languages. Future work could explore the synergistic possibilities of combining SMT and NMT methodologies to further advance unsupervised translation.

Authors (3)
  1. Mikel Artetxe (52 papers)
  2. Gorka Labaka (15 papers)
  3. Eneko Agirre (53 papers)
Citations (246)