- The paper introduces a method that learns a linear mapping between monolingual word-embedding spaces to generate and extend translation dictionaries, achieving approximately 51% precision@5 on English-to-Spanish word translation.
- It leverages large monolingual corpora with Skip-gram and CBOW models to learn distributed word representations that capture semantic and syntactic patterns.
- The approach is scalable and beneficial for low-resource languages by requiring only a small bilingual seed dictionary to align vector spaces.
Exploiting Similarities among Languages for Machine Translation
The paper "Exploiting Similarities among Languages for Machine Translation," authored by Tomas Mikolov, Quoc V. Le, and Ilya Sutskever from Google Inc., introduces an innovative approach to automating the generation and extension of dictionaries and phrase tables central to statistical machine translation (SMT) systems. Leveraging large monolingual datasets, the proposed method seeks to fill in missing word and phrase translations by modeling language structures and learning linear mappings between languages from small bilingual datasets.
Methodological Overview
The core strategy hinges on distributed representations of words, which are used to learn a linear projection between vector spaces representing different languages. This process consists of two main stages:
- Monolingual Model Training: Using substantial monolingual text corpora, word representations are learned through models like Skip-gram and Continuous Bag-of-Words (CBOW).
- Bilingual Mapping: A small bilingual dictionary is utilized to learn a linear mapping between the language vector spaces.
Once the mapping is learned, it can project any word vector from the source language into the target language vector space; the target word whose vector lies closest to the projected vector is selected as the translation. The paper illustrates this concept with a PCA visualization showing similar geometric arrangements of word vectors for numbers and animals in English and Spanish, suggesting that an accurate linear mapping can be learned because both spaces encode the same real-world concepts.
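A minimal sketch of this translation step, assuming a learned mapping W and a matrix of target-language word vectors (all variable and function names here are illustrative, not from the paper; the paper does use cosine similarity for the nearest-neighbor search):

```python
import numpy as np

def translate(src_vec, W, tgt_matrix, tgt_words, k=5):
    """Project a source-language vector through W, then return the k
    target words whose vectors are most similar to the projection."""
    z = W @ src_vec                          # map into the target space
    z = z / np.linalg.norm(z)                # normalize for cosine similarity
    norms = np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
    sims = (tgt_matrix / norms) @ z          # cosine similarity to each target word
    top = np.argsort(-sims)[:k]              # indices of the k best candidates
    return [(tgt_words[i], float(sims[i])) for i in top]
```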
Models and Training
The Skip-gram and CBOW models, depicted graphically in the paper, are central to the proposed method. Skip-gram is trained to predict the surrounding context words of a given word, while CBOW predicts a word from its context. The computational efficiency of both models allows them to scale to corpora of billions of words, capturing significant semantic information and linguistic regularities, including linear relationships among words.
When trained on large datasets, these models capture extensive semantic relations in the vector space itself: simple vector arithmetic such as vector("king") - vector("man") + vector("woman") yields a point close to vector("queen").
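As a concrete illustration, such embeddings can be trained and probed with gensim's Word2Vec (a sketch assuming gensim 4.x; the toy corpus and hyperparameters are placeholders, not the paper's setup, which used its own implementation on billions of words):

```python
from gensim.models import Word2Vec

# Placeholder corpus: an iterable of tokenized sentences. The paper
# trained on monolingual corpora several orders of magnitude larger.
corpus = [
    ["the", "king", "rules"], ["the", "queen", "reigns"],
    ["a", "man", "walks"], ["a", "woman", "walks"],
]

model = Word2Vec(
    corpus,
    vector_size=300,  # dimensionality of the word vectors
    window=5,         # context window size
    sg=1,             # 1 = Skip-gram, 0 = CBOW
    min_count=1,      # keep all words in this toy corpus
)

# Probing the linear regularity king - man + woman ~ queen
# (meaningful only with enough training data, not on this toy corpus)
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```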
Translation via Linear Mapping
The paper formalizes the translation problem as finding a transformation matrix $W$ that aligns the source and target language word vectors. Given a seed dictionary of word pairs with source vectors $x_i$ and target vectors $z_i$, the projection is learned by minimizing the squared reconstruction error $\min_W \sum_{i=1}^{n} \lVert W x_i - z_i \rVert^2$, which the authors optimize with stochastic gradient descent.
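A minimal sketch of learning $W$ from the seed dictionary; this solves the same objective in closed form via ordinary least squares rather than the paper's stochastic gradient descent, and the names are illustrative:

```python
import numpy as np

def learn_mapping(X, Z):
    """Learn W minimizing sum_i ||W x_i - z_i||^2 over seed dictionary pairs.

    X: (n, d_src) matrix of source-word vectors from the seed dictionary
    Z: (n, d_tgt) matrix of the corresponding target-word vectors
    Returns W of shape (d_tgt, d_src) so that W @ x approximates z.
    """
    # Solve X @ W^T = Z in the least-squares sense (closed form, no SGD)
    W_T, *_ = np.linalg.lstsq(X, Z, rcond=None)
    return W_T.T
```

With a seed dictionary of roughly 5,000 word pairs, as in the paper, this is a standard multivariate regression; the learned W then applies to every source word, including those outside the seed dictionary.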
Experimental Evaluation
Various experiments were conducted using WMT11 datasets and large corpora, demonstrating the method's efficacy across different languages, including English, Spanish, and Czech. Key findings include:
- Effectiveness: Achieved precision@5 of approximately 51% for English to Spanish translations using the proposed linear transformation.
- Robustness: Demonstrated capability in translating between linguistically distant languages (e.g., English to Czech).
- Scalability: Translation accuracy improved as the size of the monolingual training corpora increased.
Two tables in the paper detail, respectively, the sizes of the monolingual training datasets and the accuracy of the different translation methods.
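The precision@5 metric counts a test word as correctly translated if its reference translation appears among the top five candidates. A sketch of that evaluation, reusing the hypothetical translate function from the earlier snippet:

```python
def precision_at_k(test_pairs, W, tgt_matrix, tgt_words, k=5):
    """Fraction of (source vector, reference word) pairs whose reference
    translation appears in the top-k candidates of the projected search."""
    hits = 0
    for src_vec, reference in test_pairs:
        candidates = [w for w, _ in translate(src_vec, W, tgt_matrix, tgt_words, k=k)]
        hits += reference in candidates
    return hits / len(test_pairs)
```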
Practical and Theoretical Implications
The implications of this research are multifaceted:
- Dictionaries and Phrase Tables: The method provides a means to enhance and expand existing dictionaries and SMT phrase tables; the similarity between a projected vector and its nearest target vectors also yields a confidence score for filtering and augmenting entries.
- Low-Resource Languages: By needing only a small bilingual dictionary, the approach holds promise for low-resource languages and domains with scant parallel corpora.
- Semantic Understanding: Leveraging distributed representations that inherently capture semantic and syntactic relationships offers a robust framework for machine translation beyond mere word-for-word substitution.
Future Directions
Potential avenues for future research include:
- Application to Other Languages: Further exploration into additional language pairs, particularly those with significantly different linguistic structures.
- Enhanced Models: Development of more sophisticated models that can better capture nuances in infrequent words and complex phrase structures.
- Integration with Existing Systems: Combining this approach with current SMT systems to assess holistic performance improvements in real-world scenarios.
In conclusion, the paper introduces a methodologically solid and computationally efficient technique for enhancing machine translation by exploiting the intrinsic similarities among languages through distributed word representations and linear mappings. The empirical results substantiate the method's utility and pave the way for significant advancements in the development of multilingual NLP systems.