Universal Neural Machine Translation for Extremely Low Resource Languages (1802.05368v2)

Published 15 Feb 2018 in cs.CL

Abstract: In this paper, we propose a new universal machine translation approach focusing on languages with a limited amount of parallel data. Our proposed approach utilizes a transfer-learning approach to share lexical and sentence level representations across multiple source languages into one target language. The lexical part is shared through a Universal Lexical Representation to support multilingual word-level sharing. The sentence-level sharing is represented by a model of experts from all source languages that share the source encoders with all other languages. This enables the low-resource language to utilize the lexical and sentence representations of the higher resource languages. Our approach is able to achieve 23 BLEU on Romanian-English WMT2016 using a tiny parallel corpus of 6k sentences, compared to the 18 BLEU of a strong baseline system which uses multilingual training and back-translation. Furthermore, we show that the proposed approach can achieve almost 20 BLEU on the same dataset through fine-tuning a pre-trained multilingual system in a zero-shot setting.

Universal Neural Machine Translation for Extremely Low Resource Languages

The paper, "Universal Neural Machine Translation for Extremely Low Resource Languages," addresses a critical challenge faced by Neural Machine Translation (NMT) systems: the lack of sufficient parallel data for many language pairs. The authors propose a universal NMT approach leveraging transfer learning to enhance translation capabilities for languages with limited resources, through sharing lexical and sentence representations across multiple source languages converging into a single target language.

Central to their approach are two components: the Universal Lexical Representation (ULR) and the Mixture of Language Experts (MoLE). ULR enables multilingual word-level sharing by mapping words from different languages into a universal token space: monolingual embeddings are aligned into a shared semantic space, so semantically similar words from different languages receive similar representations. MoLE handles sentence-level sharing: per-language expert networks operate on the shared encoder representations, and a gating mechanism mixes their outputs, allowing low-resource languages to draw on sentence-level knowledge learned from high-resource data.
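To make the two components concrete, here is a minimal NumPy sketch of the ideas behind ULR and MoLE. The function names, matrix names, and shapes are illustrative assumptions, not the paper's exact parameterization; in the paper the query projection, universal embeddings, experts, and gate are learned jointly inside the NMT model rather than applied to random vectors.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def universal_lexical_representation(query_emb, universal_keys, universal_values, temperature=0.05):
    """ULR idea (sketch): a source word's monolingual embedding, projected into a
    shared semantic space, attends over a fixed set of universal key embeddings;
    the word is then represented as the attention-weighted mix of the corresponding
    universal value embeddings, so similar words from different languages end up
    with similar representations."""
    scores = universal_keys @ query_emb / temperature       # (num_universal_tokens,)
    weights = softmax(scores)
    return weights @ universal_values                        # (d_model,)

def mixture_of_language_experts(sentence_repr, expert_weights, gate_proj):
    """MoLE idea (sketch): each language-specific expert transforms the shared
    sentence representation; a gating vector mixes the expert outputs, letting a
    low-resource source language borrow from experts shaped by high-resource data."""
    expert_outputs = np.stack([np.tanh(W @ sentence_repr) for W in expert_weights])  # (n_experts, d_model)
    gates = softmax(gate_proj @ sentence_repr)               # one gate value per expert
    return gates @ expert_outputs                            # (d_model,)

# Illustrative shapes only: 5 universal tokens, 4 language experts, model dimension 8.
rng = np.random.default_rng(0)
d = 8
q = rng.normal(size=d)                        # projected monolingual word embedding
U_k = rng.normal(size=(5, d))                 # universal key embeddings
U_v = rng.normal(size=(5, d))                 # universal value embeddings
print(universal_lexical_representation(q, U_k, U_v).shape)   # (8,)

s = rng.normal(size=d)                        # shared encoder sentence representation
experts = [rng.normal(size=(d, d)) for _ in range(4)]
G = rng.normal(size=(4, d))                   # gating projection
print(mixture_of_language_experts(s, experts, G).shape)      # (8,)
```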

The authors evaluate their method on Romanian-English, Korean-English, and Latvian-English language pairs in extremely low-resource settings. Notably, the proposed model achieves a BLEU score of 23 on Romanian-English translation using a corpus of only 6,000 sentences, outperforming a strong multilingual baseline with back-translation that reaches 18 BLEU. The model is also effective in a zero-shot setting, reaching nearly 20 BLEU on the same data by fine-tuning a pre-trained multilingual system.

The paper highlights several important implications. Practically, it extends the applicability of NMT to languages with minimal parallel data, potentially broadening the global reach of language technologies. Theoretically, it provides a framework for further exploring semantic alignment and cross-lingual representation learning in machine translation.

Moving forward, the research could stimulate developments in unsupervised and minimally supervised machine translation, exploring avenues like zero-resource applications, unsupervised dictionary alignment, and meta-learning. The strategies demonstrated could synergize with advancements in unsupervised learning, leading to more robust systems that can effectively exploit monolingual data for translation tasks. Ultimately, this work contributes a foundational step towards more inclusive language technologies, ensuring linguistic diversity receives broader computational support.

Authors (4)
  1. Jiatao Gu (83 papers)
  2. Hany Hassan (11 papers)
  3. Jacob Devlin (24 papers)
  4. Victor O. K. Li (56 papers)
Citations (266)