
Data Augmentation for Low-Resource Neural Machine Translation (1705.00440v1)

Published 1 May 2017 in cs.CL

Abstract: The quality of a Neural Machine Translation system depends substantially on the availability of sizable parallel corpora. For low-resource language pairs this is not the case, resulting in poor translation quality. Inspired by work in computer vision, we propose a novel data augmentation approach that targets low-frequency words by generating new sentence pairs containing rare words in new, synthetically created contexts. Experimental results on simulated low-resource settings show that our method improves translation quality by up to 2.9 BLEU points over the baseline and up to 3.2 BLEU over back-translation.


The paper by Fadaee, Bisazza, and Monz explores a data augmentation technique aimed at improving neural machine translation (NMT) for low-resource language pairs. The primary challenge in this domain is the scarcity of parallel corpora, which traditionally inhibits the effective training of NMT models.

Key Contributions

The authors introduce a novel data augmentation methodology inspired by techniques in computer vision. The approach targets low-frequency words by generating new sentence pairs that place these rare words in synthetically created contexts. The rationale is that rare words occur too infrequently for reliable parameter estimation in NMT models, a problem that low-resource settings exacerbate.

Methodology

The proposed technique, termed Translation Data Augmentation (TDA), alters sentences from the existing parallel corpus by substituting words with rare counterparts on both the source and target sides while preserving translation equivalence. The main steps, illustrated by a code sketch after this list, are:

  1. Targeted Word Selection: Identifying rare words, i.e., vocabulary items whose corpus frequency falls below a chosen threshold.
  2. Rare Word Substitution: Using LSTM language models to propose plausible rare-word substitutions in the context of existing sentences, ensuring fluency and grammatical correctness.
  3. Translation Selection: Aligning words using automatic methods to maintain translation fidelity, relying on a bilingual lexicon and word alignment probabilities.
  4. Sampling: Iteratively creating new sentences while ensuring diverse rare word occurrences.
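
To make these steps concrete, here is a minimal, self-contained sketch of one TDA-style augmentation pass. It is an illustration under strong simplifying assumptions, not the paper's implementation: the toy corpus, frequency threshold, and lexicon are all hypothetical, a bigram count stands in for the forward/backward LSTM language models of step 2, and a hand-written lexicon with position-identity alignment stands in for the alignment-based translation selection of step 3.

```python
# Toy sketch of Translation Data Augmentation (TDA).
# Hypothetical stand-ins: bigram counts replace the LSTM language models,
# and a hand-written lexicon with position-identity alignment replaces
# alignment-derived translation probabilities.
from collections import Counter
import random

# Toy parallel corpus: (source, target) = (English, German) token lists.
corpus = [
    ("the cat sat on the mat".split(), "die katze sass auf der matte".split()),
    ("the cat sat on the chair".split(), "die katze sass auf dem stuhl".split()),
    ("a dog ran in the park".split(), "ein hund lief in dem park".split()),
]

# Step 1: targeted word selection -- source words below a count threshold.
FREQ_THRESHOLD = 2
counts = Counter(w for src, _ in corpus for w in src)
rare_words = [w for w, c in counts.items() if c < FREQ_THRESHOLD]

# Crude fluency model for step 2: bigram counts over the source corpus.
bigrams = Counter(
    (src[i], src[i + 1]) for src, _ in corpus for i in range(len(src) - 1)
)

# Step 3 stand-in: a bilingual lexicon (the paper derives translation
# candidates from word alignments and lexical translation probabilities).
lexicon = {"dog": "hund", "park": "park", "chair": "stuhl", "ran": "lief"}

def augment(src, tgt, rng):
    """Substitute one rare word into a plausible source position, then
    patch the target word assumed to be aligned at the same position."""
    candidates = []
    for i in range(1, len(src)):
        for rare in rare_words:
            # Fluency proxy: the rare word must have been observed after
            # the preceding word somewhere in the corpus.
            if rare != src[i] and rare in lexicon and bigrams[(src[i - 1], rare)] > 0:
                candidates.append((i, rare))
    if not candidates:
        return None
    i, rare = rng.choice(candidates)  # step 4: sample one substitution
    new_src = src[:i] + [rare] + src[i + 1:]
    new_tgt = tgt[:i] + [lexicon[rare]] + tgt[i + 1:]
    return " ".join(new_src), " ".join(new_tgt)

rng = random.Random(0)
for src, tgt in corpus:
    pair = augment(src, tgt, rng)
    if pair:
        print(pair)
```

In the actual method, trained LSTM language models propose and rank rare substitutions in context, and automatically induced word alignments determine which word in the counterpart sentence to replace; the bigram filter and positional alignment above merely mimic those components at toy scale.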

Experimental Setup and Results

Experiments were conducted in simulated low-resource settings using data from the WMT15 English-German task. The results showed BLEU improvements of up to 2.9 points over the baseline and up to 3.2 points over back-translation, with TDA proving particularly effective at increasing the generation of rare words during translation. The authors also observed that altering multiple words per sentence was marginally more effective than single substitutions, suggesting that covering a broader set of rare words is beneficial.

Implications and Future Directions

The implications of this research are substantial for the development of NMT systems for low-resource languages, where manual collection of large parallel corpora is infeasible. The approach allows rare words to be modeled more thoroughly, improving the reliability of their alignment and translation.

Future directions could explore further refinement of the language models to improve substitution quality, potentially incorporating more advanced neural architectures or additional linguistic features. Expanding the scope to more diverse languages and incorporating multilingual resources could also enhance the robustness and applicability of this approach.

The research contributes to the theoretical understanding of data augmentation in NLP and offers practical avenues for enhancing NMT performance in low-resource settings, paving the way for more inclusive global language representation in AI models.

Authors (3)
  1. Marzieh Fadaee (40 papers)
  2. Arianna Bisazza (43 papers)
  3. Christof Monz (53 papers)
Citations (449)