On Using Monolingual Corpora in Neural Machine Translation (1503.03535v2)

Published 11 Mar 2015 in cs.CL

Abstract: Recent work on end-to-end neural network-based architectures for machine translation has shown promising results for En-Fr and En-De translation. Arguably, one of the major factors behind this success has been the availability of high quality parallel corpora. In this work, we investigate how to leverage abundant monolingual corpora for neural machine translation. Compared to a phrase-based and hierarchical baseline, we obtain up to $1.96$ BLEU improvement on the low-resource language pair Turkish-English, and $1.59$ BLEU on the focused domain task of Chinese-English chat messages. While our method was initially targeted toward such tasks with less parallel data, we show that it also extends to high resource languages such as Cs-En and De-En where we obtain an improvement of $0.39$ and $0.47$ BLEU scores over the neural machine translation baselines, respectively.

Authors (9)
  1. Caglar Gulcehre (71 papers)
  2. Orhan Firat (80 papers)
  3. Kelvin Xu (25 papers)
  4. Kyunghyun Cho (292 papers)
  5. Huei-Chi Lin (1 paper)
  6. Fethi Bougares (18 papers)
  7. Holger Schwenk (35 papers)
  8. Yoshua Bengio (601 papers)
  9. Loic Barrault (4 papers)
Citations (556)

Summary

On Using Monolingual Corpora in Neural Machine Translation

The paper investigates the integration of monolingual corpora into neural machine translation (NMT) systems, focusing on improving translation quality, especially for low-resource language pairs. Acknowledging the cost of acquiring high-quality parallel corpora, the authors propose leveraging abundant monolingual data to enhance NMT performance.

Methodology

The researchers introduce two methods for integrating a language model (LM) trained on monolingual data into an NMT system: shallow fusion and deep fusion.

  • Shallow Fusion combines the scores of the NMT model and the LM at decoding time (during beam search), with a tunable weight balancing the two models' contributions.
  • Deep Fusion concatenates the hidden state of the pre-trained LM, scaled by a learned controller gate, with the hidden state of the NMT decoder, and fine-tunes the output layer so that information from both sources is integrated dynamically.

These methods aim to exploit the linguistic structure present in monolingual corpora to improve translation performance.
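The sketch below illustrates both ideas in a minimal form, assuming PyTorch; the names (`shallow_fusion_score`, `DeepFusionOutput`), tensor dimensions, and the value of `beta` are illustrative choices rather than details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def shallow_fusion_score(nmt_log_probs, lm_log_probs, beta=0.2):
    """Shallow fusion: re-score beam candidates with a weighted sum of
    log-probabilities from the translation model and the LM.

    nmt_log_probs, lm_log_probs: tensors of shape (beam, vocab).
    beta: hyper-parameter balancing the two models (tuned on dev data).
    """
    return nmt_log_probs + beta * lm_log_probs


class DeepFusionOutput(nn.Module):
    """Deep fusion (sketch): the frozen LM's hidden state is scaled by a
    learned scalar gate, concatenated with the decoder state, and fed to
    the output projection; only this layer and the gate are fine-tuned."""

    def __init__(self, dec_dim, lm_dim, vocab_size):
        super().__init__()
        self.gate = nn.Linear(lm_dim, 1)              # controller on the LM state
        self.proj = nn.Linear(dec_dim + lm_dim, vocab_size)

    def forward(self, dec_state, lm_state):
        g = torch.sigmoid(self.gate(lm_state))        # gate per time step
        fused = torch.cat([dec_state, g * lm_state], dim=-1)
        return F.log_softmax(self.proj(fused), dim=-1)


# Toy usage with random states (dimensions are illustrative only)
dec_state = torch.randn(4, 512)    # NMT decoder hidden states, batch of 4
lm_state = torch.randn(4, 1024)    # pre-trained LM hidden states
out = DeepFusionOutput(512, 1024, 30000)(dec_state, lm_state)
print(out.shape)                   # torch.Size([4, 30000])
```

In shallow fusion only the weight `beta` needs tuning on development data, whereas deep fusion learns the gate and output projection while keeping the pre-trained LM fixed.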

Experimental Results

The experimental evaluation covers several language pairs: Turkish-English (Tr-En), Chinese-English (Zh-En), German-English (De-En), and Czech-English (Cs-En). The results show:

  • An improvement of up to 1.96 BLEU points on Tr-En using deep fusion, and 1.59 BLEU points on the focused-domain Zh-En chat-message task.
  • In the high-resource settings (Cs-En, De-En), deep fusion improves over the NMT baselines by 0.39 and 0.47 BLEU points, respectively.

These enhancements indicate that the proposed approaches are not limited to low-resource scenarios, demonstrating their broad applicability.

Analysis

Performance improvements correlate with the similarity between the domain of the monolingual corpus and that of the target translation task. Domains with higher similarity, such as news articles for De-En, benefited more from the supplemental LM. This suggests potential for further gains through domain adaptation.

Implications and Future Work

The research underscores the utility of monolingual data in situations where parallel corpora are scarce, proposing a viable path to boost NMT across various contexts. The paper opens avenues for further exploration in domain adaptation techniques and enhanced LM integration strategies, which could yield even greater gains in translation quality. The insights from this work can inform ongoing developments in AI and language processing, enhancing cross-linguistic communication capabilities.