
Incorporating BERT into Neural Machine Translation (2002.06823v1)

Published 17 Feb 2020 in cs.CL

Abstract: The recently proposed BERT has shown great power on a variety of natural language understanding tasks, such as text classification, reading comprehension, etc. However, how to effectively apply BERT to neural machine translation (NMT) lacks enough exploration. While BERT is more commonly used as fine-tuning instead of contextual embedding for downstream language understanding tasks, in NMT, our preliminary exploration of using BERT as contextual embedding is better than using for fine-tuning. This motivates us to think how to better leverage BERT for NMT along this direction. We propose a new algorithm named BERT-fused model, in which we first use BERT to extract representations for an input sequence, and then the representations are fused with each layer of the encoder and decoder of the NMT model through attention mechanisms. We conduct experiments on supervised (including sentence-level and document-level translations), semi-supervised and unsupervised machine translation, and achieve state-of-the-art results on seven benchmark datasets. Our code is available at \url{https://github.com/bert-nmt/bert-nmt}.

Incorporating BERT into Neural Machine Translation

The paper investigates the integration of BERT, a prominent pre-trained language model, into Neural Machine Translation (NMT). The authors propose a BERT-fused model that utilizes BERT's ability to provide context-aware representations, aiming to enhance NMT performance across various settings, including supervised, semi-supervised, and unsupervised translation tasks.

Research Motivation

BERT has significantly improved results in natural language understanding tasks; however, its application in NMT remains underexplored. Traditional strategies like using BERT for initializing NMT models or as input embeddings have shown limited success. This paper explores an alternative approach by fusing BERT representations with the encoder and decoder layers of the NMT model through attention mechanisms.

Methodology

The proposed BERT-fused model extracts representations from BERT and incorporates them into every layer of the NMT encoder and decoder. Attention mechanisms blend BERT's features with the NMT model's own sequential processing, and because the fusion happens through attention rather than direct input substitution, the design also sidesteps the mismatch between BERT's tokenization and the NMT model's. Additionally, the paper introduces a drop-net trick that regularizes training by randomly selecting, at each layer, the BERT-attention output, the self-attention output, or their average, thereby reducing overfitting.
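To make the fusion concrete, here is a minimal PyTorch-style sketch of a single BERT-fused encoder layer with the drop-net rule. Module names, dimensions, and the normalization arrangement are illustrative assumptions, not the authors' released implementation (see the linked repository for that); attention masks are omitted for brevity.

```python
import torch
import torch.nn as nn


class BertFusedEncoderLayer(nn.Module):
    """Sketch of one BERT-fused encoder layer (names are hypothetical).

    The layer attends over its own hidden states (self-attention) and
    over frozen BERT representations of the source sentence
    (BERT-attention), then combines the two branches via drop-net
    before the feed-forward sub-layer.
    """

    def __init__(self, d_model: int, d_bert: int, n_heads: int,
                 p_drop_net: float = 1.0):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads,
                                               batch_first=True)
        # kdim/vdim let the layer attend over BERT features of a
        # different hidden width than the NMT model.
        self.bert_attn = nn.MultiheadAttention(d_model, n_heads,
                                               kdim=d_bert, vdim=d_bert,
                                               batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model),
                                 nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.p_drop_net = p_drop_net  # drop-net rate p in [0, 1]

    def forward(self, x: torch.Tensor, bert_out: torch.Tensor) -> torch.Tensor:
        # x:        (batch, nmt_len, d_model) NMT hidden states
        # bert_out: (batch, bert_len, d_bert) BERT representations
        h_self, _ = self.self_attn(x, x, x)
        h_bert, _ = self.bert_attn(x, bert_out, bert_out)

        if self.training:
            # Drop-net: with prob p/2 keep only one branch,
            # otherwise average the two.
            u = torch.rand(1).item()
            if u < self.p_drop_net / 2:
                h = h_self
            elif u > 1 - self.p_drop_net / 2:
                h = h_bert
            else:
                h = 0.5 * (h_self + h_bert)
        else:
            # At inference, use the expectation: the average of both branches.
            h = 0.5 * (h_self + h_bert)

        x = self.norm1(x + h)
        x = self.norm2(x + self.ffn(x))
        return x
```

Because the BERT branch attends over BERT's own token sequence, the NMT side never needs to share BERT's tokenization. In the full model, every encoder and decoder layer carries such a branch, and the decoder additionally retains its usual encoder-decoder attention.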

Experimental Evaluation

The authors conduct extensive experiments across different translation tasks to validate their approach:

  • Supervised NMT: Experiments cover low-resource (IWSLT) and high-resource (WMT) scenarios. The BERT-fused model achieves a BLEU score of 36.11 on the IWSLT'14 De-to-En task, surpassing previous methods.
  • Document-Level NMT: Leveraging BERT's ability to capture cross-sentence context, the model improves translation by incorporating document-level information.
  • Semi-Supervised NMT: The approach yields significant improvements when combined with back-translation, achieving state-of-the-art results on the WMT'16 Romanian-to-English task.
  • Unsupervised NMT: The model outperforms previous unsupervised methods on En↔Fr and En↔Ro by leveraging XLM, a cross-lingual variant of BERT.

Results and Implications

The BERT-fused model consistently outperformed baselines across all tested datasets, setting new benchmarks for various translation tasks. This work illustrates the practical benefit of combining pre-trained language models with NMT, reaffirming BERT's utility beyond language understanding alone. The authors provide a public code repository, facilitating reproducibility and further research.

Future Directions

The authors acknowledge potential improvements in reducing the additional storage and inference time costs introduced by BERT integration. Future work could focus on optimizing such trade-offs and exploring BERT incorporation in other AI domains, such as question answering. Additionally, investigating model compression techniques could lead to more efficient deployment of BERT-enhanced NMT systems.

This research contributes substantially to the field of NMT by effectively exploiting pre-trained language models, particularly highlighting BERT's potential to enhance translation quality through architectural integration.

Authors (8)
  1. Jinhua Zhu (28 papers)
  2. Yingce Xia (53 papers)
  3. Lijun Wu (113 papers)
  4. Di He (108 papers)
  5. Tao Qin (201 papers)
  6. Wengang Zhou (153 papers)
  7. Houqiang Li (236 papers)
  8. Tie-Yan Liu (242 papers)
Citations (329)
GitHub: https://github.com/bert-nmt/bert-nmt