The LIG system for the English-Czech Text Translation Task of IWSLT 2019 (1911.02898v1)
Abstract: In this paper, we present our submission for the English to Czech Text Translation Task of IWSLT 2019. Our system aims to study how pre-trained language models, used as input embeddings, can improve a specialized machine translation system trained on little data. To this end, we implemented a Transformer-based encoder-decoder neural system that can use the output of a pre-trained language model as input embeddings, and we compared its performance under three configurations: 1) without any pre-trained language model (constrained), 2) using a language model trained on the monolingual parts of the allowed English-Czech data (constrained), and 3) using a language model trained on a large quantity of external monolingual data (unconstrained). We used BERT as the external pre-trained language model (configuration 3), and the BERT architecture to train our own language model (configuration 2). Regarding the training data, we trained our MT system on a small quantity of parallel text: one set consists only of the provided MuST-C corpus, and the other consists of the MuST-C corpus plus the News Commentary corpus from WMT. We observed that using the external pre-trained BERT improves our system's scores by +0.8 to +1.5 BLEU on our development set, and by +0.97 to +1.94 BLEU on the test set. However, using our own language model trained only on the allowed parallel data seems to improve machine translation performance only when the system is trained on the smallest dataset.
- Loïc Vial (5 papers)
- Benjamin Lecouteux (14 papers)
- Didier Schwab (23 papers)
- Hang Le (9 papers)
- Laurent Besacier (76 papers)
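To make the described setup concrete, below is a minimal sketch (not the authors' code) of configuration 3: a frozen pre-trained BERT model provides contextual source-side embeddings that are fed into a standard Transformer encoder-decoder trained on the parallel data. The model name, dimensions, and class/parameter names are illustrative assumptions, using PyTorch and the Hugging Face transformers library.

```python
# Sketch: feeding pre-trained BERT hidden states into a Transformer MT model
# as source input embeddings (configuration 3 in the abstract). Names and
# hyperparameters are assumptions, not the authors' actual implementation.
import torch
import torch.nn as nn
from transformers import BertModel

class BertEmbedMT(nn.Module):
    def __init__(self, tgt_vocab_size, d_model=768, nhead=8, num_layers=6):
        super().__init__()
        # Frozen pre-trained language model supplying contextual source embeddings
        self.bert = BertModel.from_pretrained("bert-base-cased")
        for p in self.bert.parameters():
            p.requires_grad = False
        # Standard Transformer encoder-decoder trained on the small parallel corpus
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.tgt_embed = nn.Embedding(tgt_vocab_size, d_model)
        self.out_proj = nn.Linear(d_model, tgt_vocab_size)

    def forward(self, src_ids, src_attention_mask, tgt_ids):
        # BERT hidden states replace a learned source embedding table
        with torch.no_grad():
            src_emb = self.bert(input_ids=src_ids,
                                attention_mask=src_attention_mask).last_hidden_state
        tgt_emb = self.tgt_embed(tgt_ids)
        causal = self.transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        dec = self.transformer(src_emb, tgt_emb, tgt_mask=causal,
                               src_key_padding_mask=(src_attention_mask == 0))
        return self.out_proj(dec)  # logits over the Czech target vocabulary
```

Freezing BERT in this sketch keeps the comparison focused on the embeddings it provides; whether the authors fine-tuned the language model is not stated in the abstract.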