Cross-lingual Retrieval for Iterative Self-Supervised Training (2006.09526v2)

Published 16 Jun 2020 in cs.CL and cs.LG

Abstract: Recent studies have demonstrated the cross-lingual alignment ability of multilingual pretrained LLMs. In this work, we found that the cross-lingual alignment can be further improved by training seq2seq models on sentence pairs mined using their own encoder outputs. We utilized these findings to develop a new approach -- cross-lingual retrieval for iterative self-supervised training (CRISS), where mining and training processes are applied iteratively, improving cross-lingual alignment and translation ability at the same time. Using this method, we achieved state-of-the-art unsupervised machine translation results on 9 language directions with an average improvement of 2.4 BLEU, and on the Tatoeba sentence retrieval task in the XTREME benchmark on 16 languages with an average improvement of 21.5% in absolute accuracy. Furthermore, CRISS also brings an additional 1.8 BLEU improvement on average compared to mBART, when finetuned on supervised machine translation downstream tasks.
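
The mine-then-train loop the abstract describes can be made concrete with a short sketch. Below is a minimal Python illustration, assuming a generic `model.encode` / `model.train_on` interface standing in for the paper's mBART encoder and seq2seq finetuning; the ratio-margin scoring follows Artetxe & Schwenk (2019), which CRISS builds on, and the `threshold` and `k` values here are illustrative placeholders, not the paper's settings.

```python
import numpy as np

def normalize(m):
    """L2-normalize rows so dot products are cosine similarities."""
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def margin_scores(src_emb, tgt_emb, k=4):
    """Ratio-margin score (Artetxe & Schwenk, 2019) for every
    source/target pair: cos(x, y) divided by the average similarity
    of x and y to their k nearest neighbors in the other language."""
    sim = src_emb @ tgt_emb.T                           # cosine matrix
    knn_x = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # x's avg top-k sim
    knn_y = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # y's avg top-k sim
    return sim / ((knn_x[:, None] + knn_y[None, :]) / 2)

def mine_pairs(encode, src_sents, tgt_sents, threshold=1.0):
    """Mine pseudo-parallel pairs using the model's own encoder
    outputs. `encode` is a placeholder for the multilingual encoder;
    `threshold` is illustrative, not the value used in the paper."""
    src_emb = normalize(encode(src_sents))
    tgt_emb = normalize(encode(tgt_sents))
    scores = margin_scores(src_emb, tgt_emb)
    best = scores.argmax(axis=1)                        # best target per source
    return [(src_sents[i], tgt_sents[j])
            for i, j in enumerate(best) if scores[i, j] > threshold]

def criss(model, src_mono, tgt_mono, n_iters=3):
    """Iterate mining and training: each round's mined pairs finetune
    the seq2seq model, whose improved encoder then mines better pairs
    in the next round. `model.encode` and `model.train_on` are
    hypothetical interfaces, not the paper's actual API."""
    for _ in range(n_iters):
        pairs = mine_pairs(model.encode, src_mono, tgt_mono)
        model.train_on(pairs)                           # seq2seq MT finetuning
    return model
```

The key design point the sketch captures is the feedback loop: training on mined pairs improves the encoder's cross-lingual alignment, which in turn raises the quality of the pairs mined in the next iteration.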

Authors (4)
  1. Chau Tran (13 papers)
  2. Yuqing Tang (12 papers)
  3. Xian Li (115 papers)
  4. Jiatao Gu (83 papers)
Citations (72)