Should we Stop Training More Monolingual Models, and Simply Use Machine Translation Instead? (2104.10441v1)
Abstract: Most work in NLP assumes that it is desirable to develop solutions in the native language in question, and there is consequently a strong trend towards building native LLMs even for low-resource languages. This paper questions that development and explores the idea of simply translating the data into English, thereby enabling the use of pretrained, large-scale English LLMs. We demonstrate empirically that a large English LLM coupled with modern machine translation outperforms native LLMs in most Scandinavian languages. The exception is Finnish, which we assume is due to inferior translation quality. Our results suggest that machine translation is a mature technology, which raises a serious counter-argument against training native LLMs for low-resource languages. This paper therefore strives to make a provocative but important point. As English LLMs are improving at an unprecedented pace, which in turn improves machine translation, it is, from an empirical and environmental standpoint, more effective to translate data from low-resource languages into English than to build LLMs for such languages.
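To make the translate-then-classify idea concrete, here is a minimal sketch of the pipeline the paper argues for: run low-resource-language text through an off-the-shelf machine translation model, then apply a pretrained English model to the translated text. The specific model names and the sentiment task below are illustrative assumptions, not the exact setup evaluated in the paper.

```python
# Sketch: translate low-resource-language text into English, then reuse a
# pretrained English classifier, instead of training a native-language model.
# Assumes the Hugging Face `transformers` library (plus torch/sentencepiece).
from transformers import pipeline

# Hypothetical choice of MT model: Danish -> English (Helsinki-NLP OPUS-MT).
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-da-en")

# Hypothetical choice of English model: sentiment classifier tuned on SST-2.
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

danish_reviews = [
    "Filmen var fantastisk, jeg elskede den.",   # "The film was fantastic, I loved it."
    "Det var spild af tid og penge.",            # "It was a waste of time and money."
]

# Translate each review into English, then classify the translation.
english_reviews = [out["translation_text"] for out in translator(danish_reviews)]
for original, translated in zip(danish_reviews, english_reviews):
    print(original, "->", translated, "->", classifier(translated)[0])
```

The paper's empirical claim is that, for most Scandinavian languages, the second stage (a large pretrained English model) is strong enough that this two-step pipeline beats a model trained natively on the source language, with Finnish as the noted exception.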