
Should we Stop Training More Monolingual Models, and Simply Use Machine Translation Instead?

Published 21 Apr 2021 in cs.CL and cs.LG (arXiv:2104.10441v1)

Abstract: Most work in NLP makes the assumption that it is desirable to develop solutions in the native language in question. There is consequently a strong trend towards building native language models even for low-resource languages. This paper questions this development, and explores the idea of simply translating the data into English, thereby enabling the use of large-scale, pretrained English language models. We demonstrate empirically that a large English language model coupled with modern machine translation outperforms native language models in most Scandinavian languages. The exception to this is Finnish, which we assume is due to inferior translation quality. Our results suggest that machine translation is a mature technology, which raises a serious counter-argument against training native language models for low-resource languages. This paper therefore strives to make a provocative but important point: as English language models are improving at an unprecedented pace, which in turn improves machine translation, it is, from both an empirical and an environmental standpoint, more effective to translate data from low-resource languages into English than to build language models for such languages.
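To make the translate-then-model pipeline concrete, below is a minimal sketch. It assumes the Hugging Face transformers library, the Helsinki-NLP OPUS-MT Danish-to-English model, and an off-the-shelf English SST-2 sentiment classifier; these are illustrative stand-ins, not the machine translation systems, tasks, or language models evaluated in the paper.

    # Minimal sketch of the translate-then-use-English-models pipeline.
    # Assumptions (not from the paper): Hugging Face `transformers`, the
    # Helsinki-NLP OPUS-MT Danish-to-English model, and an English SST-2
    # sentiment classifier stand in for the paper's actual MT systems,
    # tasks, and language models.
    from transformers import pipeline

    # Step 1: translate the low-resource-language input into English.
    translator = pipeline("translation", model="Helsinki-NLP/opus-mt-da-en")

    # Step 2: apply a pretrained English model to the translated text.
    classifier = pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english",
    )

    def classify_via_english(text_da: str) -> dict:
        """Translate Danish text to English, then classify it with an English model."""
        english = translator(text_da)[0]["translation_text"]
        return classifier(english)[0]  # e.g. {"label": "POSITIVE", "score": 0.99}

    print(classify_via_english("Filmen var fantastisk og meget rørende."))

Under this scheme, only the translation step is language-specific; all downstream modeling is handled by the English model, which is the crux of the paper's argument.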

Citations: 22
