Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language (2410.23956v2)
Abstract: English, as a very high-resource language, enables the pretraining of high-quality LLMs. The same cannot be said for most other languages, as leading LLMs still underperform for non-English languages, likely due to a gap in the quality and diversity of the available multilingual pretraining corpora. In this work, we find that machine-translated text from a single high-quality source language can contribute significantly to the pretraining of multilingual LLMs. We translate FineWeb-Edu, a high-quality English web dataset, into French, German, and Spanish, resulting in a final 300B-token dataset, which we call TransWeb-Edu, and pretrain a 1.3B-parameter model, CuatroLLM, from scratch on this dataset. Across five non-English reasoning tasks, we show that CuatroLLM matches or outperforms state-of-the-art multilingual models trained on closed data, such as Llama 3.2 and Gemma 2, despite using an order of magnitude less data (about 6% of the tokens used for Llama 3.2's training). We further demonstrate that with additional domain-specific pretraining, amounting to less than 1% of TransWeb-Edu, CuatroLLM surpasses the state of the art in multilingual reasoning. To promote reproducibility, we release our corpus, models, and training pipeline under open licenses at hf.co/britLLM/CuatroLLM.
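The corpus construction described above amounts to machine-translating an English web dataset into each target language before pretraining. Below is a minimal illustrative sketch of that translation step in Python; the FineWeb-Edu dataset ID and the choice of NLLB-200 as the translation model are assumptions made here for illustration, and the actual TransWeb-Edu pipeline (MT system, batching, document handling) may differ.

```python
# Illustrative sketch only: translate English FineWeb-Edu documents into
# French, German, and Spanish. The dataset ID and the NLLB-200 model are
# stand-in assumptions, not necessarily what the paper used.
from datasets import load_dataset
from transformers import pipeline

# Stream the English source corpus so nothing is downloaded in full.
fineweb = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

# One translator per target language (simple but loads the model three times).
targets = {"fr": "fra_Latn", "de": "deu_Latn", "es": "spa_Latn"}
translators = {
    lang: pipeline(
        "translation",
        model="facebook/nllb-200-distilled-600M",  # hypothetical MT model choice
        src_lang="eng_Latn",
        tgt_lang=code,
        max_length=512,
    )
    for lang, code in targets.items()
}

# Translate a handful of documents into each target language.
for i, doc in enumerate(fineweb):
    if i >= 3:  # keep the sketch cheap
        break
    text = doc["text"][:1000]  # truncate long documents for illustration
    for lang, translator in translators.items():
        translated = translator(text)[0]["translation_text"]
        print(f"[{lang}] {translated[:120]}...")
```

In practice, a production pipeline would shard the corpus, translate at the sentence or chunk level, and run a far larger MT model in batches; the sketch only shows the shape of the source-to-target mapping that produces the multilingual pretraining data.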
References:
- GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- Tower: An open multilingual large language model for translation-related tasks. arXiv preprint arXiv:2402.17733.
- Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041.
- SmolLM-Corpus.
- Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR.
- Natural language processing with Python: analyzing text with the natural language toolkit. O’Reilly Media, Inc.
- PIQA: Reasoning about physical commonsense in natural language. In Thirty-Fourth AAAI Conference on Artificial Intelligence.
- Findings of the 2014 workshop on statistical machine translation. In Proceedings of the ninth workshop on statistical machine translation, pages 12–58.
- Findings of the 2016 conference on machine translation (WMT16). In First Conference on Machine Translation, pages 131–198. Association for Computational Linguistics.
- Andrei Z Broder. 1997. On the resemblance and containment of documents. In Proceedings. Compression and Complexity of SEQUENCES 1997 (Cat. No. 97TB100171), pages 21–29. IEEE.
- Tom B. Brown. 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.
- Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457.
- A. Conneau. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
- XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
- Tri Dao. 2023. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.
- The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- Do multilingual language models think better in English? In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 550–564, Mexico City, Mexico. Association for Computational Linguistics.
- CroissantLLM: A truly bilingual French-English language model. arXiv preprint arXiv:2402.00786.
- MiniCPM: Unveiling the potential of small language models with scalable training strategies. arXiv preprint arXiv:2404.06395.
- Mistral 7B. arXiv preprint arXiv:2310.06825.
- Johannes Welbl, Nelson F. Liu, and Matt Gardner. 2017. Crowdsourcing multiple choice science questions.
- The state and fate of linguistic diversity and inclusion in the NLP world. arXiv preprint arXiv:2004.09095.
- Andrej Karpathy. 2022. NanoGPT. https://github.com/karpathy/nanoGPT.
- Andrej Karpathy. 2024. llm.c. https://github.com/karpathy/llm.c.
- Preliminary WMT24 ranking of general MT systems and LLMs. arXiv preprint arXiv:2407.19884.
- Quality at a glance: An audit of web-crawled multilingual datasets. Transactions of the Association for Computational Linguistics, 10:50–72.
- Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles.
- Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. arXiv preprint arXiv:2307.16039.
- Unsupervised machine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043.
- BLOOM: A 176B-parameter open-access multilingual language model.
- Deduplicating training data makes language models better. arXiv preprint arXiv:2107.06499.
- TruthfulQA: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
- FineWeb-Edu.
- Rephrasing the web: A recipe for compute and data-efficient language modeling. arXiv preprint arXiv:2401.16380.
- Understanding and mitigating language confusion in LLMs. arXiv preprint arXiv:2406.20052.
- EuroLLM: Multilingual language models for Europe. arXiv preprint arXiv:2409.16235.
- Meta. 2024. Llama 3.2: Revolutionizing edge AI and vision with open, customizable models. ai.meta.com. [Accessed 15-10-2024].
- Fine-tuning large language models for adaptive machine translation. arXiv preprint arXiv:2312.12740.
- Scaling data-constrained language models. Advances in Neural Information Processing Systems, 36.
- BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318.
- The FineWeb datasets: Decanting the web for the finest text data at scale. Preprint, arXiv:2406.17557.
- The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116.
- Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446.
- mGPT: Few-shot learners go multilingual. arXiv preprint arXiv:2204.07580.
- Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053.
- Antoine Simoulin and Benoit Crabbé. 2021. Un modèle transformer génératif pré-entrainé pour le français. In Traitement Automatique des Langues Naturelles, pages 246–255. ATALA.
- Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
- Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118.
- Together Computer. 2023. RedPajama: An open dataset for training large language models.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1112–1122, New Orleans, Louisiana. Association for Computational Linguistics.
- Qwen2 technical report. arXiv preprint arXiv:2407.10671.
- PAWS-X: A cross-lingual adversarial dataset for paraphrase identification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3687–3692, Hong Kong, China. Association for Computational Linguistics.
- MAmmoTH2: Scaling instructions from the web. arXiv preprint arXiv:2405.03548.
- HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
- TinyLlama: An open-source small language model. arXiv preprint arXiv:2401.02385.