
Multilingual Language Model Pretraining using Machine-translated Data (2502.13252v1)

Published 18 Feb 2025 in cs.CL

Abstract: High-resource languages, such as English, enable the pretraining of high-quality LLMs. The same cannot be said for most other languages, as LLMs still underperform for non-English languages, likely due to a gap in the quality and diversity of the available multilingual pretraining corpora. In this work, we find that machine-translated texts from a single high-quality source language can contribute significantly to the pretraining quality of multilingual LLMs. We translate FineWeb-Edu, a high-quality English web dataset, into nine languages, resulting in a 1.7-trillion-token dataset, which we call TransWebEdu, and pretrain a 1.3B-parameter model, TransWebLLM, from scratch on this dataset. Across nine non-English reasoning tasks, we show that TransWebLLM matches or outperforms state-of-the-art multilingual models trained using closed data, such as Llama3.2, Qwen2.5, and Gemma, despite using an order of magnitude less data. We demonstrate that adding less than 5% of TransWebEdu as domain-specific pretraining data sets a new state of the art in Arabic, Italian, Indonesian, Swahili, and Welsh understanding and commonsense reasoning tasks. To promote reproducibility, we release our corpus, models, and training pipeline under Open Source Initiative-approved licenses.
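
The core step in the recipe described above is corpus translation: take an English web corpus (FineWeb-Edu) and machine-translate it into the target languages before pretraining. The snippet below is a minimal sketch of that step, not the authors' pipeline; it assumes the Hugging Face `datasets` and `transformers` libraries and uses NLLB-200 as a stand-in MT model (the paper's actual translation system is not specified on this page), translating a few streamed documents into Swahili, one of the languages named in the abstract.

```python
# Minimal sketch of the corpus-translation step, under the assumptions stated above.
# NLLB-200 (distilled) is an illustrative open MT model, not necessarily the one the paper used.
from datasets import load_dataset
from transformers import pipeline

# Stream a small sample of FineWeb-Edu instead of downloading the full corpus.
fineweb_edu = load_dataset(
    "HuggingFaceFW/fineweb-edu", name="sample-10BT", split="train", streaming=True
)

# Translation pipeline: English -> Swahili, using NLLB-200 language codes.
translator = pipeline(
    "translation",
    model="facebook/nllb-200-distilled-600M",
    src_lang="eng_Latn",
    tgt_lang="swh_Latn",
)

for i, doc in enumerate(fineweb_edu):
    if i >= 3:  # translate only a handful of documents for illustration
        break
    # A real pipeline would chunk long documents to fit the MT model's context window.
    translated = translator(doc["text"][:1000], max_length=512)
    print(translated[0]["translation_text"][:200])
```

In the full setup the abstract describes, a pass like this would be run for each of the nine target languages, and the translated shards would then be tokenized and mixed into the pretraining data.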

Authors (8)
  1. Jiayi Wang (74 papers)
  2. Yao Lu (212 papers)
  3. Maurice Weber (15 papers)
  4. Max Ryabinin (29 papers)
  5. David Adelani (7 papers)
  6. Yihong Chen (34 papers)
  7. Raphael Tang (32 papers)
  8. Pontus Stenetorp (68 papers)
