Multilingual Pretraining Using a Large Corpus Machine-Translated from a Single Source Language (2410.23956v2)

Published 31 Oct 2024 in cs.CL

Abstract: English, as a very high-resource language, enables the pretraining of high-quality LLMs. The same cannot be said for most other languages, as leading LLMs still underperform for non-English languages, likely due to a gap in the quality and diversity of the available multilingual pretraining corpora. In this work, we find that machine-translated text from a single high-quality source language can contribute significantly to the pretraining of multilingual LLMs. We translate FineWeb-Edu, a high-quality English web dataset, into French, German, and Spanish, resulting in a final 300B-token dataset, which we call TransWeb-Edu, and pretrain a 1.3B-parameter model, CuatroLLM, from scratch on this dataset. Across five non-English reasoning tasks, we show that CuatroLLM matches or outperforms state-of-the-art multilingual models trained using closed data, such as Llama3.2 and Gemma2, despite using an order of magnitude less data, e.g., about 6% of the tokens used for Llama3.2's training. We further demonstrate that with additional domain-specific pretraining, amounting to less than 1% of TransWeb-Edu, CuatroLLM surpasses the state of the art in multilingual reasoning. To promote reproducibility, we release our corpus, models, and training pipeline under open licenses at hf.co/britLLM/CuatroLLM.
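The abstract describes building TransWeb-Edu by machine-translating FineWeb-Edu into French, German, and Spanish before pretraining CuatroLLM. The sketch below is only an illustration of what such a corpus-translation step could look like with off-the-shelf tooling; the choice of MT model (NLLB-200 here), the streaming setup, and the naive chunking are assumptions for illustration, not the authors' released pipeline (which is available at hf.co/britLLM/CuatroLLM).

```python
# Illustrative sketch of a corpus-translation step like the one described in
# the abstract. NOT the authors' pipeline: the MT model (NLLB-200), chunk size,
# and document limit are placeholder assumptions for demonstration only.

from datasets import load_dataset
from transformers import pipeline

# Stream FineWeb-Edu so the full corpus never has to fit in memory.
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)

# One translator per target language; NLLB-200 is a stand-in open MT model.
targets = {"fra_Latn": "French", "deu_Latn": "German", "spa_Latn": "Spanish"}
translators = {
    code: pipeline(
        "translation",
        model="facebook/nllb-200-distilled-600M",
        src_lang="eng_Latn",
        tgt_lang=code,
    )
    for code in targets
}

def translate_document(text: str, code: str, max_chars: int = 2000) -> str:
    """Translate a document chunk by chunk (naive fixed-width chunking)."""
    chunks = [text[i : i + max_chars] for i in range(0, len(text), max_chars)]
    outputs = translators[code](chunks, max_length=512)
    return " ".join(o["translation_text"] for o in outputs)

# Smoke test: translate a handful of documents into each target language.
for i, example in enumerate(fineweb_edu):
    if i >= 3:
        break
    for code, name in targets.items():
        translated = translate_document(example["text"], code)
        print(f"[{name}] {translated[:200]}...")
```

At the scale reported in the abstract (300B tokens across three languages), a real pipeline would shard the corpus and batch translation across many accelerators; the per-document loop above is only meant to make the data flow concrete.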

