
Breaking Language Barriers: Cross-Lingual Continual Pre-Training at Scale (2407.02118v2)

Published 2 Jul 2024 in cs.CL

Abstract: In recent years, LLMs have made significant strides towards Artificial General Intelligence. However, training these models from scratch requires substantial computational resources and vast amounts of text data. In this paper, we explore an alternative approach to constructing an LLM for a new language by continual pre-training (CPT) from existing pretrained LLMs, instead of using randomly initialized parameters. Based on parallel experiments on 40 model sizes ranging from 40M to 5B parameters, we find that 1) CPT converges faster and saves significant resources in a scalable manner; 2) CPT adheres to an extended scaling law derived from Hoffmann et al. (2022) with a joint data-parameter scaling term; 3) the compute-optimal data-parameter allocation for CPT differs markedly based on our estimated scaling factors; 4) the effectiveness of transfer at scale is influenced by training duration and linguistic properties, while being robust to data replaying, a method that effectively mitigates catastrophic forgetting in CPT. We hope our findings provide deeper insights into the transferability of LLMs at scale for the research community.
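
The abstract invokes the Hoffmann et al. (2022) parametric loss and an "extended scaling law ... with a joint data-parameter scaling term" without stating the formula. As a hedged sketch only, the baseline law and one plausible joint-term extension are shown below; the coefficient C and exponents gamma, delta are illustrative assumptions, and the paper's exact functional form and fitted values may differ.

```latex
% Chinchilla-style parametric loss from Hoffmann et al. (2022), fit over
% model size N (parameters) and training data D (tokens):
\[
  L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
\]
% Illustrative extension with a joint data-parameter term, as the abstract
% describes for CPT (exact form and coefficients are assumptions here):
\[
  L_{\mathrm{CPT}}(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
    + \frac{C}{N^{\gamma} D^{\delta}}
\]
```

Consistent with the Huber (1992) and Nocedal (1980) entries in the reference list, such parametric fits are typically estimated by minimizing a Huber loss over observed (N, D, loss) triples with L-BFGS; the compute-optimal data-parameter allocation then follows from minimizing the fitted loss under a fixed compute budget (roughly proportional to N times D).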

References (38)
  1. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
  2. PIQA: Reasoning about physical commonsense in natural language. arXiv preprint arXiv:1911.11641.
  3. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  4. Together Computer. 2023. Redpajama: an open dataset for training large language models.
  5. XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
  6. Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. arXiv e-prints, 2023.
  7. Zero-shot cross-lingual transfer language selection using linguistic similarity. Information Processing & Management, 60(3):103250.
  8. Orthogonal gradient descent for continual learning. In International Conference on Artificial Intelligence and Statistics, pages 3762–3773. PMLR.
  9. Scaling laws for sparsely-connected foundation models. arXiv preprint arXiv:2309.08520.
  10. Continual pre-training of large language models: How to (re)warm your model? arXiv preprint arXiv:2308.04014.
  11. Scaling laws and interpretability of learning from repeated data. arXiv preprint arXiv:2205.10487.
  12. Scaling laws for transfer. arXiv preprint arXiv:2102.01293.
  13. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
  14. Peter J Huber. 1992. Robust estimation of a location parameter. In Breakthroughs in statistics: Methodology and distribution, pages 492–518. Springer.
  15. Simple and scalable strategies to continually pre-train large language models. arXiv preprint arXiv:2403.08763.
  16. Damjan Kalajdzievski. 2024. Scaling laws for forgetting when fine-tuning large language models. arXiv preprint arXiv:2401.05605.
  17. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.
  18. Scaling laws for fine-grained mixture of experts. arXiv preprint arXiv:2402.07871.
  19. Few-shot learning with multilingual generative language models. In Conference on Empirical Methods in Natural Language Processing.
  20. Scaling data-constrained language models. arXiv preprint arXiv:2305.16264.
  21. Jorge Nocedal. 1980. Updating quasi-newton matrices with limited storage. Mathematics of computation, 35(151):773–782.
  22. Continual learning with foundation models: An empirical study of latent replay. In Conference on lifelong learning agents, pages 60–91. PMLR.
  23. Sinno Jialin Pan and Qiang Yang. 2009. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22(10):1345–1359.
  24. XCOPA: A multilingual dataset for causal commonsense reasoning. In Conference on Empirical Methods in Natural Language Processing.
  25. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  26. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints.
  27. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64:99–106.
  28. Jonathan W Siegel and Jinchao Xu. 2020. Approximation rates for neural networks with general activation functions. Neural Networks, 128:313–321.
  29. A survey on deep transfer learning. In Artificial Neural Networks and Machine Learning–ICANN 2018: 27th International Conference on Artificial Neural Networks, Rhodes, Greece, October 4-7, 2018, Proceedings, Part III 27, pages 270–279. Springer.
  30. Multilingual translation with extensible multilingual pretraining and finetuning. arXiv preprint arXiv:2008.00401.
  31. Scaling laws vs model architectures: How does inductive bias influence scaling? arXiv preprint arXiv:2207.10551.
  32. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  33. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.
  34. Overcoming catastrophic forgetting in massively multilingual continual learning. arXiv preprint arXiv:2305.16252.
  35. Emerging cross-lingual structure in pretrained language models. arXiv preprint arXiv:1911.01464.
  36. How transferable are features in deep neural networks? Advances in neural information processing systems, 27.
  37. When scaling meets LLM finetuning: The effect of data, model and finetuning method. arXiv preprint arXiv:2402.17193.
  38. A comprehensive survey on transfer learning. Proceedings of the IEEE, 109(1):43–76.
Authors (6)
  1. Wenzhen Zheng (5 papers)
  2. Wenbo Pan (4 papers)
  3. Xu Xu (57 papers)
  4. Libo Qin (77 papers)
  5. Li Yue (19 papers)
  6. Ming Zhou (182 papers)