Dynamic data sampler for cross-language transfer learning in large language models (2405.10626v1)
Abstract: LLMs have gained significant attention in the field of NLP due to their wide range of applications. However, training LLMs for languages other than English poses substantial challenges, owing to the difficulty of acquiring large-scale corpora and the requisite computing resources. In this paper, we propose ChatFlow, a cross-language transfer-based LLM, to address these challenges and train large Chinese LLMs in a cost-effective manner. We employ a mix of Chinese, English, and parallel corpora to continuously train the LLaMA-2 model, aiming to align cross-language representations and facilitate knowledge transfer to the Chinese LLM. In addition, we use a dynamic data sampler to progressively transition the model from unsupervised pre-training to supervised fine-tuning. Experimental results demonstrate that our approach accelerates model convergence and achieves superior performance. We evaluate ChatFlow on popular Chinese and English benchmarks; the results indicate that it outperforms other Chinese models post-trained on LLaMA-2-7B.
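To make the dynamic data sampler concrete, below is a minimal Python sketch of one way such a sampler could work: the share of supervised instruction data in each batch is gradually increased as training progresses, so the model transitions smoothly from unsupervised pre-training to supervised fine-tuning. The class and method names (`DynamicDataSampler`, `sample_batch`) and the linear schedule are illustrative assumptions, not the paper's actual implementation.

```python
import random


class DynamicDataSampler:
    """Illustrative sampler that gradually shifts the training mix from
    unsupervised pre-training text to supervised instruction data.

    The linear schedule and all names here are assumptions for illustration;
    the paper does not specify its exact scheduling function.
    """

    def __init__(self, pretrain_data, sft_data, total_steps):
        self.pretrain_data = pretrain_data  # e.g. Chinese/English/parallel corpora
        self.sft_data = sft_data            # e.g. instruction-response pairs
        self.total_steps = total_steps

    def sft_ratio(self, step):
        # Fraction of supervised examples in the current batch,
        # ramping linearly from 0 (pure pre-training) to 1 (pure SFT).
        return min(1.0, step / self.total_steps)

    def sample_batch(self, step, batch_size):
        ratio = self.sft_ratio(step)
        batch = []
        for _ in range(batch_size):
            if random.random() < ratio:
                batch.append(random.choice(self.sft_data))
            else:
                batch.append(random.choice(self.pretrain_data))
        return batch


# Usage: early batches are mostly raw text, late batches mostly instruction data.
sampler = DynamicDataSampler(
    pretrain_data=["raw text ..."], sft_data=["instruction ..."], total_steps=10000
)
print(sampler.sample_batch(step=100, batch_size=4))
print(sampler.sample_batch(step=9500, batch_size=4))
```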
Authors: Yudong Li, Yuhao Feng, Wen Zhou, Zhe Zhao, Linlin Shen, Cheng Hou, Xianxu Hou