Dynamic data sampler for cross-language transfer learning in large language models (2405.10626v1)

Published 17 May 2024 in cs.CL

Abstract: LLMs have gained significant attention in the field of NLP due to their wide range of applications. However, training LLMs for languages other than English poses significant challenges, due to the difficulty in acquiring large-scale corpora and the requisite computing resources. In this paper, we propose ChatFlow, a cross-language transfer-based LLM, to address these challenges and train large Chinese LLMs in a cost-effective manner. We employ a mix of Chinese, English, and parallel corpora to continuously train the LLaMA2 model, aiming to align cross-language representations and facilitate the knowledge transfer specifically to the Chinese LLM. In addition, we use a dynamic data sampler to progressively transition the model from unsupervised pre-training to supervised fine-tuning. Experimental results demonstrate that our approach accelerates model convergence and achieves superior performance. We evaluate ChatFlow on popular Chinese and English benchmarks; the results indicate that it outperforms other Chinese models post-trained on LLaMA-2-7B.

References (19)
  1. “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
  2. “The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only,” arXiv preprint arXiv:2306.01116, 2023.
  3. “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
  4. “Glm-130b: An open bilingual pre-trained model,” arXiv preprint arXiv:2210.02414, 2022.
  5. “Curriculum learning,” in Proceedings of the 26th annual international conference on machine learning, 2009, pp. 41–48.
  6. “Measuring massive multitask language understanding,” arXiv preprint arXiv:2009.03300, 2020.
  7. “C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models,” arXiv preprint arXiv:2305.08322, 2023.
  8. “Cmmlu: Measuring massive multitask language understanding in chinese,” arXiv preprint arXiv:2306.09212, 2023.
  9. “Evaluating the performance of large language models on gaokao benchmark,” arXiv preprint arXiv:2305.12474, 2023.
  10. “Superclue: A comprehensive chinese large language model benchmark,” arXiv preprint arXiv:2307.15020, 2023.
  11. “Paracrawl: Web-scale parallel corpora for the languages of the eu,” in Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks, 2019, pp. 118–119.
  12. “Wikimatrix: Mining 135m parallel sentences in 1620 language pairs from wikipedia,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021, pp. 1351–1361.
  13. “Cluecorpus2020: A large-scale chinese corpus for pre-training language model,” arXiv preprint arXiv:2003.01355, 2020.
  14. “Csl: A large-scale chinese scientific literature dataset,” in Proceedings of the 29th International Conference on Computational Linguistics, 2022, pp. 3917–3923.
  15. Yunjie Ji, Yong Deng, Yan Gong, Yiping Peng, Qiang Niu, Baochang Ma, and Xiangang Li, “Belle: Be everyone’s large language model engine,” https://github.com/LianjiaTech/BELLE, 2023.
  16. “Enhancing chat language models by scaling high-quality instructional conversations,” arXiv preprint arXiv:2305.14233, 2023.
  17. “Finetuned language models are zero-shot learners,” in International Conference on Learning Representations.
  18. “Chinese open instruction generalist: A preliminary release,” 2023.
  19. “TencentPretrain: A scalable and flexible toolkit for pre-training models of different modalities,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Toronto, Canada, July 2023, pp. 217–225, Association for Computational Linguistics.
Authors (7)
  1. Yudong Li
  2. Yuhao Feng
  3. Wen Zhou
  4. Zhe Zhao
  5. Linlin Shen
  6. Cheng Hou
  7. Xianxu Hou

Summary

Dynamic Data Sampler for Cross-Language Transfer Learning in LLMs

The paper under examination, titled "Dynamic Data Sampler for Cross-Language Transfer Learning in LLMs," addresses the substantial challenge of training LLMs for non-English languages, such as Chinese. This work, authored by Yudong Li et al., from institutions including Shenzhen University and Tencent AI Lab, introduces a novel approach named ChatFlow to facilitate cost-effective training via cross-language transfer learning.

Overview and Motivation

Prevalent LLMs such as LLaMA2 excel largely because massive English-language corpora are available for training. This data disparity makes it difficult to build high-quality LLMs for languages like Chinese, which constitutes only 1.4% of the web corpus. Existing Chinese models such as ChatGLM and Baichuan often rely on private datasets, hindering reproducibility and broader research efforts. ChatFlow aims to fill this gap by leveraging English-language resources to enhance a Chinese LLM through a cross-language transfer mechanism.

Methodology

Transfer Learning with Dynamic Data Sampler

ChatFlow builds on the LLaMA2-7B model, adding Chinese-language capabilities through continued training on bilingual (Chinese and English) corpora with dynamic, progressive data sampling. The dynamic data sampler plays a critical role by ensuring a smooth transition from unsupervised pre-training to supervised fine-tuning (SFT), inspired by curriculum learning principles.

Instead of switching abruptly from pre-training to fine-tuning, the dynamic sampler gradually increases the proportion of Chinese data and supervised instruction data in the training batches. This gradual shift supports stable representation learning and avoids the instability caused by sudden changes in the data distribution.
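
The paper does not spell out the exact schedule here, but a minimal sketch of the idea is shown below, assuming a linear interpolation between an initial pre-training-heavy mixture and a final SFT-heavy mixture over the course of training; the source names and weights are illustrative placeholders, not values taken from the paper.

```python
import random

# Illustrative mixtures (fractions sum to 1); the paper's actual
# proportions and schedule are not reproduced here.
START_MIX = {"pretrain_en": 0.45, "pretrain_zh": 0.30, "parallel": 0.20, "sft": 0.05}
END_MIX   = {"pretrain_en": 0.10, "pretrain_zh": 0.30, "parallel": 0.10, "sft": 0.50}

def mixture_at(progress: float) -> dict:
    """Linearly interpolate sampling weights as training progresses (0.0 -> 1.0)."""
    p = min(max(progress, 0.0), 1.0)
    return {k: (1 - p) * START_MIX[k] + p * END_MIX[k] for k in START_MIX}

def sample_source(step: int, total_steps: int, rng: random.Random) -> str:
    """Pick the data source for the next training example at a given step."""
    weights = mixture_at(step / total_steps)
    names, probs = zip(*weights.items())
    return rng.choices(names, weights=probs, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    total = 10_000
    for step in (0, 5_000, 9_999):
        counts = {}
        for _ in range(1_000):
            src = sample_source(step, total, rng)
            counts[src] = counts.get(src, 0) + 1
        # SFT share grows and the English pre-training share shrinks over time.
        print(step, counts)
```

Any monotone schedule (cosine, piecewise-linear, and so on) could replace the linear ramp without changing the overall structure.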

Training Data Composition

The training data comprises approximately 50 GB, spanning unsupervised corpora, a parallel Chinese-English corpus, and instruction data (a minimal wiring sketch follows the list):

  • Parallel Corpus: Sources such as ParaCrawl v9 and WikiMatrix help align cross-language representations, enabling knowledge transfer from English to Chinese.
  • Unsupervised Corpus: Incorporates Chinese datasets such as CLUECorpus2020 and CSL, plus a subset of the English RefinedWeb corpus to preserve existing knowledge while expanding Chinese capabilities.
  • Instruction Data: Utilizes diverse sources like BELLE and UltraChat to enhance the model’s interaction proficiency.
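
To make this composition concrete, the sketch below shows one way the data types could be wired to a step-dependent sampler like the one sketched earlier. The file names and source keys are placeholders; the actual corpora and preprocessing are described in the paper but not reproduced here.

```python
import random
from typing import Callable, Dict, Iterator

def line_reader(path: str) -> Iterator[str]:
    """Cycle over a text file, treating each non-empty line as one training document."""
    while True:
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield line.strip()

def training_examples(total_steps: int,
                      pick_source: Callable[[int, int, random.Random], str],
                      seed: int = 0) -> Iterator[str]:
    """Yield one document per step from the source chosen by the sampler."""
    # Placeholder file names; the corpora named in the paper (a RefinedWeb
    # subset, CLUECorpus2020, CSL, ParaCrawl v9, WikiMatrix, BELLE, UltraChat,
    # ...) would be preprocessed into these files. Treating instruction data
    # as plain text lines is a simplification for this sketch.
    streams: Dict[str, Iterator[str]] = {
        "pretrain_en": line_reader("data/english_pretrain.txt"),
        "pretrain_zh": line_reader("data/chinese_pretrain.txt"),
        "parallel":    line_reader("data/zh_en_parallel.txt"),
        "sft":         line_reader("data/instruction_data.txt"),
    }
    rng = random.Random(seed)
    for step in range(total_steps):
        yield next(streams[pick_source(step, total_steps, rng)])
```

Here `pick_source` would be the `sample_source` function from the earlier sketch, or any other schedule with the same signature.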

Experimental Results

Performance Metrics

ChatFlow’s performance was evaluated on several benchmarks, including MMLU, C-Eval, CMMLU, and GAOKAO (a minimal scoring sketch follows the list):

  • Superior Performance: ChatFlow exhibited superior results compared to other Chinese models post-trained on LLaMA2-7B, such as HFL-Alpaca2, especially in the domains of Chinese understanding and bilingual capabilities.
  • Training Efficiency: The dynamic data sampler facilitated faster model convergence and higher stability across training stages, evidenced by tracking loss curves and performance metrics.
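
The benchmarks above are multiple-choice. A common way to score a causal LLM on them is to compare the model's likelihood of each answer option given the question; the sketch below assumes that protocol and leaves the scoring function abstract, since the paper's exact evaluation setup is not reproduced here.

```python
from typing import Callable, Dict, List

# Scorer signature: given a prompt and a candidate continuation, return the
# model's log-likelihood of that continuation. Which model and toolkit
# produce the score is deliberately left open.
Scorer = Callable[[str, str], float]

def score_multiple_choice(question: str,
                          options: Dict[str, str],
                          scorer: Scorer) -> str:
    """Pick the option whose text the model finds most likely after the question.

    This is a common protocol for MMLU-style benchmarks (C-Eval, CMMLU, and
    GAOKAO share the multiple-choice shape); it is not necessarily the exact
    setup used in the paper.
    """
    prompt = question.strip() + "\nAnswer: "
    return max(options, key=lambda label: scorer(prompt, options[label]))

def accuracy(examples: List[dict], scorer: Scorer) -> float:
    """examples: [{"question": ..., "options": {"A": ..., ...}, "answer": "A"}, ...]"""
    correct = sum(
        score_multiple_choice(ex["question"], ex["options"], scorer) == ex["answer"]
        for ex in examples
    )
    return correct / max(len(examples), 1)
```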

Human Evaluation

In a human evaluation on the SuperCLUE platform, ChatFlow ranked 5th among comparable 7B-scale models, a result the authors attribute to the knowledge it transfers from an English foundation model. It still trails state-of-the-art commercial models, leaving room for further improvement.

Implications and Future Directions

The proposed methodology highlights important practical and theoretical implications:

  • Practical Utility: ChatFlow offers a reproducible and efficient framework for bilingual LLM training, with a significant focus on resource efficiency and open availability.
  • Theoretical Insights: The work underscores the importance of dynamic data sampling in transfer learning, providing empirical evidence of its benefits in stabilizing learning processes in multilingual contexts.

Future research directions may explore extending this approach to other languages with similarly limited training data, refining the dynamic data sampler mechanism, and integrating reinforcement learning from human feedback (RLHF) to further optimize model performance.

Conclusion

The paper introduces ChatFlow, a well-structured, cost-effective strategy for enhancing Chinese LLMs through cross-language transfer. By innovatively employing a dynamic data sampler and leveraging both bilingual and instruction datasets, the paper contributes a valuable reference point for future cross-linguistic AI model developments. With its successful outcomes and open-source commitment, ChatFlow represents a meaningful step toward inclusive and reproducible AI research initiatives.
