Dynamic data sampler for cross-language transfer learning in large language models (2405.10626v1)

Published 17 May 2024 in cs.CL

Abstract: LLMs have gained significant attention in the field of NLP due to their wide range of applications. However, training LLMs for languages other than English poses significant challenges, due to the difficulty in acquiring large-scale corpora and the requisite computing resources. In this paper, we propose ChatFlow, a cross-language transfer-based LLM, to address these challenges and train large Chinese LLMs in a cost-effective manner. We employ a mix of Chinese, English, and parallel corpora to continuously train the LLaMA2 model, aiming to align cross-language representations and facilitate the knowledge transfer specifically to the Chinese LLM. In addition, we use a dynamic data sampler to progressively transition the model from unsupervised pre-training to supervised fine-tuning. Experimental results demonstrate that our approach accelerates model convergence and achieves superior performance. We evaluate ChatFlow on popular Chinese and English benchmarks; the results indicate that it outperforms other Chinese models post-trained on LLaMA-2-7B.

References (19)
  1. “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
  2. “The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only,” arXiv preprint arXiv:2306.01116, 2023.
  3. “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.
  4. “Glm-130b: An open bilingual pre-trained model,” arXiv preprint arXiv:2210.02414, 2022.
  5. “Curriculum learning,” in Proceedings of the 26th annual international conference on machine learning, 2009, pp. 41–48.
  6. “Measuring massive multitask language understanding,” arXiv preprint arXiv:2009.03300, 2020.
  7. “C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models,” arXiv preprint arXiv:2305.08322, 2023.
  8. “Cmmlu: Measuring massive multitask language understanding in chinese,” arXiv preprint arXiv:2306.09212, 2023.
  9. “Evaluating the performance of large language models on gaokao benchmark,” arXiv preprint arXiv:2305.12474, 2023.
  10. “Superclue: A comprehensive chinese large language model benchmark,” arXiv preprint arXiv:2307.15020, 2023.
  11. “Paracrawl: Web-scale parallel corpora for the languages of the eu,” in Proceedings of Machine Translation Summit XVII: Translator, Project and User Tracks, 2019, pp. 118–119.
  12. “Wikimatrix: Mining 135m parallel sentences in 1620 language pairs from wikipedia,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, 2021, pp. 1351–1361.
  13. “Cluecorpus2020: A large-scale chinese corpus for pre-training language model,” arXiv preprint arXiv:2003.01355, 2020.
  14. “Csl: A large-scale chinese scientific literature dataset,” in Proceedings of the 29th International Conference on Computational Linguistics, 2022, pp. 3917–3923.
  15. Yunjie Ji, Yong Deng, Yan Gong, Yiping Peng, Qiang Niu, Baochang Ma, and Xiangang Li, “Belle: Be everyone’s large language model engine,” https://github.com/LianjiaTech/BELLE, 2023.
  16. “Enhancing chat language models by scaling high-quality instructional conversations,” arXiv preprint arXiv:2305.14233, 2023.
  17. “Finetuned language models are zero-shot learners,” in International Conference on Learning Representations.
  18. “Chinese open instruction generalist: A preliminary release,” 2023.
  19. “TencentPretrain: A scalable and flexible toolkit for pre-training models of different modalities,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Toronto, Canada, July 2023, pp. 217–225, Association for Computational Linguistics.
Authors (7)
  1. Yudong Li
  2. Yuhao Feng
  3. Wen Zhou
  4. Zhe Zhao
  5. Linlin Shen
  6. Cheng Hou
  7. Xianxu Hou

Summary

Dynamic Data Sampler for Cross-Language Transfer Learning in LLMs

The paper under examination, titled "Dynamic Data Sampler for Cross-Language Transfer Learning in LLMs," addresses the substantial challenge of training LLMs for non-English languages, such as Chinese. This work, authored by Yudong Li et al., from institutions including Shenzhen University and Tencent AI Lab, introduces a novel approach named ChatFlow to facilitate cost-effective training via cross-language transfer learning.

Overview and Motivation

Prevalent LLMs such as LLaMA2 excel largely because massive English-language corpora are available for training. This data disparity makes it difficult to build high-quality LLMs for languages like Chinese, which constitutes only 1.4% of the web corpus. Existing Chinese models such as ChatGLM and Baichuan often rely on private datasets, hindering reproducibility and broader research efforts. ChatFlow aims to fill this gap by leveraging English-language resources to enhance a Chinese LLM through a cross-language transfer mechanism.

Methodology

Transfer Learning with Dynamic Data Sampler

ChatFlow builds on the LLaMA2-7B model, adding Chinese-language capabilities through continued training on bilingual (Chinese and English) corpora with dynamic, progressive data sampling. The dynamic data sampler plays a critical role by ensuring a smooth transition from unsupervised pre-training to supervised fine-tuning (SFT), inspired by curriculum learning principles.

Instead of switching abruptly from pre-training to fine-tuning, the dynamic sampler gradually increases the proportion of Chinese data and supervised instruction data in the training batches. This gradual shift supports stable representation learning and avoids the instability caused by sudden changes in the data distribution.
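
The paper does not spell out the exact schedule here, but a minimal sketch of the idea is shown below, assuming a linear interpolation between an initial pre-training-heavy mixture and a final SFT-heavy mixture over the course of training; the source names and weights are illustrative placeholders, not values taken from the paper.

```python
import random

# Illustrative mixtures (fractions sum to 1); the paper's actual
# proportions and schedule are not reproduced here.
START_MIX = {"pretrain_en": 0.45, "pretrain_zh": 0.30, "parallel": 0.20, "sft": 0.05}
END_MIX   = {"pretrain_en": 0.10, "pretrain_zh": 0.30, "parallel": 0.10, "sft": 0.50}

def mixture_at(progress: float) -> dict:
    """Linearly interpolate sampling weights as training progresses (0.0 -> 1.0)."""
    p = min(max(progress, 0.0), 1.0)
    return {k: (1 - p) * START_MIX[k] + p * END_MIX[k] for k in START_MIX}

def sample_source(step: int, total_steps: int, rng: random.Random) -> str:
    """Pick the data source for the next training example at a given step."""
    weights = mixture_at(step / total_steps)
    names, probs = zip(*weights.items())
    return rng.choices(names, weights=probs, k=1)[0]

if __name__ == "__main__":
    rng = random.Random(0)
    total = 10_000
    for step in (0, 5_000, 9_999):
        counts = {}
        for _ in range(1_000):
            src = sample_source(step, total, rng)
            counts[src] = counts.get(src, 0) + 1
        # SFT share grows and the English pre-training share shrinks over time.
        print(step, counts)
```

Any monotone schedule (cosine, piecewise-linear, and so on) could replace the linear ramp without changing the overall structure.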

Training Data Composition

The training data comprises approximately 50 GB, spanning unsupervised corpora, a parallel Chinese-English corpus, and instruction data (a minimal wiring sketch follows the list):

  • Parallel Corpus: Sources such as ParaCrawl v9 and WikiMatrix help align cross-language representations, enabling knowledge transfer from English to Chinese.
  • Unsupervised Corpus: Incorporates Chinese datasets such as CLUECorpus2020 and CSL, plus a subset of the English RefinedWeb corpus to preserve existing knowledge while expanding Chinese capabilities.
  • Instruction Data: Utilizes diverse sources like BELLE and UltraChat to enhance the model’s interaction proficiency.
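
To make this composition concrete, the sketch below shows one way the data types could be wired to a step-dependent sampler like the one sketched earlier. The file names and source keys are placeholders; the actual corpora and preprocessing are described in the paper but not reproduced here.

```python
import random
from typing import Callable, Dict, Iterator

def line_reader(path: str) -> Iterator[str]:
    """Cycle over a text file, treating each non-empty line as one training document."""
    while True:
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield line.strip()

def training_examples(total_steps: int,
                      pick_source: Callable[[int, int, random.Random], str],
                      seed: int = 0) -> Iterator[str]:
    """Yield one document per step from the source chosen by the sampler."""
    # Placeholder file names; the corpora named in the paper (a RefinedWeb
    # subset, CLUECorpus2020, CSL, ParaCrawl v9, WikiMatrix, BELLE, UltraChat,
    # ...) would be preprocessed into these files. Treating instruction data
    # as plain text lines is a simplification for this sketch.
    streams: Dict[str, Iterator[str]] = {
        "pretrain_en": line_reader("data/english_pretrain.txt"),
        "pretrain_zh": line_reader("data/chinese_pretrain.txt"),
        "parallel":    line_reader("data/zh_en_parallel.txt"),
        "sft":         line_reader("data/instruction_data.txt"),
    }
    rng = random.Random(seed)
    for step in range(total_steps):
        yield next(streams[pick_source(step, total_steps, rng)])
```

Here `pick_source` would be the `sample_source` function from the earlier sketch, or any other schedule with the same signature.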

Experimental Results

Performance Metrics

ChatFlow’s performance was evaluated on several benchmarks, including MMLU, C-Eval, CMMLU, and GAOKAO (a minimal scoring sketch follows the list):

  • Superior Performance: ChatFlow exhibited superior results compared to other Chinese models post-trained on LLaMA2-7B, such as HFL-Alpaca2, especially in the domains of Chinese understanding and bilingual capabilities.
  • Training Efficiency: The dynamic data sampler facilitated faster model convergence and higher stability across training stages, evidenced by tracking loss curves and performance metrics.
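
The benchmarks above are multiple-choice. A common way to score a causal LLM on them is to compare the model's likelihood of each answer option given the question; the sketch below assumes that protocol and leaves the scoring function abstract, since the paper's exact evaluation setup is not reproduced here.

```python
from typing import Callable, Dict, List

# Scorer signature: given a prompt and a candidate continuation, return the
# model's log-likelihood of that continuation. Which model and toolkit
# produce the score is deliberately left open.
Scorer = Callable[[str, str], float]

def score_multiple_choice(question: str,
                          options: Dict[str, str],
                          scorer: Scorer) -> str:
    """Pick the option whose text the model finds most likely after the question.

    This is a common protocol for MMLU-style benchmarks (C-Eval, CMMLU, and
    GAOKAO share the multiple-choice shape); it is not necessarily the exact
    setup used in the paper.
    """
    prompt = question.strip() + "\nAnswer: "
    return max(options, key=lambda label: scorer(prompt, options[label]))

def accuracy(examples: List[dict], scorer: Scorer) -> float:
    """examples: [{"question": ..., "options": {"A": ..., ...}, "answer": "A"}, ...]"""
    correct = sum(
        score_multiple_choice(ex["question"], ex["options"], scorer) == ex["answer"]
        for ex in examples
    )
    return correct / max(len(examples), 1)
```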

Human Evaluation

In a human evaluation on the SuperCLUE platform, ChatFlow ranked 5th among comparable 7B-scale models, a result the authors attribute to the knowledge it transfers from an English foundation model. It still trails state-of-the-art commercial models, leaving room for further improvement.

Implications and Future Directions

The proposed methodology highlights important practical and theoretical implications:

  • Practical Utility: ChatFlow offers a reproducible and efficient framework for bilingual LLM training, with a significant focus on resource efficiency and open availability.
  • Theoretical Insights: The work underscores the importance of dynamic data sampling in transfer learning, providing empirical evidence of its benefits in stabilizing learning processes in multilingual contexts.

Future research directions may explore extending this approach to other languages with similarly limited training data, refining the dynamic data sampler mechanism, and integrating reinforcement learning from human feedback (RLHF) to further optimize model performance.

Conclusion

The paper introduces ChatFlow, a well-structured, cost-effective strategy for enhancing Chinese LLMs through cross-language transfer. By innovatively employing a dynamic data sampler and leveraging both bilingual and instruction datasets, the paper contributes a valuable reference point for future cross-linguistic AI model developments. With its successful outcomes and open-source commitment, ChatFlow represents a meaningful step toward inclusive and reproducible AI research initiatives.
