An Empirical Study of Instruction-tuning LLMs in Chinese
The paper provides a comprehensive empirical analysis of instruction-tuning LLMs in Chinese, a research area that remains underexplored despite Chinese having more native speakers than any other language. The work is positioned within the surge of interest in LLMs following the success of models such as ChatGPT and LLaMA, with the goal of building a Chinese-centric analog to these systems.
Core Components of Instruction-Tuning
The authors identify three critical components in the instruction-tuning process: LLM bases, parameter-efficient methods, and instruction datasets. By systematically examining these elements, the paper aims to optimize the instruction-following capabilities of LLMs tailored for Chinese.
- LLM Bases: The paper evaluates a range of open LLMs, such as LLaMA, BLOOM, and MOSS, highlighting BLOOM's balanced performance across benchmarks, attributed to its multilingual pre-training. In contrast, models like Vicuna and ChatGLM, although strong out of the box, show mixed results after further tuning on Alpaca-GPT4.
- Parameter-Efficient Methods: The paper assesses multiple methods, including LoRA, AdaLoRA, and prefix-tuning. LoRA emerges as a particularly effective approach, offering significant improvements with a manageable parameter footprint.
- Instruction Datasets: Diverse datasets, such as Alpaca-GPT4, Belle, and ShareGPT-zh, contribute complementary strengths to the models. Belle's large scale yields substantial gains, while ChatGPT-generated data improves performance on a broad range of instruction-following tasks.
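The parameter-efficiency argument behind LoRA can be made concrete with a small sketch: the pretrained weight matrix is frozen, and only a low-rank update is trained. The NumPy illustration below uses illustrative dimensions and rank, not the paper's actual settings:

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16, r=8):
    """Forward pass with a frozen base weight W plus a low-rank LoRA update.

    y = x @ W.T + (alpha / r) * x @ A.T @ B.T
    W stays frozen; only A (r x d_in) and B (d_out x r) are trained.
    """
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

d_in, d_out, r = 1024, 1024, 8            # illustrative sizes, not the paper's
rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01     # trainable down-projection
B = np.zeros((d_out, r))                  # trainable up-projection, init 0

x = rng.normal(size=(4, d_in))            # a batch of 4 activations
y = lora_forward(x, W, A, B)

full_params = W.size                      # what full fine-tuning would update
lora_params = A.size + B.size             # what LoRA actually trains
print(f"full: {full_params}, LoRA: {lora_params} "
      f"({100 * lora_params / full_params:.1f}%)")
```

Because B is initialized to zero, the model's output is unchanged at the start of tuning, and the trainable parameter count is under 2% of the full matrix, which is the "manageable parameter footprint" noted above.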
Additional Factors Influencing Model Performance
The paper further explores several ancillary factors:
- Chain-of-Thought (CoT) Data: Incorporating CoT data can enhance reasoning abilities, beneficial for complex tasks, though with occasional trade-offs in broader performance.
- Vocabulary Expansion: Expanding Chinese vocabulary in models like LLaMA requires subsequent pre-training to be effective, underscoring the importance of tailored linguistic adaptation.
- Prompt Language: Instruction-tuning benefits from using native language prompts for models less attuned to Chinese, while models with established multilingual capabilities, like Bloom, perform well with English prompts.
- Human Value Alignment: The integration of human-value alignment data can lead to a minor drop in model performance, indicating a delicate balance between ethical considerations and technical efficacy.
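To make the dataset and CoT discussion concrete, the sketch below renders instruction-tuning examples with an Alpaca-style prompt template. The template and the toy examples are common-convention assumptions, not the paper's exact format; a CoT example simply places the reasoning steps in the target output before the final answer:

```python
# Alpaca-style prompt template (a widespread convention; the paper's
# exact template may differ).
TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

def build_example(instruction, output):
    """Render one instruction-tuning example: (prompt, target response)."""
    return TEMPLATE.format(instruction=instruction), output

# Plain instruction-following example.
prompt, target = build_example(
    "Translate to Chinese: 'Good morning.'",
    "早上好。",
)

# Chain-of-thought example: the target includes intermediate reasoning.
cot_prompt, cot_target = build_example(
    "Xiao Ming has 3 apples and buys 5 more. How many does he have?",
    "He starts with 3 apples and buys 5 more, so 3 + 5 = 8. The answer is 8.",
)
print(prompt + target)
```

During tuning, the loss is typically computed only on the target tokens, so the model learns to produce the response (including any reasoning chain) given the prompt.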
Evaluation and Results
The models are evaluated on two benchmarks: Belle-eval for general instruction-following and MMCU for professional knowledge assessment. Results show that Chinese-centric instruction-tuning significantly enhances performance across tasks, with the authors' released model rivaling existing models such as ChatGLM despite using far fewer trainable parameters.
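For the knowledge benchmark, scoring multiple-choice items reduces to parsing a choice letter out of each model response and computing accuracy. The sketch below is a minimal illustration; the item format, the `generate` stub, and the answer-parsing heuristic are assumptions, not MMCU's actual harness:

```python
def extract_choice(response, choices="ABCD"):
    """Return the first choice letter found in a model response.

    A crude heuristic; real evaluation harnesses parse answers
    more strictly (e.g., requiring a specific answer format).
    """
    for ch in response:
        if ch in choices:
            return ch
    return None

def score(items, generate):
    """Accuracy of `generate` (prompt -> text) on multiple-choice items."""
    correct = sum(
        extract_choice(generate(item["question"])) == item["answer"]
        for item in items
    )
    return correct / len(items)

# Toy illustration with a stub "model" that always answers B.
toy_items = [
    {"question": "1 + 1 = ?  A. 1  B. 2  C. 3  D. 4", "answer": "B"},
    {"question": "2 * 3 = ?  A. 5  B. 4  C. 6  D. 8", "answer": "C"},
]
acc = score(toy_items, lambda q: "The answer is B.")
print(acc)  # 0.5
```

Belle-eval, by contrast, covers open-ended instructions, so it is typically scored by a judge model rather than by exact-match parsing like this.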
Implications and Future Directions
The findings from this paper have both practical and theoretical implications, particularly in customizing LLMs for Chinese applications while maintaining efficiency through parameter-efficient tuning methods. The authors' approach provides a methodological framework for future research, encouraging further exploration into optimizing LLMs with high-quality, diverse datasets.
Moreover, as AI continues to evolve, adapting LLMs to different linguistic and cultural contexts will remain a priority. This research sets a precedent for developing LLMs in other languages, emphasizing the value of detailed empirical analysis and careful, methodical fine-tuning.
Conclusion
This paper contributes valuable insights into the instruction-tuning of LLMs in Chinese, demonstrating the importance of model selection, dataset quality, and efficient tuning methodologies. The release of a competitive Chinese LLM underscores the potential to advance AI capabilities in nuanced and multilingual environments. Future research can build on these findings to enhance model robustness and cultural adaptability in LLMs globally.