COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning (2403.18058v2)

Published 26 Mar 2024 in cs.CL and cs.AI

Abstract: Remarkable progress on English instruction tuning has facilitated the efficacy and reliability of LLMs. However, there remains a noticeable gap in instruction tuning for Chinese, where complex linguistic features pose significant challenges. Existing datasets, generally distilled from English-centric LLMs, are not well aligned with Chinese users' interaction patterns. To bridge this gap, we introduce COIG-CQIA, a new Chinese instruction tuning dataset derived from various real-world resources and subjected to rigorous human verification. We conduct extensive experiments on COIG-CQIA and compare models trained on it against strong baseline models and datasets. The experimental results show that models trained on COIG-CQIA achieve highly competitive performance on diverse benchmarks. Additionally, our findings offer several insights for designing effective Chinese instruction-tuning datasets and data-mixing strategies. Our dataset is available at https://huggingface.co/datasets/m-a-p/COIG-CQIA.
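
Since the corpus is hosted on Hugging Face, it can presumably be pulled with the `datasets` library. A minimal sketch follows, assuming the release is organized into per-source configurations; the config name "zhihu" below is an assumption, so consult the dataset card for the names that actually exist:

```python
# Minimal sketch: loading COIG-CQIA with the Hugging Face `datasets` library.
# The repository id comes from the paper; the config name "zhihu" is an
# assumption -- check the dataset card for the configs actually published.
from datasets import load_dataset

ds = load_dataset("m-a-p/COIG-CQIA", "zhihu", split="train")
print(ds)     # number of rows and column names
print(ds[0])  # one instruction-response record
```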

COIG-CQIA: A High-Quality Chinese Instruction Tuning Dataset for Improved Human-Like Interaction

Introduction to COIG-CQIA

The evolution of LLMs has dramatically improved machine understanding and response generation, especially on instruction-following tasks. However, existing instruction-tuning resources cater predominantly to English, leaving a significant shortage of high-quality datasets for Chinese instruction fine-tuning. This gap hampers the development of models that can understand and execute instructions in Chinese with high fidelity. The COIG-CQIA dataset addresses this need: a comprehensive corpus tailored for Chinese instruction tuning, meticulously curated from diverse, authentic internet sources and processed to meet high quality standards.

Dataset Curation

COIG-CQIA stands out due to its methodical curation process and the wealth of sources it taps into for data collection. The dataset is derived from a mixture of social media platforms, Q&A communities, encyclopedias, exams, and existing NLP datasets, ensuring a broad coverage that spans both formal and informal usage, as well as a variety of domains such as STEM, humanities, and general knowledge.

The compilation process involved rigorous steps to ensure the quality and relevance of the data:

  • Filtering and Processing: Automated and manual review passes filtered out low-quality and irrelevant content and ensured the cleanliness of the data (a minimal filtering sketch follows this list).
  • Diverse Sources: Data was collected from over 22 unique sources, including prominent Chinese websites and forums, ensuring rich diversity in the dataset's instruction-response pairs.
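
As a minimal sketch of the kind of rule-based pre-filtering that might precede manual review, consider the heuristics below. The specific rules and thresholds are illustrative assumptions, not the authors' actual pipeline:

```python
# Illustrative rule-based pre-filtering of instruction-response pairs.
# Rules and thresholds are assumptions, not the paper's actual pipeline.
import re

def keep_pair(instruction: str, response: str) -> bool:
    """Return True if the pair passes simple quality heuristics."""
    # Drop empty or near-empty entries.
    if len(instruction.strip()) < 5 or len(response.strip()) < 10:
        return False
    # Drop pairs with leftover HTML markup from web scraping.
    if re.search(r"</?\w+[^>]*>", instruction + response):
        return False
    # Drop pairs whose response merely echoes the instruction.
    if response.strip() == instruction.strip():
        return False
    return True

pairs = [
    ("什么是指令微调？", "指令微调是在带有指令的数据上继续训练语言模型的过程。"),
    ("", "空指令应被过滤掉。"),
]
clean = [p for p in pairs if keep_pair(*p)]  # keeps only the first pair
```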

Dataset Composition and Characteristics

  • Task Variety: COIG-CQIA encompasses a wide array of task types, from question answering and knowledge extraction to open-ended generation, supporting comprehensive model training.
  • Volume and Diversity: The dataset comprises 48,375 instances spanning the domains and task types above; this breadth is crucial for training models to understand and produce a wide range of responses (see the inspection sketch after this list).
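
A quick way to get a feel for such a corpus is to tally task types and response lengths. In the sketch below, the field names ("instruction", "output", "task_type") are assumptions about the schema; adapt them to the columns of the actual release:

```python
# Sketch: inspecting corpus composition by task type and response length.
# Field names are assumed; adapt to the actual dataset columns.
from collections import Counter

records = [
    {"instruction": "解释牛顿第一定律。",
     "output": "物体在不受外力作用时保持静止或匀速直线运动。",
     "task_type": "question answering"},
    {"instruction": "写一首关于秋天的短诗。",
     "output": "秋风起，落叶黄……",
     "task_type": "generation"},
]

task_counts = Counter(r["task_type"] for r in records)
avg_len = sum(len(r["output"]) for r in records) / len(records)
print(task_counts)
print(f"average response length: {avg_len:.1f} characters")
```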

Data Analysis and Evaluation

The dataset was rigorously analyzed for diversity, quality, and coverage, and the influence of data from each source platform on model performance was evaluated across different benchmarks, demonstrating the dataset's effectiveness at improving models' ability to understand and execute Chinese instructions accurately. The paper also reports insights into data-mixing strategies; a toy sketch of weighted source mixing follows.
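
As a toy illustration of what source-level data mixing can look like, the sketch below samples training examples from per-source subsets with fixed weights. The subset names and weights are illustrative assumptions, not values reported in the paper:

```python
# Toy weighted data-mixing: draw examples from per-source subsets according
# to fixed sampling weights. Names and weights are illustrative assumptions.
import random

subsets = {
    "zhihu":   ["z1", "z2", "z3"],
    "exam":    ["e1", "e2"],
    "wikihow": ["w1", "w2", "w3", "w4"],
}
weights = {"zhihu": 0.5, "exam": 0.2, "wikihow": 0.3}

def sample_mixed(n: int, seed: int = 0) -> list[str]:
    """Sample n examples, choosing a source per draw according to `weights`."""
    rng = random.Random(seed)
    names = list(subsets)
    probs = [weights[k] for k in names]
    out = []
    for _ in range(n):
        src = rng.choices(names, weights=probs, k=1)[0]
        out.append(rng.choice(subsets[src]))
    return out

batch = sample_mixed(8)
```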

Experimental Findings and Implications

Models trained on the COIG-CQIA dataset achieved competitive results in both human assessment and benchmark evaluations, particularly on tasks requiring deep understanding and complex response generation. These results underscore COIG-CQIA's potential to advance instruction-tuned LLMs that can comprehensively understand and interact in Chinese.

Conclusion and Future Directions

The development of COIG-CQIA represents a substantial step toward closing the resource gap for Chinese instruction tuning. Its curation from a wide range of sources, combined with meticulous cleaning and processing, ensures high quality and diversity, making it a valuable asset for the Chinese NLP community.

The dataset’s release invites further research and exploration into instruction tuning for Chinese LLMs, with the potential to pave the way for models that demonstrate improved alignment with human interactions in Chinese. As the NLP field continues to evolve, datasets like COIG-CQIA will be instrumental in fostering advancements that bring us closer to achieving truly human-like interaction capabilities in AI systems.

Authors (22)
  1. Yuelin Bai
  2. Xinrun Du
  3. Yiming Liang
  4. Yonggang Jin
  5. Ziqiang Liu
  6. Junting Zhou
  7. Tianyu Zheng
  8. Xincheng Zhang
  9. Nuo Ma
  10. Zekun Wang
  11. Ruibin Yuan
  12. Haihong Wu
  13. Hongquan Lin
  14. Wenhao Huang
  15. Jiajun Zhang
  16. Chenghua Lin
  17. Jie Fu
  18. Min Yang
  19. Shiwen Ni
  20. Ge Zhang