
Kun: Answer Polishment for Chinese Self-Alignment with Instruction Back-Translation (2401.06477v4)

Published 12 Jan 2024 in cs.CL and cs.AI

Abstract: In this paper, we introduce Kun, a novel approach for creating high-quality instruction-tuning datasets for LLMs without relying on manual annotations. Adapting a self-training algorithm based on instruction back-translation and answer polishment, Kun leverages unlabelled data from diverse sources such as Wudao, Wanjuan, and SkyPile to generate a substantial dataset of over a million Chinese instructional data points. This approach significantly deviates from traditional methods by using a self-curation process to refine and select the most effective instruction-output pairs. Our experiments with the 6B-parameter Yi model across various benchmarks demonstrate Kun's robustness and scalability. Our method's core contributions lie in its algorithmic advancement, which enhances data retention and clarity, and its innovative data generation approach that substantially reduces the reliance on costly and time-consuming manual annotations. This methodology presents a scalable and efficient solution for improving the instruction-following capabilities of LLMs, with significant implications for their application across diverse fields. The code and dataset can be found at https://github.com/Zheng0428/COIG-Kun

The paper introduces a methodology named Kun, designed to improve the instruction tuning of LLMs for Chinese text while circumventing the need for manually annotated datasets, which are typically resource-intensive to produce. The authors propose a self-training approach that combines instruction back-translation with answer polishment (AP) to generate a high-quality instruction-following dataset from unlabelled sources such as Wudao, Wanjuan, and SkyPile. The primary aim is to automatically curate and refine large datasets that improve the instruction-following performance of LLMs.

Methodology

Kun adapts a self-training algorithm built around two processes: instruction back-translation, which generates a candidate instruction for each unlabelled passage, and answer polishment, which rewrites the raw passage so that it directly and completely answers the generated instruction. A self-curation step then scores the resulting instruction-output pairs and retains only the most effective ones, ensuring the final dataset is contextually coherent. Because every step is driven by the model itself, the pipeline scales to millions of examples without heavy reliance on manual annotation.
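The two-step pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `model` argument stands in for an LLM call (the paper uses a Yi-based model), and the prompt wording is hypothetical.

```python
# Sketch of a Kun-style pipeline: instruction back-translation followed by
# answer polishment (AP). `model` is a stand-in for an LLM call.

def back_translate(text, model):
    """Generate a candidate instruction for an unlabelled passage."""
    return model("Write an instruction for which the following text "
                 "would be a good response:\n" + text)

def polish_answer(instruction, text, model):
    """AP step: rewrite the raw passage so it directly answers the
    generated instruction (dropping irrelevant or dangling context)."""
    return model("Instruction: " + instruction + "\n"
                 "Rewrite this text so it fully answers the instruction:\n"
                 + text)

def build_pairs(corpus, model):
    """Turn unlabelled passages into instruction-output pairs."""
    pairs = []
    for passage in corpus:
        instruction = back_translate(passage, model)
        answer = polish_answer(instruction, passage, model)
        pairs.append({"instruction": instruction, "output": answer})
    return pairs

# Toy stand-in "model" that echoes the last prompt line, for illustration only.
toy_model = lambda prompt: prompt.splitlines()[-1]
pairs = build_pairs(["Beijing is the capital of China."], toy_model)
print(len(pairs))  # one instruction-output pair per passage
```

In the actual system each of these calls is an LLM generation, and the pairs then pass through the self-curation filter described below in the paper.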

Experiments and Results

Empirical evaluations were conducted with the 6-billion-parameter Yi model, chosen for its open-source availability and reliable baseline performance. The experiments span standard Chinese benchmarks such as C-EVAL and CMMLU, focusing on the effectiveness of the instruction datasets produced by Kun. A human evaluation used 500 prompts from ShareGPT-zh, covering a range of task types, to compare model outputs against those of other LLMs.

Notably, the Kun-52k variant outperformed the comparison models, producing higher-quality outputs as judged by human evaluators. A key finding behind the method's success was that, during self-curation, scoring only the instruction component yielded higher final data quality than scoring instruction and output jointly.
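The self-curation finding above can be illustrated with a short sketch. The scorer here is a hypothetical stand-in (the paper uses an LLM-based scorer on a quality scale); the point is that the filter looks at the instruction alone, not the full pair.

```python
# Hedged sketch of the self-curation step: score each candidate pair and
# keep only those above a quality threshold. Per the paper's finding,
# only the instruction is scored; `score_fn` stands in for an LLM scorer.

def curate(pairs, score_fn, threshold=4):
    """Keep pairs whose instruction scores at or above the threshold."""
    kept = []
    for pair in pairs:
        score = score_fn(pair["instruction"])  # instruction only, not the pair
        if score >= threshold:
            kept.append(pair)
    return kept

# Toy scorer: favors instructions with some substance, for illustration only.
toy_scorer = lambda instr: 5 if len(instr.split()) >= 4 else 2

candidates = [
    {"instruction": "Summarize the history of the Great Wall.", "output": "..."},
    {"instruction": "Wall?", "output": "..."},
]
print(len(curate(candidates, toy_scorer)))  # → 1
```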

Contributions

The notable contributions of this paper include:

  • Algorithmic Advancement: The answer polishment (AP) process improves data coherence and clarity, yielding a larger and higher-quality dataset for fine-tuning.
  • Scalable Data Generation: Over one million Chinese instructional data points were curated from unlabelled data, challenging the traditional need for extensive human labor in data annotation.
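Curated instruction-output pairs of this kind are conventionally serialized as JSONL, one object per line, for supervised fine-tuning; the exact schema of the released COIG-Kun dataset may differ, so the sketch below shows only the generic format.

```python
# Serialize instruction-output pairs in the common JSONL format used for
# instruction tuning. The sample records here are illustrative only.
import json

pairs = [
    {"instruction": "解释什么是指令微调。", "output": "指令微调是一种训练方法……"},
    {"instruction": "Summarize the AP step.", "output": "AP rewrites raw text..."},
]

with open("kun_sft.jsonl", "w", encoding="utf-8") as f:
    for pair in pairs:
        # ensure_ascii=False keeps Chinese text readable in the file
        f.write(json.dumps(pair, ensure_ascii=False) + "\n")

with open("kun_sft.jsonl", encoding="utf-8") as f:
    print(sum(1 for _ in f))  # → 2
```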

Implications and Future Directions

Practically, the development of Kun suggests a scalable and efficient route for enhancing the instruction-following capabilities of LLMs, with wide-ranging applicability across fields that rely on them. Theoretically, it motivates further research into data generation methods that operate without costly manual annotation. Future work could extend Kun-like strategies to other languages and domains, further broadening the method's applicability in global contexts.

Overall, Kun represents a significant shift in the methodology of training LLMs, presenting a potentially impactful alternative to current data annotation practices. It opens avenues for broader application and scalability in AI, providing a useful template for similar challenges in the ever-expanding field of language processing technologies.

References (38)
  1. Llemma: An open language model for mathematics. arXiv preprint arXiv:2310.10631.
  2. BAAI. 2023a. Coig-pc.
  3. BAAI. 2023b. Coig-pc-lite.
  4. Qwen technical report. arXiv preprint arXiv:2309.16609.
  5. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335.
  6. Wanyun Cui and Qianle Wang. 2023. Ada-instruct: Adapting instruction generators for complex reasoning. arXiv preprint arXiv:2310.04484.
  7. Musilingo: Bridging music and text with pre-trained language models for music captioning and query response. arXiv preprint arXiv:2309.08730.
  8. Alpacafarm: A simulation framework for methods that learn from human feedback.
  9. Wanjuan: A comprehensive multimodal dataset for advancing english and chinese large models.
  10. C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. CoRR, abs/2305.08322.
  11. Exploring the impact of instruction data scaling on large language models: An empirical study on real-world use cases.
  12. Tigerscore: Towards building explainable metric for all text generation tasks. arXiv preprint arXiv:2310.00752.
  13. Mertech: Instrument playing technique detection using self-supervised pretrained model with multi-task finetuning. arXiv preprint arXiv:2310.09853.
  14. CMMLU: measuring massive multitask language understanding in chinese. CoRR, abs/2306.09212.
  15. Self-alignment with instruction backtranslation. arXiv preprint arXiv:2308.06259.
  16. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.
  17. Wizardcoder: Empowering code large language models with evol-instruct. arXiv preprint arXiv:2306.08568.
  18. Cross-task generalization via natural language crowdsourcing instructions. In ACL.
  19. OL-CC. 2023. Openlabel-chinese conversations dataset (ol-cc).
  20. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  21. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277.
  22. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.
  23. Moss: Training conversational language models from synthetic data.
  24. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html.
  25. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
  26. Huatuo: Tuning llama model with chinese medical knowledge. arXiv preprint arXiv:2304.06975.
  27. Self-instruct: Aligning language models with self-generated instructions.
  28. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ tasks. In EMNLP.
  29. Interactive natural language processing. arXiv preprint arXiv:2305.13246.
  30. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
  31. Skywork: A more open bilingual foundation model.
  32. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.
  33. Superclue: A comprehensive chinese large language model benchmark. arXiv preprint arXiv:2307.15020.
  34. WuDaoCorpora Text.
  35. Jianxin Yang. 2023. Firefly: Chinese conversational large language model. https://github.com/yangjianxin1/Firefly.
  36. Mammoth: Building math generalist models through hybrid instruction tuning. arXiv preprint arXiv:2309.05653.
  37. Chinese open instruction generalist: A preliminary release.
  38. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685.
Authors
  1. Tianyu Zheng
  2. Shuyue Guo
  3. Xingwei Qu
  4. Jiawei Guo
  5. Xinrun Du
  6. Chenghua Lin
  7. Wenhao Huang
  8. Jie Fu
  9. Ge Zhang
  10. Qi Jia