Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models (2402.13064v1)

Published 20 Feb 2024 in cs.CL

Abstract: We introduce Generalized Instruction Tuning (called GLAN), a general and scalable method for instruction tuning of LLMs. Unlike prior work that relies on seed examples or existing datasets to construct instruction tuning data, GLAN exclusively utilizes a pre-curated taxonomy of human knowledge and capabilities as input and generates large-scale synthetic instruction data across all disciplines. Specifically, inspired by the systematic structure of the human education system, we build the taxonomy by decomposing human knowledge and capabilities into various fields, sub-fields and, ultimately, distinct disciplines semi-automatically, facilitated by LLMs. Subsequently, we generate a comprehensive list of subjects for every discipline and proceed to design a syllabus tailored to each subject, again utilizing LLMs. With the fine-grained key concepts detailed in every class session of the syllabus, we are able to generate diverse instructions with broad coverage across the entire spectrum of human knowledge and skills. Extensive experiments on LLMs (e.g., Mistral) demonstrate that GLAN excels in multiple dimensions, from mathematical reasoning, coding, academic exams and logical reasoning to general instruction following, without using task-specific training data for these tasks. In addition, GLAN allows for easy customization: new fields or skills can be added by simply incorporating a new node into our taxonomy.

Advanced Synthetic Instruction Tuning for LLMs through GLAN

Introduction to Generalized Instruction Tuning

The advent of LLMs has significantly advanced the capabilities of AI in understanding and generating human-like text. Despite these advances, eliciting reliable instruction-following behavior from LLMs remains a challenge. GLAN (Generalized Instruction Tuning) addresses this gap by generating synthetic instruction-tuning data that covers a wide range of human knowledge and capabilities. Unlike previous work that relies on seed examples or existing datasets, GLAN draws on a pre-curated taxonomy of human knowledge, enabling the generation of diverse instructions across all disciplines.

Methodology of GLAN

GLAN's approach is inspired by the systematic structure of the human education system, breaking down human knowledge into fields, sub-fields, and disciplines. The process is facilitated by LLMs with minimal human verification, making it both scalable and customizable. The key phases of the GLAN methodology are:

  • Taxonomy Creation: Construction of a comprehensive taxonomy that guides the synthetic instruction generation process.
  • Subject and Syllabus Generation: Utilizing LLMs to generate a list of subjects for each discipline, followed by detailed syllabuses outlining class sessions and key concepts.
  • Instruction Generation: Leveraging class session and key concept details to generate diverse homework questions and their corresponding answers.

This methodology mirrors the structure of human educational systems, emphasizing the generation of high-quality, diverse instructional data; a minimal code sketch of the end-to-end flow follows.
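
The sketch below traces the taxonomy → subjects → syllabus → instructions flow in Python. The `complete()` wrapper, the prompt wordings, and the JSON response format are illustrative assumptions for exposition, not the paper's exact prompts or infrastructure.

```python
import json

def complete(prompt: str) -> str:
    """Placeholder for a call to any instruction-following LLM (assumption)."""
    raise NotImplementedError("wire up an LLM client here")

def list_subjects(discipline: str) -> list[str]:
    # Subject generation: enumerate what a student of the discipline must master.
    out = complete(f"List the subjects a student of {discipline} must master. "
                   "Return a JSON array of subject names.")
    return json.loads(out)

def build_syllabus(discipline: str, subject: str) -> list[dict]:
    # Syllabus generation: break a subject into class sessions with key concepts.
    out = complete(f"Design a syllabus for '{subject}' in {discipline}. Return a "
                   "JSON array of sessions, each with 'title' and 'key_concepts'.")
    return json.loads(out)

def generate_instructions(subject: str, session: dict, n: int = 3) -> list[dict]:
    # Instruction generation: homework questions plus step-by-step answers.
    pairs = []
    for _ in range(n):
        question = complete(
            f"Write one challenging homework question for the session "
            f"'{session['title']}' of '{subject}', probing these key concepts: "
            f"{', '.join(session['key_concepts'])}."
        )
        answer = complete(f"Answer step by step:\n{question}")
        pairs.append({"instruction": question, "response": answer})
    return pairs

def glan_pipeline(taxonomy: dict[str, list[str]]) -> list[dict]:
    # taxonomy maps each field to its disciplines, e.g. {"Natural Sciences": ["Physics"]}.
    data = []
    for _field, disciplines in taxonomy.items():
        for discipline in disciplines:
            for subject in list_subjects(discipline):
                for session in build_syllabus(discipline, subject):
                    data.extend(generate_instructions(subject, session))
    return data
```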

Experimental Findings

Extensive experiments were conducted to test GLAN's effectiveness. GLAN demonstrated strong performance across several dimensions: mathematical reasoning, coding, academic exams, logical reasoning, and general instruction following. The generated instruction dataset spans a wide array of subjects, and GLAN-tuned Mistral models outperformed or closely matched leading models across various benchmarks.

Academic Exam Benchmarks: A Deeper Dive

A closer examination of performance on academic exams reveals GLAN's proficiency in STEM subjects, which the authors attribute to its ability to generate solutions with Chain-of-Thought reasoning. However, there is room for improvement in the humanities and social sciences, highlighting areas for further development.
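
The Chain-of-Thought aspect can be pictured as simply an answer-generation prompt that asks for intermediate steps. A hedged sketch, reusing the `complete()` placeholder from the pipeline above (the prompt template is an assumption, not GLAN's exact wording):

```python
def answer_with_cot(question: str) -> str:
    # Elicit a step-by-step solution rather than a bare final answer.
    prompt = ("Solve the following problem. Reason step by step, showing each "
              "intermediate deduction, then state the final answer on its own "
              f"line.\n\nProblem: {question}")
    return complete(prompt)  # complete() as defined in the pipeline sketch
```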

Generalization Capabilities and Task-specific Training Data

An analysis that excluded task-specific training data confirmed GLAN's generalization capabilities: the tuned models did not converge to any specific domain present in the evaluation benchmarks. A separate instruction-following evaluation demonstrated GLAN's enhanced instruction-following abilities, albeit with room for further improvement.
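
This summary does not spell out the paper's leakage analysis, but a standard way to check that synthetic instructions have not converged onto benchmark material is an n-gram overlap test. The sketch below is such a heuristic, offered as an assumption rather than GLAN's actual procedure.

```python
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    # Sets of lowercase token n-grams; 13 is a common decontamination window.
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contaminated(instruction: str, benchmark_questions: list[str], n: int = 13) -> bool:
    # Flag an instruction that shares any n-gram with a benchmark question.
    inst = ngrams(instruction, n)
    return any(inst & ngrams(q, n) for q in benchmark_questions)

# Toy usage with a short window so the overlap is visible:
pairs = [{"instruction": "Compute the derivative of x^2 using the limit definition."}]
benchmark = ["Compute the derivative of x^2 at x = 3."]
clean = [p for p in pairs if not contaminated(p["instruction"], benchmark, n=5)]
```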

Future Directions

GLAN introduces a scalable, general methodology for synthetic instruction tuning that significantly improves LLMs' capabilities across multiple domains. Its ability to generate diverse, high-quality instruction data without relying on task-specific datasets marks a significant step toward better generalized instruction-following capabilities. Future work may explore expanding the taxonomy to broader data types, generating multi-turn conversation datasets, and refining techniques to lift performance in less well-served subjects such as the humanities and social sciences.

Conclusion

GLAN offers a novel, effective approach to instruction tuning and a promising avenue for enhancing the generalization capabilities of LLMs. By grounding instruction generation in a broad taxonomy of human knowledge rather than task-specific data, it is poised to advance the development of general-purpose LLMs.

Authors (20)
  1. Haoran Li
  2. Qingxiu Dong
  3. Zhengyang Tang
  4. Chaojun Wang
  5. Xingxing Zhang
  6. Haoyang Huang
  7. Shaohan Huang
  8. Xiaolong Huang
  9. Zeqiang Huang
  10. Dongdong Zhang
  11. Yuxian Gu
  12. Xin Cheng
  13. Xun Wang
  14. Si-Qing Chen
  15. Li Dong
  16. Wei Lu
  17. Zhifang Sui
  18. Benyou Wang
  19. Wai Lam
  20. Furu Wei
