
AgentInstruct: Toward Generative Teaching with Agentic Flows (2407.03502v1)

Published 3 Jul 2024 in cs.AI, cs.CL, and cs.LG

Abstract: Synthetic data is becoming increasingly important for accelerating the development of LLMs, both large and small. Despite several successful use cases, researchers have also raised concerns around model collapse and the drawbacks of imitating other models. This discrepancy can be attributed to the fact that synthetic data varies in quality and diversity. Effective use of synthetic data usually requires significant human effort in curating the data. We focus on using synthetic data for post-training, specifically using powerful models to create data that teaches a new skill or behavior to another model; we refer to this setting as Generative Teaching. We introduce AgentInstruct, an extensible agentic framework for automatically creating large amounts of diverse and high-quality synthetic data. AgentInstruct can create both the prompts and responses, using only raw data sources like text documents and code files as seeds. We demonstrate the utility of AgentInstruct by creating a post-training dataset of 25M pairs to teach LLMs different skills, such as text editing, creative writing, tool usage, coding, reading comprehension, etc. The dataset can be used for instruction tuning of any base model. We post-train Mistral-7b with the data. When comparing the resulting model Orca-3 to Mistral-7b-Instruct (which uses the same base model), we observe significant improvements across many benchmarks. For example, 40% improvement on AGIEval, 19% improvement on MMLU, 54% improvement on GSM8K, 38% improvement on BBH, and 45% improvement on AlpacaEval. Additionally, it consistently outperforms other models such as LLAMA-8B-instruct and GPT-3.5-turbo.

AgentInstruct: Toward Generative Teaching with Agentic Flows

The paper "AgentInstruct: Toward Generative Teaching with Agentic Flows," authored by Arindam Mitra et al., focuses on the application of synthetic data to facilitate the development and post-training of LLMs. It introduces an innovative agentic framework, termed AgentInstruct, designed to generate high-quality, diverse synthetic data by leveraging powerful models and iterative workflows.

Overview

This research addresses the challenges of using synthetic data for model training. Prior work has shown that synthetic data can accelerate training, but there are concerns regarding model collapse when models are trained on data generated by other models. These risks are attributed to variations in the quality and diversity of synthetic data. Traditionally, significant human effort is required to curate effective synthetic datasets. AgentInstruct introduces a systematic method to automate the generation of such data using multi-agent workflows, refined through reflection and iteration.

Key Contributions

AgentInstruct Framework: The core innovation lies in its extensible agentic workflows, which facilitate the automatic generation of large amounts of diverse, high-quality synthetic data. This approach circumvents the necessity for pre-defined prompts by using raw documents and source code as seeds.

Data Generation: The framework generates both prompts and responses using agentic flows (a minimal code sketch follows the list below). This involves three stages:

  1. Content Transformation: transforming raw data into intermediate forms conducive to task-specific instructions.
  2. Instruction Creation: generating a variety of instructions from transformed data.
  3. Instruction Refinement: enhancing the complexity and quality of instructions through iterative refinement.
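
The paper does not prescribe a single implementation of these flows; the following is a minimal Python sketch of how one such flow could be orchestrated, assuming a hypothetical call_llm(prompt) helper that wraps a strong teacher model. The stage prompts here are illustrative, not the paper's actual agent prompts.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to a strong teacher model (e.g. via an API)."""
    raise NotImplementedError

def content_transformation(seed_text: str) -> str:
    """Turn a raw seed document into an intermediate form suited to the target skill."""
    return call_llm(
        "Rewrite the following document as a self-contained passage "
        f"suitable for generating tasks:\n\n{seed_text}"
    )

def instruction_creation(passage: str, n: int = 5) -> list[str]:
    """Generate a diverse set of candidate instructions from the transformed content."""
    raw = call_llm(
        f"Write {n} varied questions or tasks about the passage below, "
        f"one per line:\n\n{passage}"
    )
    return [line.strip() for line in raw.splitlines() if line.strip()]

def instruction_refinement(passage: str, instruction: str, rounds: int = 2) -> str:
    """Iteratively make an instruction more complex or more challenging."""
    for _ in range(rounds):
        instruction = call_llm(
            "Make this task harder or more nuanced while keeping it answerable "
            f"from the passage.\n\nPassage:\n{passage}\n\nTask:\n{instruction}"
        )
    return instruction

def generate_pairs(seed_text: str) -> list[dict]:
    """Run the full flow and produce (prompt, response) pairs for post-training."""
    passage = content_transformation(seed_text)
    pairs = []
    for task in instruction_creation(passage):
        task = instruction_refinement(passage, task)
        answer = call_llm(f"Passage:\n{passage}\n\nTask:\n{task}\n\nAnswer:")
        pairs.append({"prompt": task, "response": answer})
    return pairs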

Empirical Evaluation: AgentInstruct demonstrated its utility by creating a synthetic post-training dataset of 25 million pairs to teach LLMs various skills such as text editing, creative writing, and coding. Subsequent fine-tuning of the Mistral-7B model with this dataset resulted in the model Orca-3, which showed notable improvements over baseline models on benchmarks like AGIEval (40% improvement), MMLU (19% improvement), GSM8K (54% improvement), BBH (38% improvement), and AlpacaEval (45% improvement).
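
Each generated pair can be serialized as a simple instruction-tuning record for supervised fine-tuning of a base model such as Mistral-7B. The field names below are illustrative assumptions, not the schema of the paper's released data.

# Illustrative shape of one post-training example; field names are assumed,
# not taken from the paper's dataset.
example_record = {
    "messages": [
        {"role": "user",
         "content": "Rewrite the following paragraph in a formal tone: ..."},
        {"role": "assistant",
         "content": "..."},
    ],
    "skill": "text_editing",  # tag for the capability the pair is meant to teach
}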

Detailed Insights

Reading Comprehension: AgentInstruct's workflows for reading comprehension involve generating diverse question types, ranging from literal comprehension to critical and evaluative questions. Empirical results indicate an 18% improvement over preceding models such as Orca-2.5 and a 21% improvement relative to Mistral-7B-Instruct. Notably, the fine-tuned model's performance on LSAT reading comprehension matches GPT-4, a significant achievement given the human-level difficulty of these questions.
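
To make "diverse question types" concrete, an instruction-creation agent could be driven by an explicit question taxonomy. The categories below paraphrase those named above (literal, critical, evaluative), and the prompt wording is an assumption rather than the paper's actual prompt.

# Illustrative taxonomy-driven prompt for reading-comprehension instruction creation.
QUESTION_TYPES = [
    "literal comprehension (facts stated directly in the passage)",
    "inferential (conclusions supported by, but not stated in, the passage)",
    "critical/evaluative (assessing the argument's assumptions or evidence)",
]

def reading_comprehension_prompt(passage: str) -> str:
    kinds = "\n".join(f"- {k}" for k in QUESTION_TYPES)
    return (
        "Write one question of each of the following types about the passage, "
        f"labeling each with its type:\n{kinds}\n\nPassage:\n{passage}"
    )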

Mathematical Reasoning: When assessing mathematical reasoning, Orca-3 showed substantial performance gains, including an improvement of up to 168% on the AGIEval math section. This underscores the robustness of AgentInstruct in teaching high-school to college-level math efficiently.

Format Following: Precise format adherence, essential for real-world applications, was enhanced by 11.5% using AgentInstruct data, allowing Orca-3 to surpass sophisticated models like Gemini Pro.

Summarization and Hallucination: The refinement flows resulted in a marked reduction in hallucination rates (down by 31.34%) while maintaining overall summary quality. This highlights the effectiveness of AgentInstruct in generating grounded, high-quality text.

Retrieval Augmented Generation (RAG): Evaluation on the MIRAGE benchmark demonstrated a 38.3% improvement on average, illustrating the capability of the generated dataset to enhance domain-specific knowledge retrieval and application.

Implications and Future Work

Practical Implications: The success of AgentInstruct implies a reduction in the cost and effort associated with human intervention for synthetic data generation. This offers significant potential for continual model improvement and customization across various domains, including finance, healthcare, and gaming, by employing domain-specific data as seeds.

Theoretical Implications: The ability of AgentInstruct to generate data that promotes skill learning rather than overfitting to specific benchmarks represents a paradigm shift in LLM training strategies. This approach may pave the way for more generalized, robust, and adaptable models.

Future Developments: Future work could explore automating the construction of agentic flows and validating the accuracy of generated data. There is also scope for extending this methodology to other stages of model training, including pre-training and domain-specific specializations. Additionally, addressing potential biases and costs associated with synthetic data generation remains a crucial area for further research.

In conclusion, while the capabilities of LLMs have been advancing rapidly, the introduction of AgentInstruct provides a structured, effective framework for leveraging synthetic data at scale. The empirical results reaffirm the utility of agentic flows in improving LLM performance across a wide range of tasks, marking a significant step forward in the field of generative teaching.

Authors (14)
  1. Arindam Mitra (40 papers)
  2. Luciano Del Corro (9 papers)
  3. Guoqing Zheng (25 papers)
  4. Shweti Mahajan (6 papers)
  5. Dany Rouhana (1 paper)
  6. Andres Codas (5 papers)
  7. Yadong Lu (19 papers)
  8. Wei-Ge Chen (2 papers)
  9. Olga Vrousgos (1 paper)
  10. Corby Rosset (21 papers)
  11. Fillipe Silva (1 paper)
  12. Hamed Khanpour (6 papers)
  13. Yash Lara (3 papers)
  14. Ahmed Awadallah (27 papers)