
AgentInstruct: Toward Generative Teaching with Agentic Flows (2407.03502v1)

Published 3 Jul 2024 in cs.AI, cs.CL, and cs.LG

Abstract: Synthetic data is becoming increasingly important for accelerating the development of LLMs, both large and small. Despite several successful use cases, researchers have also raised concerns around model collapse and the drawbacks of imitating other models. This discrepancy can be attributed to the fact that synthetic data varies in quality and diversity. Effective use of synthetic data usually requires significant human effort in curating the data. We focus on using synthetic data for post-training, specifically using powerful models to create data that teaches a new skill or behavior to another model; we refer to this setting as Generative Teaching. We introduce AgentInstruct, an extensible agentic framework for automatically creating large amounts of diverse and high-quality synthetic data. AgentInstruct can create both the prompts and responses, using only raw data sources like text documents and code files as seeds. We demonstrate the utility of AgentInstruct by creating a post-training dataset of 25M pairs to teach LLMs different skills, such as text editing, creative writing, tool usage, coding, reading comprehension, etc. The dataset can be used for instruction tuning of any base model. We post-train Mistral-7b with the data. When comparing the resulting model Orca-3 to Mistral-7b-Instruct (which uses the same base model), we observe significant improvements across many benchmarks. For example, 40% improvement on AGIEval, 19% improvement on MMLU, 54% improvement on GSM8K, 38% improvement on BBH, and 45% improvement on AlpacaEval. Additionally, it consistently outperforms other models such as LLAMA-8B-instruct and GPT-3.5-turbo.

AgentInstruct: Toward Generative Teaching with Agentic Flows

The paper "AgentInstruct: Toward Generative Teaching with Agentic Flows," authored by Arindam Mitra et al., focuses on the application of synthetic data to facilitate the development and post-training of LLMs. It introduces an innovative agentic framework, termed AgentInstruct, designed to generate high-quality, diverse synthetic data by leveraging powerful models and iterative workflows.

Overview

This research addresses the challenges of using synthetic data for model training. Prior work has shown that synthetic data can accelerate training, but there are concerns regarding model collapse when models are trained on data generated by other models. These risks are attributed to variations in the quality and diversity of synthetic data. Traditionally, significant human effort is required to curate effective synthetic datasets. AgentInstruct introduces a systematic method to automate the generation of such data using multi-agent workflows, refined through reflection and iteration.

Key Contributions

AgentInstruct Framework: The core innovation lies in its extensible agentic workflows, which facilitate the automatic generation of large amounts of diverse, high-quality synthetic data. This approach circumvents the necessity for pre-defined prompts by using raw documents and source code as seeds.

Data Generation: The framework generates both prompts and responses using agentic flows (a minimal code sketch follows the list below). This involves three stages:

  1. Content Transformation: transforming raw data into intermediate forms conducive to task-specific instructions.
  2. Instruction Creation: generating a variety of instructions from transformed data.
  3. Instruction Refinement: enhancing the complexity and quality of instructions through iterative refinement.
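
The paper does not prescribe a single implementation of these flows; the following is a minimal Python sketch of how one such flow could be orchestrated, assuming a hypothetical call_llm(prompt) helper that wraps a strong teacher model. The stage prompts here are illustrative, not the paper's actual agent prompts.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to a strong teacher model (e.g. via an API)."""
    raise NotImplementedError

def content_transformation(seed_text: str) -> str:
    """Turn a raw seed document into an intermediate form suited to the target skill."""
    return call_llm(
        "Rewrite the following document as a self-contained passage "
        f"suitable for generating tasks:\n\n{seed_text}"
    )

def instruction_creation(passage: str, n: int = 5) -> list[str]:
    """Generate a diverse set of candidate instructions from the transformed content."""
    raw = call_llm(
        f"Write {n} varied questions or tasks about the passage below, "
        f"one per line:\n\n{passage}"
    )
    return [line.strip() for line in raw.splitlines() if line.strip()]

def instruction_refinement(passage: str, instruction: str, rounds: int = 2) -> str:
    """Iteratively make an instruction more complex or more challenging."""
    for _ in range(rounds):
        instruction = call_llm(
            "Make this task harder or more nuanced while keeping it answerable "
            f"from the passage.\n\nPassage:\n{passage}\n\nTask:\n{instruction}"
        )
    return instruction

def generate_pairs(seed_text: str) -> list[dict]:
    """Run the full flow and produce (prompt, response) pairs for post-training."""
    passage = content_transformation(seed_text)
    pairs = []
    for task in instruction_creation(passage):
        task = instruction_refinement(passage, task)
        answer = call_llm(f"Passage:\n{passage}\n\nTask:\n{task}\n\nAnswer:")
        pairs.append({"prompt": task, "response": answer})
    return pairs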

Empirical Evaluation: AgentInstruct demonstrated its utility by creating a synthetic post-training dataset of 25 million pairs to teach LLMs various skills such as text editing, creative writing, and coding. Subsequent fine-tuning of the Mistral-7B model with this dataset resulted in the model Orca-3, which showed notable improvements over baseline models on benchmarks like AGIEval (40% improvement), MMLU (19% improvement), GSM8K (54% improvement), BBH (38% improvement), and AlpacaEval (45% improvement).
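
Each generated pair can be serialized as a simple instruction-tuning record for supervised fine-tuning of a base model such as Mistral-7B. The field names below are illustrative assumptions, not the schema of the paper's released data.

# Illustrative shape of one post-training example; field names are assumed,
# not taken from the paper's dataset.
example_record = {
    "messages": [
        {"role": "user",
         "content": "Rewrite the following paragraph in a formal tone: ..."},
        {"role": "assistant",
         "content": "..."},
    ],
    "skill": "text_editing",  # tag for the capability the pair is meant to teach
}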

Detailed Insights

Reading Comprehension: AgentInstruct's workflows for reading comprehension involve generating diverse question types, ranging from literal comprehension to critical and evaluative questions. Empirical results indicate an 18% improvement over preceding models such as Orca-2.5 and a 21% improvement relative to Mistral-7B-Instruct. Notably, the fine-tuned model's performance on LSAT reading comprehension matches GPT-4, a significant achievement given the human-level difficulty of these questions.
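
To make "diverse question types" concrete, an instruction-creation agent could be driven by an explicit question taxonomy. The categories below paraphrase those named above (literal, critical, evaluative), and the prompt wording is an assumption rather than the paper's actual prompt.

# Illustrative taxonomy-driven prompt for reading-comprehension instruction creation.
QUESTION_TYPES = [
    "literal comprehension (facts stated directly in the passage)",
    "inferential (conclusions supported by, but not stated in, the passage)",
    "critical/evaluative (assessing the argument's assumptions or evidence)",
]

def reading_comprehension_prompt(passage: str) -> str:
    kinds = "\n".join(f"- {k}" for k in QUESTION_TYPES)
    return (
        "Write one question of each of the following types about the passage, "
        f"labeling each with its type:\n{kinds}\n\nPassage:\n{passage}"
    )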

Mathematical Reasoning: When assessing mathematical reasoning, Orca-3 showed substantial performance gains, including an improvement of up to 168% on the AGIEval math section. This underscores the robustness of AgentInstruct in teaching high-school to college-level math efficiently.

Format Following: Precise format adherence, essential for real-world applications, was enhanced by 11.5% using AgentInstruct data, allowing Orca-3 to surpass sophisticated models like Gemini Pro.

Summarization and Hallucination: The refinement flows resulted in a marked reduction in hallucination rates (down by 31.34%) while maintaining overall summary quality. This highlights the effectiveness of AgentInstruct in generating grounded, high-quality text.

Retrieval Augmented Generation (RAG): Evaluation on the MIRAGE benchmark demonstrated a 38.3% improvement on average, illustrating the capability of the generated dataset to enhance domain-specific knowledge retrieval and application.

Implications and Future Work

Practical Implications: The success of AgentInstruct implies a reduction in the cost and effort associated with human intervention for synthetic data generation. This offers significant potential for continual model improvement and customization across various domains, including finance, healthcare, and gaming, by employing domain-specific data as seeds.

Theoretical Implications: The ability of AgentInstruct to generate data that promotes skill learning rather than overfitting to specific benchmarks represents a paradigm shift in LLM training strategies. This approach may pave the way for more generalized, robust, and adaptable models.

Future Developments: Future work could explore automating the construction of agentic flows and validating the accuracy of generated data. There is also scope for extending this methodology to other stages of model training, including pre-training and domain-specific specializations. Additionally, addressing potential biases and costs associated with synthetic data generation remains a crucial area for further research.

In conclusion, while the capabilities of LLMs have been advancing rapidly, the introduction of AgentInstruct provides a structured, effective framework for leveraging synthetic data at scale. The empirical results reaffirm the utility of agentic flows in improving LLM performance across a wide range of tasks, marking a significant step forward in the field of generative teaching.

Authors (14)
  1. Arindam Mitra (40 papers)
  2. Luciano Del Corro (9 papers)
  3. Guoqing Zheng (25 papers)
  4. Shweti Mahajan (6 papers)
  5. Dany Rouhana (1 paper)
  6. Andres Codas (5 papers)
  7. Yadong Lu (19 papers)
  8. Wei-Ge Chen (2 papers)
  9. Olga Vrousgos (1 paper)
  10. Corby Rosset (21 papers)
  11. Fillipe Silva (1 paper)
  12. Hamed Khanpour (6 papers)
  13. Yash Lara (3 papers)
  14. Ahmed Awadallah (27 papers)