
MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation (2504.12563v1)

Published 17 Apr 2025 in cs.CL, cs.AI, and cs.LG

Abstract: Recent smaller LLMs such as Phi-3.5 and Phi-4 rely on synthetic data generated using larger LLMs. Questions remain about leveraging synthetic data for other use cases, such as adapting LLMs to specific domains. A key limitation of synthetic data is low diversity, which negatively impacts its downstream applicability for improving other models. To address this, we propose MetaSynth, a method for generating synthetic data that enhances diversity through meta-prompting, where an LLM orchestrates multiple "expert" LLM agents to collaboratively generate data. Using only 25 million tokens of synthetic data generated with MetaSynth, we successfully adapt a well-trained LLM (Mistral-7B-v0.3) to two specialized domains, Finance and Biomedicine, without compromising the capabilities of the resulting model in general tasks. In addition, we evaluate the diversity of our synthetic data using seven automated metrics, and find that it approaches the diversity of LLM pre-training corpora. Continually pre-training Mistral-7B-v0.3 with MetaSynth notably outperforms the base LLM, showing improvements of up to 4.08% in Finance and 13.75% in Biomedicine. The same model shows degraded performance when trained on data generated using a template prompt, even when the template includes prior generations and varying in-context exemplars of real data. Our findings suggest that a few million tokens of diverse synthetic data, without mixing any real data, are sufficient for effective domain adaptation when using MetaSynth.

MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation

The paper "MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation" explores a novel approach to synthesizing diverse data using LLMs. Recognizing a key limitation of current synthetic data generation methods, namely low diversity, this research aims to enrich synthetic data through a process termed "meta-prompting": a central LLM orchestrates multiple "expert agent" LLMs to collaboratively generate data that approaches the diversity found in pre-training corpora.

A significant challenge in using synthetic data to train smaller or specialized models such as Phi-3.5 and Phi-4 is the data's inherent lack of diversity, which impairs its effectiveness for domain adaptation. MetaSynth addresses this by employing an LLM to control the generation pipeline through meta-prompts, which coordinate expert agents that iteratively refine and diversify the generated data.

In contrast to traditional template-based prompts, which yield repetitive outputs due to their fixed structure and limited variation, meta-prompting enables a more dynamic generation process driven by feedback and iterative refinement from agentic workflows spanning multiple perspectives and specializations. The authors demonstrate this by generating a synthetic dataset of 25 million tokens and using it to adapt a well-trained 7-billion-parameter model, Mistral-7B, to specific domains such as Finance and Biomedicine, while sustaining general-task capabilities without degradation.
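To make the contrast concrete, the sketch below illustrates the kind of static template prompt the paper uses as a baseline: a single fixed scaffold whose only variation comes from rotating in-context exemplars and a window of prior generations. The scaffold wording and function names here are illustrative assumptions, not the paper's exact prompt.

```python
# Hypothetical template-prompt baseline (illustrative; not the paper's
# exact prompt). The scaffold stays fixed, so outputs tend to share
# structure even as exemplars and prior generations rotate.
TEMPLATE = (
    "You are a {domain} writer. Example documents:\n{exemplars}\n\n"
    "Avoid repeating these earlier generations:\n{prior}\n\n"
    "Now write one new {domain} document."
)

def build_template_prompt(domain, exemplars, prior, k=2):
    """Fill the fixed scaffold with k rotating exemplars and the k most
    recent prior generations."""
    return TEMPLATE.format(
        domain=domain,
        exemplars="\n---\n".join(exemplars[:k]),
        prior="\n---\n".join(prior[-k:]) if prior else "(none yet)",
    )
```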

The paper provides empirical evidence for the superiority of the MetaSynth approach over template-based methods. Continual pre-training of Mistral-7B on the diverse synthetic data produced notable performance improvements, up to 4.08% in Finance and 13.75% in Biomedicine, surpassing results obtained with template-prompted data even at the same token budget.
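To situate the training step, a minimal continual pre-training sketch follows, assuming the HuggingFace transformers stack as the tooling; the paper's actual hyperparameters and infrastructure are not specified here, and the document list is a placeholder.

```python
# Minimal continual pre-training sketch (assumption: HuggingFace
# transformers/datasets as the stack; hyperparameters are illustrative).
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "mistralai/Mistral-7B-v0.3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Placeholder for the ~25M tokens of MetaSynth-generated documents.
synthetic_docs = ["<metasynth document 1>", "<metasynth document 2>"]

ds = Dataset.from_dict({"text": synthetic_docs}).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="mistral-7b-metasynth",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
    ),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM
)
trainer.train()
```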

Methodologically, MetaSynth leverages meta-prompting to orchestrate a network of agents and maintain diversity throughout generation. These agents perform tasks such as seed keyword extraction, document summarization, and content analysis, and they condition on previously generated content to assess and maximize diversity. By routing the process through a central meta-LLM, the approach mitigates the risk of "model collapse" and sustains improvement without mixing in real data.
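A highly simplified sketch of such an agentic loop is given below. The `call_llm` callable, the expert roles, and the string-matching novelty gate are all illustrative assumptions; the paper's actual meta-prompting scaffold is richer than this.

```python
from typing import Callable, List

def metasynth_round(
    call_llm: Callable[[str], str],  # hypothetical chat-completion client
    seed_keywords: List[str],
    prior_summaries: List[str],
    max_retries: int = 3,
) -> str:
    """One generation round: a drafting expert writes a document from seed
    keywords, then a critic expert checks it for novelty against summaries
    of prior generations (all roles are illustrative assumptions)."""
    draft_prompt = (
        "You are a domain expert. Write a document grounded in these "
        "seed keywords: " + ", ".join(seed_keywords)
    )
    for _ in range(max_retries):
        draft = call_llm(draft_prompt)
        verdict = call_llm(
            "Prior document summaries:\n" + "\n".join(prior_summaries)
            + "\n\nDoes the draft below substantially repeat them? "
            + "Answer REPEAT or NOVEL.\n\n" + draft
        )
        if "NOVEL" in verdict.upper():
            prior_summaries.append(call_llm("Summarize in one sentence:\n" + draft))
            return draft
    return draft  # fall back to the last draft if the critic never approves
```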

The diversity of the generated data is quantitatively validated across multiple metrics, including Task2Vec Diversity Coefficient, Compression Ratio, and N-Gram Diversity. These metrics, central to the paper’s evaluation framework, confirm that MetaSynth's outputs rival real corpora in their semantic and syntactic diversity, emulating the variance typically present in high-quality, human-curated datasets.
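Two of these metrics are simple enough to sketch directly. Below is a minimal implementation of Compression Ratio (via zlib) and N-Gram Diversity over whitespace tokens; the exact tokenization and normalization are assumptions, and the paper may configure them differently.

```python
# Minimal sketch of two diversity metrics: Compression Ratio and
# N-Gram Diversity. Tokenization and normalization are assumptions.
import zlib

def compression_ratio(corpus: str) -> float:
    """Original size / compressed size; a higher ratio means more
    redundancy, i.e. less diverse text."""
    raw = corpus.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))

def ngram_diversity(corpus: str, max_n: int = 4) -> float:
    """Sum over n of (unique n-grams / total n-grams); higher = more diverse."""
    tokens = corpus.split()
    score = 0.0
    for n in range(1, max_n + 1):
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        if ngrams:
            score += len(set(ngrams)) / len(ngrams)
    return score

if __name__ == "__main__":
    repetitive = "the market rose the market rose the market rose"
    varied = "equities rallied while bond yields slipped on soft inflation data"
    print(compression_ratio(repetitive), compression_ratio(varied))
    print(ngram_diversity(repetitive), ngram_diversity(varied))
```

On the toy strings above, the repetitive text compresses better and scores lower n-gram diversity, matching the intuition these metrics encode.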

Furthermore, the synthesized instruction data serves as material for instruction pre-training of BERT-like encoder models, which then show superior performance on diagnostic tasks, indicating that high-quality synthetic input can transfer learning effectively.

While the research demonstrates compelling advances, limitations remain, particularly the computational overhead of high-quality data generation, which requires substantial infrastructure. Nevertheless, the innovations presented offer a promising pathway toward scalable, domain-adaptive synthetic data generation, pivotal to the ongoing evolution of LLM training methodologies.

The implications of this work are noteworthy for the fields of AI and natural language processing, as it suggests robust pathways for efficient domain adaptation using purely synthetic data, without mixing in real data. Future developments could focus on optimizing meta-prompting algorithms to reduce computation costs and on expanding the method's applicability to further specialized and resource-constrained domains. In conclusion, MetaSynth represents a significant step forward in reliable and diverse synthetic data generation, with broad potential applications across AI model training paradigms.

Authors (5)
  1. Haris Riaz (5 papers)
  2. Sourav Bhabesh (2 papers)
  3. Vinayak Arannil (3 papers)
  4. Miguel Ballesteros (70 papers)
  5. Graham Horwood (5 papers)