
Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and the Case of Information Extraction (2303.04132v2)

Published 7 Mar 2023 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs have great potential for synthetic data generation. This work shows that useful data can be synthetically generated even for tasks that cannot be solved directly by LLMs: for problems with structured outputs, it is possible to prompt an LLM to perform the task in the reverse direction, by generating plausible input text for a target output structure. Leveraging this asymmetry in task difficulty makes it possible to produce large-scale, high-quality data for complex tasks. We demonstrate the effectiveness of this approach on closed information extraction, where collecting ground-truth data is challenging, and no satisfactory dataset exists to date. We synthetically generate a dataset of 1.8M data points, establish its superior quality compared to existing datasets in a human evaluation, and use it to finetune small models (220M and 770M parameters), termed SynthIE, that outperform the prior state of the art (with equal model size) by a substantial margin of 57 absolute points in micro-F1 and 79 points in macro-F1. Code, data, and models are available at https://github.com/epfl-dlab/SynthIE.

Citations (63)

Summary

  • The paper introduces a method that reverses the usual generation direction, prompting an LLM to produce natural-language text for a target output structure.
  • It combines knowledge-graph filtering, coherent triplet-set sampling, and GPT-based text generation to create a high-quality dataset of 1.8 million data points.
  • Models trained on this synthetic data outperform the prior state of the art (at equal model size) by 57 absolute points in micro-F1 and 79 points in macro-F1.

Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and the Case of Information Extraction

This paper presents a methodology for synthetic data generation (SDG) in NLP, targeting the challenges inherent in structured-output tasks such as closed information extraction (cIE). The authors propose reversing the typical task direction for LLMs, leveraging the asymmetry in difficulty between the two directions to produce high-quality synthetic datasets that substantially improve model performance.

Methodology

The core contribution of this research lies in exploiting the asymmetry between input text and structured output for data generation. LLMs struggle to produce structured outputs directly, since such outputs are rare in pretraining data and must conform to a fixed catalogue of entities and relations. This paper instead prompts LLMs in the reverse direction: generating plausible input texts from predefined structured outputs. In cIE, for example, this means generating text that faithfully expresses a set of (subject, relation, object) triplets drawn from a knowledge base such as Wikidata.

The process consists of three strategic components:

  1. Knowledge Graph Construction: The authors narrow the space of possible entities and relations by filtering the Wikidata graph to relevant subsets, ensuring comparability with existing datasets such as REBEL.
  2. Triplet Sampling: Crucially, the paper describes a sampling procedure that favors coherent triplet sets while covering a wide range of relations and entities through a mixed sampling strategy, yielding a far more balanced distribution than the noise-heavy REBEL data (see the sampling sketch after this list).
  3. Text Generation: Using OpenAI’s GPT models, triplet sets are verbalized into text via few-shot demonstrations, balancing generation cost and quality (see the generation sketch after this list).
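
The coherence requirement in step 2 is the interesting part: a triplet set should describe a connected neighborhood of the graph, so that a single text can plausibly express all of it. Below is a minimal sketch of that idea in Python; the toy graph and the random-walk-style sampler are illustrative assumptions, and the paper's actual procedure additionally reweights sampling to flatten the entity and relation distributions.

    import random

    # Toy knowledge graph as (subject, relation, object) triplets. In the paper
    # these come from a filtered Wikidata subgraph, not this hypothetical list.
    KG = [
        ("Marie Curie", "field of work", "physics"),
        ("Marie Curie", "award received", "Nobel Prize in Physics"),
        ("Nobel Prize in Physics", "conferred by", "Royal Swedish Academy of Sciences"),
        ("Marie Curie", "spouse", "Pierre Curie"),
        ("Pierre Curie", "field of work", "physics"),
    ]

    def sample_coherent_triplets(kg, size=3, seed=None):
        """Grow a triplet set in which every new triplet shares an entity with
        one already chosen, so the set forms a connected neighborhood."""
        rng = random.Random(seed)
        chosen = [rng.choice(kg)]
        entities = {chosen[0][0], chosen[0][2]}
        while len(chosen) < size:
            frontier = [t for t in kg
                        if t not in chosen and (t[0] in entities or t[2] in entities)]
            if not frontier:  # neighborhood exhausted; return what we have
                break
            nxt = rng.choice(frontier)
            chosen.append(nxt)
            entities.update({nxt[0], nxt[2]})
        return chosen

    print(sample_coherent_triplets(KG, size=3, seed=0))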

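For step 3, here is a minimal sketch of the reverse-generation call, assuming the current openai Python client; the model name and prompt wording are placeholders rather than the paper's exact setup, which used demonstrations with earlier GPT models.

    from openai import OpenAI  # assumes the openai Python client (v1+) is installed

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def triplets_to_prompt(triplets):
        """Linearize a triplet set and ask for text expressing exactly those
        facts. The wording is illustrative, not the paper's actual prompt."""
        facts = "\n".join(f"({s}; {r}; {o})" for s, r, o in triplets)
        return ("Write one short, fluent paragraph that expresses exactly "
                "the following facts and nothing else:\n" + facts)

    triplets = [
        ("Marie Curie", "field of work", "physics"),
        ("Marie Curie", "award received", "Nobel Prize in Physics"),
    ]

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": triplets_to_prompt(triplets)}],
    )
    print(response.choices[0].message.content)

Pairing the sampled triplet set with the returned paragraph yields one (text, structure) training example; the model to be fine-tuned then learns the hard direction, from text back to triplets.
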
Results

Empirical results are compelling: the synthetic dataset of 1.8 million data points enables training of models (SynthIE) that dramatically outperform existing state-of-the-art systems. The SynthIE models, fine-tuned from Flan-T5 checkpoints (220M and 770M parameters), improve on the prior state of the art at equal model size by 57 absolute points in micro-F1 and 79 points in macro-F1.

The paper provides thorough evaluations indicating that the synthetic data improves model performance across relation categories while maintaining consistent annotation quality. Human evaluation reveals substantially higher precision and recall of the facts expressed in each text, indicating that the synthetic texts are more faithful to their triplet sets than REBEL’s original data.
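
The gap between the two headline numbers is itself informative: micro-F1 pools triplet counts over all relations, so frequent relations dominate, while macro-F1 averages per-relation scores, so rare relations count equally; a larger macro gain therefore points to improvements on rare relations. A small self-contained sketch with hypothetical counts:

    def f1(p, r):
        return 0.0 if p + r == 0 else 2 * p * r / (p + r)

    # Hypothetical per-relation counts: (true positives, predicted, gold).
    per_relation = {
        "field of work":  (8, 10, 12),
        "award received": (1, 2, 8),
    }

    # Micro-F1: pool counts over all relations, then compute one F1.
    tp = sum(c[0] for c in per_relation.values())
    pred = sum(c[1] for c in per_relation.values())
    gold = sum(c[2] for c in per_relation.values())
    micro = f1(tp / pred, tp / gold)

    # Macro-F1: compute per-relation F1, then average with equal weight.
    macro = sum(f1(c[0] / c[1], c[0] / c[2])
                for c in per_relation.values()) / len(per_relation)

    print(f"micro-F1={micro:.3f}  macro-F1={macro:.3f}")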

Implications

In a field where data scarcity is a core limitation, especially for tasks requiring structured outputs, this paper’s methodology offers a practical way forward. By enabling the generation of large, high-quality datasets, it opens new avenues for robust model training, improving precision in information extraction and potentially extending to other structured NLP tasks such as entity linking or abstract meaning representation parsing.

More broadly, the approach provides a framework that can be adapted to linguistic challenges beyond structured outputs. By addressing the weaknesses observed in existing datasets, it may catalyze a shift toward more comprehensive, practical information extraction systems.

Concluding Remarks

This paper effectively demonstrates the potential of exploiting asymmetric task difficulty in LLMs for synthetic data generation. While the methodology is demonstrated on cIE, it applies to a broader class of structured prediction problems, suggesting promising directions for improving and expanding dataset generation techniques in NLP.