DARE: Data Augmented Relation Extraction with GPT-2 (2004.13845v1)

Published 6 Apr 2020 in cs.CL, cs.LG, and stat.ML

Abstract: Real-world Relation Extraction (RE) tasks are challenging to deal with, either due to limited training data or class imbalance issues. In this work, we present Data Augmented Relation Extraction (DARE), a simple method to augment training data by properly fine-tuning GPT-2 to generate examples for specific relation types. The generated training data is then used in combination with the gold dataset to train a BERT-based RE classifier. In a series of experiments we show the advantages of our method, which leads to improvements of up to 11 F1 score points against a strong baseline. Also, DARE achieves new state of the art in three widely used biomedical RE datasets, surpassing the previous best results by 4.7 F1 points on average.

DARE: Data Augmented Relation Extraction with GPT-2

The paper introduces DARE (Data Augmented Relation Extraction), an approach that enhances Relation Extraction (RE) through data augmentation with GPT-2. RE tasks are integral to identifying semantic relationships between entities in text, yet they often suffer from training data scarcity or class imbalance. The paper mitigates these issues by leveraging GPT-2's ability to generate synthetic training examples for specific relation types; the augmented datasets are then used to improve the performance of BERT-based RE classifiers.

Methodology

The authors employ a two-step strategy for data augmentation. First, they fine-tune a separate pre-trained GPT-2 model for each relation type in an RE dataset, and each fine-tuned model generates new training samples specific to its relation type. The synthetically generated data is then combined with the original gold-standard data to train BERT-based RE classifiers. To address the noise typical of generated samples and to increase robustness, the paper trains an ensemble of classifiers, each on a different subset of the generated data combined with the gold data, and aggregates their predictions.
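The two-step strategy above can be sketched in code. This is a minimal illustrative sketch, not the authors' implementation: the per-relation GPT-2 generator is replaced by a hypothetical template-based `generate_synthetic` function, and the BERT classifier by a trivial word-cue classifier, so that the focus is on the data flow DARE describes: generate per relation type, merge with gold data, and ensemble by majority vote.

```python
from collections import Counter
import random

# Gold-standard examples: (sentence, relation_label). Toy data, not from the paper.
gold = [
    ("Drug A increases the toxicity of drug B.", "interaction"),
    ("Chemical X induces disease Y.", "induces"),
    ("No relation between A and B was found.", "no_relation"),
]

def generate_synthetic(relation, n):
    """Stand-in for a per-relation fine-tuned GPT-2 generator: DARE
    fine-tunes one GPT-2 model per relation type and samples from it
    to obtain new sentences labeled with that relation."""
    templates = {
        "interaction": "Drug {} increases the toxicity of drug {}.",
        "induces": "Chemical {} induces disease {}.",
        "no_relation": "No relation between {} and {} was found.",
    }
    names = "CDEFGH"
    return [
        (templates[relation].format(random.choice(names), random.choice(names)), relation)
        for _ in range(n)
    ]

def train_classifier(data):
    """Stand-in for fine-tuning a BERT-based RE classifier: memorizes
    which labels each word co-occurs with in its training set."""
    cues = {}
    for sentence, label in data:
        for word in sentence.lower().split():
            cues.setdefault(word.strip("."), Counter())[label] += 1

    def predict(sentence):
        votes = Counter()
        for word in sentence.lower().split():
            votes = votes + cues.get(word.strip("."), Counter())
        return votes.most_common(1)[0][0] if votes else "no_relation"

    return predict

def ensemble_predict(classifiers, sentence):
    """Majority vote over ensemble members, mirroring how DARE aggregates
    classifiers trained on different synthetic subsets."""
    votes = Counter(clf(sentence) for clf in classifiers)
    return votes.most_common(1)[0][0]

random.seed(0)
relations = ["interaction", "induces", "no_relation"]
classifiers = []
for _ in range(5):  # five ensemble members, each on a different synthetic subset
    synthetic = [ex for r in relations for ex in generate_synthetic(r, 4)]
    subset = random.sample(synthetic, 8)
    classifiers.append(train_classifier(gold + subset))

print(ensemble_predict(classifiers, "Chemical Z induces disease W."))  # induces
```

Training each member on a different slice of the synthetic data is what limits the damage a batch of noisy generated sentences can do: a bad subset misleads at most one vote.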

Experimental Evaluation

The method was tested on three biomedical RE datasets: CDR, DDI2013, and ChemProt, which exhibit varying degrees of class imbalance and scarcity of positive samples. The experiments demonstrated substantial gains in classification performance: DARE improved F1 scores by up to 11 points over strong baselines on the most imbalanced datasets, and achieved new state-of-the-art results on all three, surpassing the previous best results by an average of 4.7 F1 points.
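The F1 gains above are measured with the standard precision/recall F1, which is the metric of choice under class imbalance. A small worked example (hypothetical counts, not from the paper) shows why: a classifier that misses most positives on a skewed test set can still post high accuracy, while its F1 exposes the failure.

```python
# Hypothetical counts on an imbalanced test set: 950 negatives, 50 positives.
# The classifier predicts 30 positives, 20 of them correct.
tp, fp, fn, tn = 20, 10, 30, 940

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2))  # 0.96 -- looks strong despite missing most positives
print(round(f1, 2))        # 0.5  -- reveals the poor minority-class performance
```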

Implications and Future Directions

The implications of this research are significant for the development and future utilization of RE systems, particularly in domains where data availability is a persistent challenge. By automating the generation of diverse training data without relying on domain expertise or manually curated augmentations, DARE presents a scalable way to improve text classification. On the theoretical side, the work contributes techniques for controlled text data generation and for integrating and balancing ensembles of classifiers trained on synthetic data.

Future work may apply similar data augmentation techniques to other natural language processing tasks, adjust the GPT-2 fine-tuning procedure, or explore alternative generator architectures. Extending the experiments beyond biomedical text would validate the versatility of the approach, and tighter control of generated-data quality or stronger noise-reduction strategies could further improve the usefulness of synthetic data for training robust classifiers.

In summary, DARE's GPT-2-based data augmentation offers a promising enhancement to RE tasks, delivering substantial performance gains in scenarios plagued by class imbalance and limited data, and marking a notable advance in text data augmentation strategies.

Authors (2)
  1. Yannis Papanikolaou (10 papers)
  2. Andrea Pierleoni (8 papers)
Citations (73)