EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
The paper "EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks" introduces a set of straightforward yet effective data augmentation techniques specifically designed for NLP tasks. These techniques are termed as Easy Data Augmentation (EDA) and consist of four primary operations: synonym replacement (SR), random insertion (RI), random swap (RS), and random deletion (RD).
Outline and Contributions
The primary contributions of this paper include:
- Introduction of EDA: A suite of basic text editing techniques aimed at improving model performance on text classification tasks.
- Impact on Smaller Datasets: Demonstrates that EDA significantly boosts model performance, especially on smaller datasets.
- Systematic Evaluation: Evaluation across five benchmark datasets and two commonly used neural architectures (Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs)).
- Extensive Ablation Studies: Detailed ablation studies that isolate the impact of each augmentation technique and provide practical guidelines for their effective usage.
Methodology
EDA employs four simple operations for augmenting text data:
- Synonym Replacement (SR): Replaces n randomly chosen non-stop words in the sentence with randomly selected synonyms.
- Random Insertion (RI): Inserts a random synonym of a randomly chosen non-stop word at a random position in the sentence, repeated n times.
- Random Swap (RS): Randomly chooses two words in the sentence and swaps their positions, repeated n times.
- Random Deletion (RD): Removes each word in the sentence independently with probability p.
These techniques are straightforward to implement and do not require significant computational resources; a minimal sketch follows.
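The following is not the authors' released implementation but an illustrative sketch: the paper draws synonyms from WordNet, whereas the SYNONYMS table and get_synonyms helper here are hypothetical stand-ins, and the stop-word filtering the paper applies for SR and RI is omitted for brevity.

```python
import random

# Hypothetical synonym table standing in for a real resource such as WordNet.
SYNONYMS = {
    "quick": ["fast", "speedy"],
    "film": ["movie", "picture"],
    "good": ["fine", "great"],
}

def get_synonyms(word):
    return SYNONYMS.get(word, [])

def synonym_replacement(words, n):
    """SR: replace up to n distinct words that have synonyms."""
    new_words = words[:]
    candidates = [w for w in set(words) if get_synonyms(w)]
    random.shuffle(candidates)
    for word in candidates[:n]:
        synonym = random.choice(get_synonyms(word))
        new_words = [synonym if w == word else w for w in new_words]
    return new_words

def random_insertion(words, n):
    """RI: insert a synonym of a random word at a random position, n times."""
    new_words = words[:]
    for _ in range(n):
        candidates = [w for w in new_words if get_synonyms(w)]
        if not candidates:
            break
        synonym = random.choice(get_synonyms(random.choice(candidates)))
        new_words.insert(random.randrange(len(new_words) + 1), synonym)
    return new_words

def random_swap(words, n):
    """RS: swap two randomly chosen positions, n times."""
    new_words = words[:]
    for _ in range(n):
        i, j = random.randrange(len(new_words)), random.randrange(len(new_words))
        new_words[i], new_words[j] = new_words[j], new_words[i]
    return new_words

def random_deletion(words, p):
    """RD: delete each word independently with probability p (keep at least one)."""
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]
```

Each function takes a tokenized sentence (a list of words) and returns a new list, leaving the original untouched.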
Experimental Setup
The experiments are conducted on five text classification tasks:
- SST-2: Stanford Sentiment Treebank.
- CR: Customer Reviews.
- SUBJ: Subjectivity/Objectivity Dataset.
- TREC: Question Type Dataset.
- PC: Pro-Con Dataset.
The evaluated models are an RNN built on Long Short-Term Memory (LSTM) cells and a CNN configured for text classification.
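For orientation, a bare-bones classifier of the RNN flavor evaluated in the paper could look like the following PyTorch sketch; the layer sizes, vocabulary handling, and absence of a training loop are illustrative choices, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Minimal embedding -> LSTM -> linear text classifier (illustrative sizes)."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=64, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)   # hidden: (1, batch, hidden_dim)
        return self.fc(hidden[-1])             # (batch, num_classes)

# Example forward pass: a batch of 4 sentences, each padded to length 10.
logits = LSTMClassifier(vocab_size=5000)(torch.randint(1, 5000, (4, 10)))
```

EDA operates purely at the data level, so the same augmented training set can feed either architecture unchanged.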
Results
EDA demonstrates notable performance improvements, particularly evident in scenarios with limited training data. Key findings include:
- Overall Improvement: EDA achieved an average improvement of 0.8% for full datasets and 3.0% for datasets restricted to 500 training samples.
- Efficiency with Limited Data: Training with EDA on only 50% of the available training data matched the accuracy of training on the full dataset without EDA.
- Visual Consistency: Latent-space visualizations indicated that EDA largely preserved sentence semantics, suggesting that augmented sentences generally retained their original class labels.
Ablation Studies
Detailed ablation studies revealed the contribution of each EDA operation to performance gains:
- Synonym Replacement (SR): Effective at low augmentation levels, declines at higher values.
- Random Insertion (RI): Stable gains across a range of augmentation levels.
- Random Swap (RS): High gains at low levels, diminishing returns as swaps increase.
- Random Deletion (RD): Most effective at low deletion levels, substantial performance loss at higher levels.
Empirical evidence suggests that an augmentation parameter α of around 0.1 provides a balance between efficacy and preserving the original labels.
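As a concrete reading of this parameter (on the paper's parameterization, where SR, RI, and RS each change n = α·l words of an l-word sentence and RD uses deletion probability p = α): for a 20-word sentence, α = 0.1 means roughly two words are replaced, inserted, or swapped per operation, and each word is deleted with probability 0.1.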
Practical Recommendations
The paper provides practical parameters for deploying EDA based on dataset size:
- Smaller datasets benefit from higher augmentation rates.
- Parameters such as α (the fraction of words changed per operation) and the number of augmented sentences generated per original (n_aug) are tuned to different dataset sizes to maximize performance gains (see the sketch after this list).
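Below is a sketch of how these two knobs might be wired together, building on the toy operations given in the Methodology section; the mapping n = max(1, ⌊α·l⌋) and the default values here are illustrative, not the paper's recommended settings.

```python
def eda(sentence, alpha=0.1, n_aug=4):
    """Generate n_aug augmented variants of one sentence (illustrative defaults)."""
    words = sentence.split()
    n = max(1, int(alpha * len(words)))  # words touched by SR / RI / RS
    operations = [
        lambda w: synonym_replacement(w, n),
        lambda w: random_insertion(w, n),
        lambda w: random_swap(w, n),
        lambda w: random_deletion(w, alpha),  # deletion probability tied to alpha
    ]
    return [" ".join(random.choice(operations)(words)) for _ in range(n_aug)]

# Example: four augmented copies of one training sentence.
augmented = eda("the quick film was good", alpha=0.1, n_aug=4)
```

Larger datasets would typically use a smaller n_aug, since each original sentence already contributes enough signal, while smaller datasets can afford more augmented copies per original.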
Comparison with Related Work
EDA is positioned as an easy-to-implement approach compared to more complex techniques like variational auto-encoders (VAEs) and back-translation, which require additional models and external datasets. EDA's simplicity and independence from external datasets make it a versatile tool for various NLP tasks.
Discussion and Limitations
While EDA shows significant advantages for smaller datasets, its impact diminishes with large datasets or when using pre-trained models like BERT or ELMo. Additionally, comparing EDA's efficacy with related work presents challenges due to differing evaluation methodologies and datasets.
Conclusion
EDA contributes valuable insights into data augmentation for NLP, showcasing that simple operations can yield meaningful improvements in text classification tasks. These enhancements are especially critical for models trained on small datasets. While EDA may not represent the zenith of text augmentation techniques, its simplicity and effectiveness offer a robust baseline for future research endeavors.