An Empirical Survey of Data Augmentation for Limited Data Learning in NLP
This paper presents a comprehensive empirical survey of data augmentation methods for NLP, focusing on scenarios with limited labeled data. Recent advances in deep learning have significantly improved NLP model performance, yet these models often rely on large amounts of labeled data. In contexts where data is scarce or expensive to label, data augmentation offers a way to improve learning efficiency by artificially expanding the dataset through various transformation techniques.
Overview of Data Augmentation Methods
The survey categorizes data augmentation techniques into four main classes, each with unique approaches to increase dataset size and diversity while maintaining label integrity:
- Token-Level Augmentation: This manipulates individual tokens in a sentence, typically through synonym replacement, language-model-based replacement, or operations such as random insertion, deletion, and swapping, while aiming to preserve the semantic meaning (and hence the label) of the text; a sketch of these operations follows this list.
- Sentence-Level Augmentation: Techniques at this level generate entirely new sentences that retain the original meaning, primarily through paraphrasing and conditional generation. Paraphrasing is commonly performed via round-trip (back-) translation or neural paraphrase models; a back-translation sketch also follows this list.
- Adversarial Data Augmentation: Both white-box and black-box attacks can perturb input data in ways that challenge a model's predictions, thereby creating augmented samples for robust training.
- Hidden-Space Augmentation: Interventions occur in the model's internal representations, either by perturbing token embeddings or by interpolating the hidden representations of pairs of samples (and their labels) to create augmented examples.
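As a concrete illustration of the token-level operations above, the following is a minimal, self-contained sketch of EDA-style augmentation. The tiny `SYNONYMS` dictionary is an illustrative stand-in; in practice a lexical resource such as WordNet or a masked language model would supply replacements.

```python
import random

# Illustrative stand-in for a real synonym resource (e.g. WordNet).
SYNONYMS = {
    "good": ["great", "fine"],
    "movie": ["film"],
    "bad": ["poor", "awful"],
}

def synonym_replace(tokens, n=1):
    """Replace up to n tokens with a synonym from the lookup table."""
    tokens = tokens[:]
    candidates = [i for i, t in enumerate(tokens) if t.lower() in SYNONYMS]
    for i in random.sample(candidates, min(n, len(candidates))):
        tokens[i] = random.choice(SYNONYMS[tokens[i].lower()])
    return tokens

def random_insert(tokens, n=1):
    """Insert a synonym of a random word at a random position, n times."""
    tokens = tokens[:]
    for _ in range(n):
        candidates = [t for t in tokens if t.lower() in SYNONYMS]
        if not candidates:
            break
        word = random.choice(candidates)
        tokens.insert(random.randrange(len(tokens) + 1),
                      random.choice(SYNONYMS[word.lower()]))
    return tokens

def random_swap(tokens, n=1):
    """Swap two randomly chosen token positions, n times."""
    tokens = tokens[:]
    for _ in range(n):
        if len(tokens) < 2:
            break
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_delete(tokens, p=0.1):
    """Drop each token with probability p, never returning an empty sentence."""
    kept = [t for t in tokens if random.random() > p]
    return kept or [random.choice(tokens)]

sentence = "the movie was surprisingly good".split()
print(" ".join(synonym_replace(sentence)))
print(" ".join(random_insert(sentence)))
print(" ".join(random_swap(sentence)))
print(" ".join(random_delete(sentence)))
```

For sentence-level augmentation, round-trip translation can be sketched with off-the-shelf translation models. The example below assumes the Hugging Face `transformers` library and the publicly available Helsinki-NLP MarianMT English-German checkpoints; the pivot language is an arbitrary choice, and the helper names are illustrative rather than part of the survey.

```python
from transformers import MarianMTModel, MarianTokenizer

def load(name):
    return MarianTokenizer.from_pretrained(name), MarianMTModel.from_pretrained(name)

# Forward (en->de) and backward (de->en) translation models.
en_de_tok, en_de = load("Helsinki-NLP/opus-mt-en-de")
de_en_tok, de_en = load("Helsinki-NLP/opus-mt-de-en")

def translate(texts, tok, model):
    batch = tok(texts, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(**batch)
    return tok.batch_decode(generated, skip_special_tokens=True)

def back_translate(texts):
    # English -> German -> English yields paraphrases of the input sentences.
    return translate(translate(texts, en_de_tok, en_de), de_en_tok, de_en)

print(back_translate(["The film was surprisingly good."]))
```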
Empirical Experiments and Findings
The empirical evaluation covered several augmentation techniques across diverse NLP tasks, including news and topic classification, natural language inference, paraphrase detection, and sentiment analysis, using BERT-based models. The experiments used datasets with severely limited numbers of labeled instances, illustrating the efficacy of augmentation methods in both supervised and semi-supervised settings.
Key findings from the experiments highlighted that:
- Token-level augmentations generally perform well for supervised learning with extremely limited data.
- Sentence-level augmentations, particularly those involving paraphrasing or translation, show consistent gains in semi-supervised settings.
- No single augmentation method universally excels across all tasks; therefore, task-specific augmentation choices are critical.
- Adversarial and hidden-space augmentations can enhance robustness, but may not always translate into improved accuracy; see the hidden-space interpolation sketch after this list.
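To make the hidden-space idea concrete, the following is a minimal mixup-style interpolation sketch, assuming PyTorch; the fixed 768-dimensional vectors stand in for sentence representations (e.g. BERT [CLS] embeddings) and are illustrative only, not the survey's exact setup.

```python
import torch

def mixup_hidden(h_a, h_b, y_a, y_b, alpha=0.2):
    """Interpolate two hidden vectors and their (one-hot) labels with a Beta-sampled weight."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    h_mix = lam * h_a + (1 - lam) * h_b
    y_mix = lam * y_a + (1 - lam) * y_b
    return h_mix, y_mix

# Illustrative placeholders for two examples' sentence representations and labels.
h_a, h_b = torch.randn(768), torch.randn(768)
y_a = torch.tensor([1.0, 0.0])
y_b = torch.tensor([0.0, 1.0])
h_mix, y_mix = mixup_hidden(h_a, h_b, y_a, y_b)
```

The mixed pair (h_mix, y_mix) is then fed to the classifier head and trained against the interpolated label, encouraging smoother decision boundaries between classes.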
Future Directions and Challenges
The paper outlines future research pathways, including the development of data augmentation strategies that automatically adapt to specific tasks and datasets, facilitating more robust and generalizable NLP models. Incorporating theoretical guarantees that augmentation methods preserve labels and do not shift the data distribution is identified as a key challenge for future work.
Additionally, exploring augmentation in conjunction with semi-supervised learning frameworks such as consistency training continues to present fertile ground for enhancing NLP capabilities in low-resource settings; a minimal consistency-training sketch follows.
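As a rough illustration of consistency training, the sketch below computes a KL-divergence consistency loss between a model's predictions on an unlabeled batch and on an augmented view of that batch (for example, a back-translated paraphrase). It assumes PyTorch; `model`, `augment`, and the combination with the supervised loss are illustrative placeholders rather than the survey's exact formulation.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x_unlabeled, augment):
    # Predictions on the original unlabeled batch serve as soft targets;
    # no gradients flow through them.
    with torch.no_grad():
        target = F.softmax(model(x_unlabeled), dim=-1)
    # Predictions on the augmented view of the same batch.
    log_pred = F.log_softmax(model(augment(x_unlabeled)), dim=-1)
    # KL divergence pushes the two prediction distributions to agree.
    return F.kl_div(log_pred, target, reduction="batchmean")

# Typical usage: add the unsupervised term to the supervised objective,
# weighted by a coefficient lambda_u (placeholder names).
# total_loss = supervised_ce_loss + lambda_u * consistency_loss(model, x_u, augment)
```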
The contributions of this survey offer a valuable resource for practitioners and researchers seeking to harness data augmentation techniques within NLP, providing insights into their diverse applications and impact on model performance in data-limited scenarios.