An Empirical Survey of Data Augmentation for Limited Data Learning in NLP (2106.07499v1)

Published 14 Jun 2021 in cs.CL and cs.AI

Abstract: NLP has achieved great progress in the past decade through the use of neural models and large labeled datasets. The dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks where significant time, money, or expertise is required to label massive amounts of textual data. Recently, data augmentation methods have been explored as a means of improving data efficiency in NLP. To date, there has been no systematic empirical overview of data augmentation for NLP in the limited labeled data setting, making it difficult to understand which methods work in which settings. In this paper, we provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting, summarizing the landscape of methods (including token-level augmentations, sentence-level augmentations, adversarial augmentations, and hidden-space augmentations) and carrying out experiments on 11 datasets covering topics/news classification, inference tasks, paraphrasing tasks, and single-sentence tasks. Based on the results, we draw several conclusions to help practitioners choose appropriate augmentations in different settings and discuss the current challenges and future directions for limited data learning in NLP.

An Empirical Survey of Data Augmentation for Limited Data Learning in NLP

This paper presents a comprehensive empirical survey of data augmentation methods for NLP, focusing on scenarios with limited labeled data. Recent advances in deep learning have significantly improved NLP model performance, yet these models typically rely on large amounts of labeled data. In contexts where data is scarce or expensive to label, data augmentation offers a means to improve data efficiency by artificially expanding the training set through various transformation techniques.

Overview of Data Augmentation Methods

The survey categorizes data augmentation techniques into four main classes, each with unique approaches to increase dataset size and diversity while maintaining label integrity:

  1. Token-Level Augmentation: This involves manipulating individual tokens in a sentence, typically through synonym replacement, language-model-based substitution, or operations like random insertion, deletion, and swapping. These modifications aim to preserve the semantic meaning of the text.
  2. Sentence-Level Augmentation: Techniques at this level generate new sentences while retaining meaning, primarily through paraphrasing and conditional generation methodologies. Paraphrasing can be performed via round-trip translation or neural models tailored for sentence restructuring.
  3. Adversarial Data Augmentation: Both white-box and black-box attacks can perturb input data in ways that challenge a model's predictions, thereby creating augmented samples for robust training.
  4. Hidden-Space Augmentation: Interventions occur in the model’s internal representations either by perturbation of token embeddings or by mixing hidden representations of different samples to create augmented data.
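To make the token-level operations in category 1 concrete, here is a minimal sketch of two EDA-style transformations, random deletion and random swap. The function names and the deletion probability `p` are illustrative choices, not an implementation from the paper:

```python
import random

def random_deletion(tokens, p=0.1):
    """Drop each token independently with probability p, keeping at least one."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

def random_swap(tokens, n_swaps=1):
    """Swap the positions of two randomly chosen tokens, n_swaps times."""
    tokens = list(tokens)
    for _ in range(n_swaps):
        if len(tokens) < 2:
            break
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

sentence = "the quick brown fox jumps over the lazy dog".split()
augmented = random_swap(random_deletion(sentence, p=0.1), n_swaps=2)
```

Because each augmented sentence keeps (most of) the original tokens, the label is usually preserved, which is the key assumption behind these perturbations.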

Empirical Experiments and Findings

The empirical evaluation covered several augmentation techniques across diverse NLP tasks, such as news and topic classification, inference, paraphrase detection, and sentiment analysis, using BERT-based models. Experiments used datasets with severely limited labeled instances, illustrating the efficacy of augmentation methods in both supervised and semi-supervised settings.

Key findings from the experiments highlighted that:

  • Token-level augmentations generally perform well for supervised learning with extremely limited data.
  • Sentence-level augmentations, particularly those involving paraphrasing or translation, show consistent gains in semi-supervised settings.
  • No single augmentation method universally excels across all tasks; therefore, task-specific augmentation choices are critical.
  • Adversarial and hidden-space augmentations can enhance robustness, but may not always translate to improved accuracy.
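To illustrate the hidden-space augmentation mentioned in the last point, a mixup-style interpolation of hidden representations and one-hot labels can be sketched as follows. This is a pure-Python sketch; drawing the mixing coefficient from a Beta(alpha, alpha) distribution follows the standard mixup recipe, and the specific `alpha` value is an illustrative assumption:

```python
import random

def mixup_hidden(h_a, h_b, y_a, y_b, alpha=0.2):
    """Interpolate two hidden vectors and their one-hot labels.

    lam ~ Beta(alpha, alpha), as in the standard mixup recipe; the mixed
    pair (h_mix, y_mix) serves as an additional training example.
    """
    lam = random.betavariate(alpha, alpha)
    h_mix = [lam * a + (1 - lam) * b for a, b in zip(h_a, h_b)]
    y_mix = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
    return h_mix, y_mix
```

In practice the interpolation is applied to a model's intermediate layer outputs rather than raw inputs, so the augmented samples never correspond to real sentences; this is why such methods tend to improve robustness more reliably than raw accuracy.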

Future Directions and Challenges

The paper outlines future research pathways, including the development of data augmentation strategies that automatically adapt to specific tasks and datasets, facilitating more robust and generalizable NLP models. Incorporating theoretical guarantees to ensure augmentation methods preserve labels and do not alter data distributions is a key challenge identified for future work.

Additionally, exploring augmentation in conjunction with semi-supervised learning frameworks, like consistency training, continues to present fertile ground for enhancing NLP capabilities in low-resource settings.
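The consistency-training idea mentioned above penalizes disagreement between a model's predictions on an unlabeled example and on its augmented version. A minimal sketch of such a loss term, with `predict` and `augment` as placeholders for any trained model's softmax output and any augmentation function (e.g., back-translation), might look like:

```python
import math

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two discrete probability distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(predict, unlabeled_batch, augment):
    """Average KL between predictions on original and augmented inputs.

    `predict` maps a text to a probability distribution over labels;
    `augment` is any label-preserving augmentation. Both are hypothetical
    placeholders, not APIs from the surveyed paper.
    """
    total = 0.0
    for x in unlabeled_batch:
        total += kl_divergence(predict(x), predict(augment(x)))
    return total / len(unlabeled_batch)
```

Minimizing this term alongside the supervised loss encourages the model to give stable predictions under augmentation, which is how unlabeled data contributes in semi-supervised setups.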

The contributions of this survey offer a valuable resource for practitioners and researchers seeking to harness data augmentation techniques within NLP, providing insights into their diverse applications and impact on model performance in data-limited scenarios.

Authors (5)
  1. Jiaao Chen
  2. Derek Tam
  3. Colin Raffel
  4. Mohit Bansal
  5. Diyi Yang
Citations (151)