Data Augmentation Approaches in Natural Language Processing: A Survey (2110.01852v3)

Published 5 Oct 2021 in cs.CL, cs.AI, and cs.LG

Abstract: As an effective strategy, data augmentation (DA) alleviates data scarcity scenarios where deep learning techniques may fail. It is widely applied in computer vision then introduced to natural language processing and achieves improvements in many tasks. One of the main focuses of the DA methods is to improve the diversity of training data, thereby helping the model to better generalize to unseen testing data. In this survey, we frame DA methods into three categories based on the diversity of augmented data, including paraphrasing, noising, and sampling. Our paper sets out to analyze DA methods in detail according to the above categories. Further, we also introduce their applications in NLP tasks as well as the challenges. Some helpful resources are provided in the appendix.

Overview of Data Augmentation Approaches in Natural Language Processing

The paper "Data Augmentation Approaches in Natural Language Processing: A Survey" offers a comprehensive review of data augmentation (DA) methods applied within NLP. The primary objective of DA is to alleviate data scarcity by generating additional training samples from existing data, thereby improving model performance and, in particular, the model's ability to generalize to unseen data. The authors categorize DA methods into three groups based on the diversity of the augmented data: paraphrasing-based, noising-based, and sampling-based methods. These categories reflect the underlying strategies by which augmentation is achieved, ranging from preserving semantic similarity to introducing variability and generating synthetic data.

Paraphrasing-Based Methods

Paraphrasing aims to create semantically similar variations of the data through transformations at the word, phrase, and sentence levels. The survey identifies several techniques in this category:

  • Thesauruses and Semantic Embeddings: Utilize pre-existing resources such as WordNet or word embeddings to replace words with synonyms or semantically similar counterparts (a minimal sketch follows this list).
  • Language Models: Leverage masked language models (e.g., BERT, RoBERTa) to predict and replace words in a context-sensitive manner.
  • Rules and Machine Translation: Apply linguistic rules, or translate a sentence into another language and back (back-translation), to generate variability and enrich the corpus with novel sentence structures.
  • Model Generation: Employ generative models (e.g., Seq2Seq models) trained on specific paraphrasing tasks to output diverse sentence variations.
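
As a concrete illustration of the thesaurus-based replacement above, here is a minimal sketch using NLTK's WordNet interface; the helper name, parameters, and example sentence are illustrative assumptions rather than anything prescribed by the survey.

```python
# A minimal sketch of thesaurus-based paraphrasing: swap words for WordNet
# synonyms. NLTK is assumed here purely for illustration; the survey does not
# prescribe a specific toolkit, and this helper is hypothetical.
import random

from nltk.corpus import wordnet  # requires nltk.download("wordnet") once


def synonym_replace(tokens, n_replacements=1):
    """Return a copy of `tokens` with up to `n_replacements` words swapped
    for a randomly chosen WordNet synonym."""
    augmented = list(tokens)
    candidates = [i for i, tok in enumerate(tokens) if wordnet.synsets(tok)]
    random.shuffle(candidates)
    replaced = 0
    for i in candidates:
        if replaced >= n_replacements:
            break
        # Collect lemma names from all synsets, excluding the original word.
        synonyms = {
            lemma.name().replace("_", " ")
            for synset in wordnet.synsets(augmented[i])
            for lemma in synset.lemmas()
        } - {augmented[i]}
        if synonyms:
            augmented[i] = random.choice(sorted(synonyms))
            replaced += 1
    return augmented


print(synonym_replace("the movie was good".split()))
# e.g. ['the', 'movie', 'was', 'near']
```

Naive replacement of this kind can drift semantically, which is one motivation for the context-aware masked language model substitution described above.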

Noising-Based Methods

These methods introduce controlled noise to the data, fostering robustness through exposure to less predictable inputs. Techniques include:

  • Swapping, Deletion, and Insertion: Manipulate the structure by reordering, removing, or adding components, which reduces the model's dependence on specific token orderings (see the sketch after this list).
  • Substitution: Replace elements with others that might initially seem incorrect or out of context, testing the model's robustness against common errors and noise.
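
The sketch below illustrates these noising operations at the token level, in the spirit of EDA-style augmentation; the function names and default rates are assumptions for illustration, not values taken from the survey.

```python
# Minimal token-level noising operations: random swap, deletion, insertion.
# Names and defaults are illustrative; real pipelines tune these rates per task.
import random


def random_swap(tokens):
    """Swap two randomly chosen token positions."""
    if len(tokens) < 2:
        return list(tokens)
    out = list(tokens)
    i, j = random.sample(range(len(out)), 2)
    out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, p=0.1):
    """Drop each token with probability p, always keeping at least one token."""
    kept = [tok for tok in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]


def random_insertion(tokens, filler="[UNK]"):
    """Insert a filler token at a random position (a stand-in for inserting a
    synonym of a random word, as in EDA-style insertion)."""
    out = list(tokens)
    out.insert(random.randrange(len(out) + 1), filler)
    return out


sentence = "the quick brown fox jumps over the lazy dog".split()
print(random_swap(sentence))
print(random_deletion(sentence, p=0.2))
print(random_insertion(sentence))
```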

Sampling-Based Methods

Sampling encompasses methods that generate data based on learned distributions, enhancing the diversity and extensiveness of the training set:

  • Rules-Based Generation: Apply heuristic-driven transformations to produce new samples consistent with the task's constraints.
  • Model-Based Approaches: Use non-pretrained or pretrained models to generate new samples directly from learned data distributions, with greater emphasis on covering novel regions of the semantic space.
  • Self-training and Mixup: Utilize unlabeled data to bolster labeled datasets, or blend existing samples in vector space to create synthetic examples (see the mixup sketch below).
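
As one concrete example from this group, the snippet below sketches mixup-style interpolation of two examples in embedding space; the array shapes, Beta parameter, and function name are illustrative assumptions rather than an API defined by the survey.

```python
# Minimal mixup sketch: blend two (embedding, one-hot label) pairs with a
# Beta-distributed mixing weight. Shapes and alpha are illustrative choices.
import numpy as np


def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Return a synthetic (embedding, label) pair interpolated between two inputs."""
    rng = rng if rng is not None else np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    x_mix = lam * x1 + (1.0 - lam) * x2
    y_mix = lam * y1 + (1.0 - lam) * y2
    return x_mix, y_mix


# Two toy sentence embeddings (dim 4) with one-hot labels for a 2-class task.
x_a, y_a = np.array([0.2, 0.1, 0.4, 0.3]), np.array([1.0, 0.0])
x_b, y_b = np.array([0.5, 0.7, 0.1, 0.9]), np.array([0.0, 1.0])
print(mixup(x_a, y_a, x_b, y_b))
```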

Applications and Implications

The application of these DA methods spans various NLP tasks, including text classification, generation, and structured prediction, each benefiting differently based on method characteristics. For instance, text classification favors simple and quick-to-deploy methods, while generation tasks leverage more diverse sampling strategies. Structured prediction relies on the syntactic consistency provided by paraphrasing methods.

The paper highlights several opportunities and ongoing challenges in the field of DA for NLP:

  • Theoretical understanding of DA's effects on model training and generalization is limited and requires further investigation.
  • In the context of large-scale pretrained models, exploring efficient and effective DA methods remains an open question.
  • Few-shot learning scenarios invite innovative DA strategies to bootstrap model training.

Future Directions

Future research is encouraged to develop DA methods that generalize across different NLP tasks, accommodating the intrinsic diversity and complexity of language. Further advances might integrate retrieval-based information to adapt the augmentation process dynamically, leveraging the robustness of adaptive LLMs. Additionally, extending DA frameworks to longer texts and lower-resource languages is expected to attract significant attention, given the need for adaptable models across diverse linguistic settings.

In conclusion, this survey underscores the significance of DA in enhancing NLP models' capabilities, providing a detailed taxonomy and analysis of current methodologies while posing pertinent inquiries for ongoing and future research in the field.

Authors (3)
  1. Bohan Li (87 papers)
  2. Yutai Hou (23 papers)
  3. Wanxiang Che (152 papers)
Citations (234)