Data Augmentation Approaches in NLP: A Comprehensive Survey
The paper under review presents a detailed survey of data augmentation (DA) techniques applied to NLP tasks. As interest in NLP grows, driven by the emergence of large pretrained models and the expansion into diverse, low-resource domains, DA has become a pivotal mechanism for enhancing model training without collecting new data. The survey systematically categorizes existing DA methodologies, focusing on their applicability across various NLP applications and tasks.
Background and Motivation
Data augmentation in machine learning aims to increase the diversity of training data, improving model generalization by introducing synthetic variations of existing examples. While DA is well established in computer vision, its application in NLP poses unique challenges because language data is discrete. Despite these challenges, DA in NLP is gaining traction, especially for extending models to low-resource scenarios.
Methodological Breakdown
The paper categorizes DA techniques into three primary methodologies: rule-based, example interpolation, and model-based techniques.
- Rule-Based Techniques: These methods apply predetermined transformations such as token-level manipulations (e.g., synonym replacement, random token swaps). They are easy to implement, but the performance gains they yield are often incremental; a minimal sketch appears after this list.
- Example Interpolation Techniques: Inspired by MixUp, which originated in computer vision, these techniques interpolate between pairs of examples. Because raw text is discrete, NLP adaptations typically interpolate in embedding or hidden-state space rather than on the input tokens themselves (see the MixUp sketch after this list).
- Model-Based Techniques: These techniques use trained models, such as seq2seq architectures and large pretrained language models, to generate new examples. Backtranslation is a prominent instance, and leveraging pretrained generators can yield substantial performance gains (a backtranslation sketch follows the list).
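As a concrete illustration of the rule-based category, the snippet below sketches two common token-level operations, random swap and synonym replacement. It is a minimal sketch rather than code from the survey; the synonym dictionary is a placeholder that would in practice come from a resource such as WordNet.

```python
import random

def random_swap(tokens, n_swaps=1, seed=None):
    """Swap the positions of n_swaps randomly chosen token pairs."""
    rng = random.Random(seed)
    tokens = list(tokens)
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def synonym_replace(tokens, synonyms, p=0.1, seed=None):
    """Replace each token that has an entry in `synonyms` with probability p."""
    rng = random.Random(seed)
    return [rng.choice(synonyms[t]) if t in synonyms and rng.random() < p else t
            for t in tokens]
```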
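For the interpolation category, the following sketch shows the core MixUp operation applied to embedding vectors and one-hot labels. The Beta-distributed mixing coefficient follows the original MixUp formulation, but the function itself is illustrative and not taken from the survey.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Return a convex combination of two examples and their label vectors.

    x1, x2: embedding vectors (e.g., sentence or averaged token embeddings)
    y1, y2: one-hot label vectors
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)        # mixing coefficient drawn from Beta(alpha, alpha)
    x_mix = lam * x1 + (1 - lam) * x2   # interpolate inputs in embedding space
    y_mix = lam * y1 + (1 - lam) * y2   # interpolate labels with the same coefficient
    return x_mix, y_mix
```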
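For the model-based category, backtranslation can be sketched as a round trip through a pivot language. The example below assumes the Hugging Face transformers library and the Helsinki-NLP opus-mt English/German checkpoints; any other translation models (or an external MT service) would serve the same purpose.

```python
from transformers import pipeline

# Pivot-language translators; these specific checkpoints are an assumption, not the survey's choice.
en_to_de = pipeline("translation", model="Helsinki-NLP/opus-mt-en-de")
de_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

def backtranslate(sentence: str) -> str:
    """Paraphrase a sentence by translating it to German and back to English."""
    pivot = en_to_de(sentence)[0]["translation_text"]
    return de_to_en(pivot)[0]["translation_text"]

print(backtranslate("The quick brown fox jumps over the lazy dog."))
```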
Applications in NLP
The paper explores DA applications in various NLP scenarios, providing insights into its role in boosting performance across different tasks:
- Low-Resource Languages: DA can exploit properties of related high-resource languages and use techniques such as backtranslation to generate synthetic parallel data for translation models.
- Bias Mitigation: Counterfactual data augmentation, which swaps demographic terms to create balanced counterfactual examples, can reduce biases in model outputs, notably in gender bias scenarios (a minimal sketch follows the list).
- Class Imbalance Correction: Techniques like SMOTE interpolate between minority-class samples to produce a more balanced dataset and improve classification outcomes (see the sketch after this list).
- Few-Shot Learning: DA can simulate additional examples for novel classes, promoting better generalization in training models with limited data.
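A minimal sketch of counterfactual data augmentation for gender bias is shown below. The word-pair lexicon is a small illustrative placeholder; published counterfactual augmentation work uses curated lists and also handles names and grammatical agreement.

```python
# Tiny gendered-term lexicon; a real lexicon would be far larger.
GENDER_PAIRS = {"he": "she", "she": "he", "him": "her",
                "his": "her", "man": "woman", "woman": "man"}

def gender_swap(tokens):
    """Return a counterfactual copy of a sentence with gendered terms swapped (case is ignored)."""
    return [GENDER_PAIRS.get(t.lower(), t) for t in tokens]

# Both the original and the counterfactual copy are kept in the training set.
print(gender_swap(["He", "is", "a", "doctor"]))  # -> ['she', 'is', 'a', 'doctor']
```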
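The SMOTE interpolation step mentioned above can be sketched as follows for feature vectors (e.g., sentence embeddings). This is an illustrative implementation of the core idea; in practice a library such as imbalanced-learn would typically be used.

```python
import numpy as np

def smote_sample(minority_X, k=5, rng=None):
    """Create one synthetic minority-class point by interpolating toward a nearest neighbour."""
    rng = rng or np.random.default_rng()
    i = rng.integers(len(minority_X))
    x = minority_X[i]
    dists = np.linalg.norm(minority_X - x, axis=1)   # distances to the other minority points
    neighbours = np.argsort(dists)[1:k + 1]          # k nearest neighbours, excluding x itself
    x_nn = minority_X[rng.choice(neighbours)]
    lam = rng.random()                               # interpolation coefficient in [0, 1)
    return x + lam * (x_nn - x)
```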
Challenges and Future Directions
The paper highlights several challenges and future directions for DA in NLP:
- Theoretical Grounding: While DA is empirically effective, there is a noticeable gap in understanding the underlying principles governing its success in NLP settings.
- Pretrained Model Integration: How DA interacts with large pretrained models, and in particular whether it still helps on in-domain versus out-of-domain tasks, needs further investigation.
- Multimodal Augmentation: Coordinated augmentation across multiple modalities, such as text and image, remains an open area, with potential methodologies derived from computer vision innovations.
- Domain-Specific Challenges: Applying DA to specialized fields like medical texts or low-resource languages without extensive pretrained resources presents unique difficulties.
Conclusion
Overall, the survey serves as a comprehensive guide, synthesizing the current landscape of DA methods in NLP and paving the way for further research. Continued exploration and stronger theoretical grounding could unlock new potential for adapting models to diverse and resource-constrained environments. The accompanying GitHub repository is a valuable resource for ongoing updates and community engagement in this evolving field.