Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
41 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
41 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Exploring Data Augmentation Methods on Social Media Corpora (2303.02198v1)

Published 3 Mar 2023 in cs.CL

Abstract: Data augmentation has proven widely effective in computer vision. In NLP data augmentation remains an area of active research. There is no widely accepted augmentation technique that works well across tasks and model architectures. In this paper we explore data augmentation techniques in the context of text classification using two social media datasets. We explore popular varieties of data augmentation, starting with oversampling, Easy Data Augmentation (Wei and Zou, 2019) and Back-Translation (Sennrich et al., 2015). We also consider Greyscaling, a relatively unexplored data augmentation technique that seeks to mitigate the intensity of adjectives in examples. Finally, we consider a few-shot learning approach: Pattern-Exploiting Training (PET) (Schick et al., 2020). For the experiments we use a BERT transformer architecture. Results show that augmentation techniques provide only minimal and inconsistent improvements. Synonym replacement provided evidence of some performance improvement and adjective scales with Grayscaling is an area where further exploration would be valuable. Few-shot learning experiments show consistent improvement over supervised training, and seem very promising when classes are easily separable but further exploration would be valuable.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (2)