On Evaluation Protocols for Data Augmentation in a Limited Data Scenario (2402.14895v2)
Abstract: Textual data augmentation (DA) is a prolific field of study in which novel techniques for creating artificial data are regularly proposed, and one that has demonstrated great efficiency in small-data settings, at least for text classification tasks. In this paper, we challenge those results, showing that classical data augmentation (which modifies sentences) is simply a way of performing better fine-tuning, and that spending more time fine-tuning before applying data augmentation negates its effect. This is a significant contribution, as it answers several questions that were left open in recent years, namely: which DA technique performs best (all of them, as long as they generate data close enough to the training set so as not to impair training) and why DA showed positive results (it facilitates training of the network). We further show that zero- and few-shot DA via conversational agents such as ChatGPT or Llama2 can increase performance, confirming that this form of data augmentation is preferable to classical methods.
Authors: Frédéric Piedboeuf, Philippe Langlais