BoostAug: Boosting Text Augmentation via Hybrid Instance Filtering Framework (2210.02941v2)
Abstract: Text augmentation is an effective technique for addressing the problem of insufficient data in natural language processing. However, existing text augmentation methods tend to focus on few-shot scenarios and usually perform poorly on large public datasets. Our research indicates that existing augmentation methods often generate instances with shifted feature spaces, which leads to a drop in performance on the augmented data (for example, EDA generally loses $\approx 2\%$ in aspect-based sentiment classification). To address this problem, we propose a hybrid instance-filtering framework (BoostAug) based on pre-trained language models that maintains a feature space similar to that of natural datasets. BoostAug is transferable to existing text augmentation methods (such as synonym substitution and back translation) and significantly improves augmentation performance, by $\approx 2$–$3\%$ in classification accuracy. Our experimental results on three classification tasks and nine public datasets show that BoostAug addresses the performance-drop problem and outperforms state-of-the-art text augmentation methods. Additionally, we release our code to help improve existing augmentation methods on large datasets.
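The core idea described above — keeping only augmented instances whose features stay close to those of the natural data — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the real framework filters with pre-trained language-model features, whereas here a simple bag-of-words cosine similarity stands in as the surrogate feature space, and the `threshold` value is an arbitrary assumption for the example.

```python
# Hypothetical sketch of feature-space instance filtering for text
# augmentation. A bag-of-words vector stands in for the pre-trained
# language-model feature space used by the actual framework.
from collections import Counter
import math


def bow_vector(text: str) -> Counter:
    """Surrogate feature vector: lowercase token counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def filter_augmented(original: str, candidates: list[str],
                     threshold: float = 0.5) -> list[str]:
    """Keep only augmented instances whose surrogate feature vector
    remains close to the original instance's vector, discarding
    candidates whose feature space has shifted too far."""
    ref = bow_vector(original)
    return [c for c in candidates if cosine(ref, bow_vector(c)) >= threshold]


original = "the battery life of this laptop is great"
candidates = [
    "the battery life of this laptop is excellent",  # mild edit: kept
    "completely unrelated random words appear now",  # drifted: dropped
]
kept = filter_augmented(original, candidates)
```

Any augmentation back end (synonym substitution, back translation, etc.) can produce the `candidates` list; the filter is applied afterwards, which is what makes this kind of framework transferable across augmentation methods.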
- Data augmentation for text generation without any augmented data. In ACL/IJCNLP’21: Proc. of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 2223–2237. Association for Computational Linguistics.
- Using back-and-forth translation to create artificial augmented textual data for sentiment analysis models. Expert Syst. Appl., 178:115033.
- A large annotated corpus for learning natural language inference. In EMNLP’15: Proc. of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642. The Association for Computational Linguistics.
- Language models are few-shot learners. In NeurIPS’20: Advances in Neural Information Processing Systems.
- Neural data-to-text generation with LM-based text augmentation. In EACL’21: Proc. of the 16th Conference of the European Chapter of the Association for Computational Linguistics, pages 758–768. Association for Computational Linguistics.
- MixText: Linguistically-informed interpolation of hidden space for semi-supervised text classification. In ACL’20: Proc. of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2147–2157. Association for Computational Linguistics.
- Stanley F. Chen and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Comput. Speech Lang., 13(4):359–393.
- Claude Coulombe. 2018. Text data augmentation made simple by leveraging NLP cloud apis. CoRR, abs/1812.04718.
- Does syntax matter? A strong baseline for aspect-based sentiment analysis with RoBERTa. In NAACL-HLT’21: Proc. of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1816–1829. Association for Computational Linguistics.
- BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT’19: Proc. of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186. Association for Computational Linguistics.
- Data augmentation with adversarial training for cross-lingual NLI. In ACL/IJCNLP’21: Proc. of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 5158–5167. Association for Computational Linguistics.
- Ronald L. Graham. 1972. An efficient algorithm for determining the convex hull of a finite planar set. Inf. Process. Lett., 1(4):132–133.
- Text data augmentations: Permutation, antonyms and negation. Expert Syst. Appl., 177:114769.
- DeBERTa: Decoding-enhanced BERT with disentangled attention. In ICLR’21: 9th International Conference on Learning Representations. OpenReview.net.
- A challenge dataset and effective models for aspect-based sentiment analysis. In EMNLP-IJCNLP’19: Proc. of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 6279–6284. Association for Computational Linguistics.
- When chosen wisely, more data is what you need: A universal sample-efficient strategy for data augmentation. In ACL’22: Findings of the Association for Computational Linguistics, pages 1048–1062. Association for Computational Linguistics.
- AEDA: an easier data augmentation technique for text classification. In EMNLP’21: Findings of the Association for Computational Linguistics, pages 2748–2754. Association for Computational Linguistics.
- ALP: data augmentation using lexicalized PCFGs for few-shot text classification. In AAAI’22: Thirty-Sixth AAAI Conference on Artificial Intelligence, pages 10894–10902. AAAI Press.
- Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation. In NAACL-HLT’19: Proc. of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3609–3619. Association for Computational Linguistics.
- Data augmentation using pre-trained transformer models. CoRR, abs/2003.02245.
- BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL’20: Proc. of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880. Association for Computational Linguistics.
- PAQ: 65 million probably-asked questions and what you can do with them. Trans. Assoc. Comput. Linguistics, 9:1098–1115.
- Textbugger: Generating adversarial text against real-world applications. In NDSS’19: 26th Annual Network and Distributed System Security Symposium. The Internet Society.
- RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
- Don’t miss the labels: Label-semantic augmented meta-learner for few-shot text classification. In ACL/IJCNLP’21: Findings of the Association for Computational Linguistics, pages 2773–2782. Association for Computational Linguistics.
- Rotom: A meta-learned data augmentation framework for entity matching, data cleaning, text classification, and beyond. In SIGMOD’21: International Conference on Management of Data, pages 1303–1316. ACM.
- Nikolaos Mittas and Lefteris Angelis. 2013. Ranking and clustering software cost estimation models through a multiple comparisons algorithm. IEEE Trans. Software Eng., 39(4):537–551.
- SSMBA: self-supervised manifold based data augmentation for improving out-of-domain robustness. In EMNLP’20: Proc. of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 1268–1283. Association for Computational Linguistics.
- Tong Niu and Mohit Bansal. 2018. Adversarial over-sensitivity and over-stability strategies for dialogue models. In CoNLL’18: Proc. of the 22nd Conference on Computational Natural Language Learning, pages 486–496. Association for Computational Linguistics.
- Minh Hieu Phan and Philip O. Ogunbona. 2020. Modelling context and syntactical features for aspect-based sentiment analysis. In ACL’20: Proc. of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3211–3220. Association for Computational Linguistics.
- SemEval-2016 task 5: Aspect based sentiment analysis. In NAACL-HLT’16: Proc. of the 10th International Workshop on Semantic Evaluation, pages 19–30. The Association for Computational Linguistics.
- SemEval-2015 task 12: Aspect based sentiment analysis. In NAACL-HLT’15: Proc. of the 9th International Workshop on Semantic Evaluation, pages 486–495. The Association for Computational Linguistics.
- SemEval-2014 task 4: Aspect based sentiment analysis. In ACL’14: Proc. of the 8th International Workshop on Semantic Evaluation, pages 27–35. The Association for Computational Linguistics.
- Text AutoAugment: Learning compositional augmentation policy for text classification. In EMNLP’21: Proc. of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9029–9043. Association for Computational Linguistics.
- Rico Sennrich. 2012. Perplexity minimization for translation model domain adaptation in statistical machine translation. In EACL’12: 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549. The Association for Computational Linguistics.
- Improving neural machine translation models with monolingual data. In ACL’16: Proc. of the 54th Annual Meeting of the Association for Computational Linguistics. The Association for Computational Linguistics.
- Better robustness by more coverage: Adversarial and mixup data augmentation for robust finetuning. In ACL/IJCNLP’21: Findings of the Association for Computational Linguistics, pages 1569–1576. Association for Computational Linguistics.
- Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP’13: Proc. of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642. ACL.
- Logic-driven context extension and data augmentation for logical reasoning of text. In ACL’22: Findings of the Association for Computational Linguistics, pages 1619–1629. Association for Computational Linguistics.
- Attention-based LSTM for aspect-level sentiment classification. In EMNLP’16: Proc. of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606–615. The Association for Computational Linguistics.
- PromDA: Prompt-based data augmentation for low-resource NLU tasks. In ACL’22: Proc. of the 60th Annual Meeting of the Association for Computational Linguistics, pages 4242–4255. Association for Computational Linguistics.
- Jason W. Wei and Kai Zou. 2019. EDA: easy data augmentation techniques for boosting performance on text classification tasks. In EMNLP-IJCNLP’19: Proc. of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 6381–6387. Association for Computational Linguistics.
- A broad-coverage challenge corpus for sentence understanding through inference. In NAACL’18: Proc. of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics, pages 1112–1122. Association for Computational Linguistics.
- Text smoothing: Enhance various data augmentation methods on text classification tasks. In ACL’22: Proc. of the 60th Annual Meeting of the Association for Computational Linguistics, pages 871–875. Association for Computational Linguistics.
- Unsupervised data augmentation for consistency training. In NeurIPS’20: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems.
- Addressing resource and privacy constraints in semantic parsing through data augmentation. In ACL’22: Findings of the Association for Computational Linguistics, pages 3685–3695. Association for Computational Linguistics.
- GPT3Mix: Leveraging large-scale language models for text augmentation. In EMNLP’21: Findings of the Association for Computational Linguistics, pages 2225–2239. Association for Computational Linguistics.
- Improving Chinese grammatical error detection via data augmentation by conditional error generation. In ACL’22: Findings of the Association for Computational Linguistics, pages 2966–2975. Association for Computational Linguistics.
- Aspect-based sentiment classification with aspect-specific graph convolutional networks. In EMNLP-IJCNLP’19: Proc. of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 4567–4577. Association for Computational Linguistics.
- Character-level convolutional networks for text classification. In NeurIPS’15: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems, pages 649–657.
- FlipDA: Effective and robust data augmentation for few-shot learning. In ACL’22: Proc. of the 60th Annual Meeting of the Association for Computational Linguistics, pages 8646–8665. Association for Computational Linguistics.
- MELM: data augmentation with masked entity language modeling for low-resource NER. In ACL’22: Proc. of the 60th Annual Meeting of the Association for Computational Linguistics, pages 2251–2262. Association for Computational Linguistics.