BootAug: Boosting Text Augmentation via Hybrid Instance Filtering Framework (2210.02941v2)

Published 6 Oct 2022 in cs.CL

Abstract: Text augmentation is an effective technique for addressing the problem of insufficient data in natural language processing. However, existing text augmentation methods tend to focus on few-shot scenarios and usually perform poorly on large public datasets. Our research indicates that existing augmentation methods often generate instances whose feature space is shifted away from that of the natural data, which leads to a performance drop on the augmented data (for example, EDA generally loses $\approx 2\%$ accuracy in aspect-based sentiment classification). To address this problem, we propose BootAug, a hybrid instance-filtering framework based on pre-trained language models that keeps augmented instances in a feature space similar to that of natural datasets. BootAug is transferable to existing text augmentation methods (such as synonym substitution and back translation) and significantly improves augmentation performance, by $\approx 2-3\%$ in classification accuracy. Experimental results on three classification tasks and nine public datasets show that BootAug resolves the performance-drop problem and outperforms state-of-the-art text augmentation methods. We also release our code to help improve existing augmentation methods on large datasets.
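
The abstract names the filtering idea but not its exact criteria, so below is a minimal sketch under stated assumptions: candidates from a toy synonym-substitution augmenter are kept only if a pre-trained model places them close to the original instance in feature space (sentence-embedding cosine similarity) and judges them reasonably natural (GPT-2 perplexity). The model choices (`all-MiniLM-L6-v2`, `gpt2`), the thresholds, and the synonym table are illustrative assumptions, not the paper's actual components.

```python
# Hypothetical sketch of PLM-based instance filtering for text augmentation.
# Assumed filters: (a) embedding similarity to the original instance, so that
# augmented data stays in a similar feature space, and (b) LM perplexity,
# so that disfluent candidates are rejected. Thresholds are illustrative.
import math
import random

import torch
from sentence_transformers import SentenceTransformer, util
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # feature-space encoder
lm_tok = GPT2TokenizerFast.from_pretrained("gpt2")  # naturalness scorer
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()


def synonym_substitute(text: str, n: int = 2) -> str:
    """Toy EDA-style augmenter: replace words via a tiny synonym table."""
    synonyms = {"good": "great", "bad": "poor", "movie": "film"}  # illustrative
    words = text.split()
    for _ in range(n):
        i = random.randrange(len(words))
        words[i] = synonyms.get(words[i].lower(), words[i])
    return " ".join(words)


@torch.no_grad()
def perplexity(text: str) -> float:
    """GPT-2 perplexity: exp of the mean token cross-entropy."""
    ids = lm_tok(text, return_tensors="pt").input_ids
    return math.exp(lm(ids, labels=ids).loss.item())


def keep(original: str, candidate: str,
         min_sim: float = 0.8, max_ppl: float = 500.0) -> bool:
    """Hybrid filter: reject candidates that drift in feature space or read unnaturally."""
    emb = encoder.encode([original, candidate], convert_to_tensor=True)
    sim = util.cos_sim(emb[0], emb[1]).item()
    return sim >= min_sim and perplexity(candidate) <= max_ppl


sentence = "The movie was good but the acting felt bad"
candidates = {synonym_substitute(sentence) for _ in range(8)}
augmented = [c for c in candidates if c != sentence and keep(sentence, c)]
print(augmented)
```

In this sketch, the similarity gate counters the feature-space shift described above, while the perplexity gate discards disfluent outputs that substitution-based augmenters tend to produce; the same `keep` filter could wrap back-translated candidates unchanged, which is the sense in which such filtering transfers across augmentation methods.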

References (52)
  1. Data augmentation for text generation without any augmented data. In ACL/IJCNLP’21: Proc. of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 2223–2237. Association for Computational Linguistics.
  2. Using back-and-forth translation to create artificial augmented textual data for sentiment analysis models. Expert Syst. Appl., 178:115033.
  3. A large annotated corpus for learning natural language inference. In EMNLP’15: Proc. of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642. The Association for Computational Linguistics.
  4. Language models are few-shot learners. In NeurIPS’20: Advances in Neural Information Processing Systems.
  5. Neural data-to-text generation with LM-based text augmentation. In EACL’21: Proc. of the 16th Conference of the European Chapter of the Association for Computational Linguistics, pages 758–768. Association for Computational Linguistics.
  6. MixText: Linguistically-informed interpolation of hidden space for semi-supervised text classification. In ACL’20: Proc. of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2147–2157. Association for Computational Linguistics.
  7. Stanley F. Chen and Joshua Goodman. 1999. An empirical study of smoothing techniques for language modeling. Comput. Speech Lang., 13(4):359–393.
  8. Claude Coulombe. 2018. Text data augmentation made simple by leveraging NLP cloud APIs. CoRR, abs/1812.04718.
  9. Does syntax matter? A strong baseline for aspect-based sentiment analysis with RoBERTa. In NAACL-HLT’21: Proc. of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1816–1829. Association for Computational Linguistics.
  10. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT’19: Proc. of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186. Association for Computational Linguistics.
  11. Data augmentation with adversarial training for cross-lingual NLI. In ACL/IJCNLP’21: Proc. of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 5158–5167. Association for Computational Linguistics.
  12. Ronald L. Graham. 1972. An efficient algorithm for determining the convex hull of a finite planar set. Inf. Process. Lett., 1(4):132–133.
  13. Text data augmentations: Permutation, antonyms and negation. Expert Syst. Appl., 177:114769.
  14. DeBERTa: Decoding-enhanced BERT with disentangled attention. In ICLR’21: 9th International Conference on Learning Representations. OpenReview.net.
  15. A challenge dataset and effective models for aspect-based sentiment analysis. In EMNLP-IJCNLP’19: Proc. of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 6279–6284. Association for Computational Linguistics.
  16. When chosen wisely, more data is what you need: A universal sample-efficient strategy for data augmentation. In ACL’22: Findings of the Association for Computational Linguistics, pages 1048–1062. Association for Computational Linguistics.
  17. AEDA: an easier data augmentation technique for text classification. In EMNLP’21: Findings of the Association for Computational Linguistics, pages 2748–2754. Association for Computational Linguistics.
  18. ALP: data augmentation using lexicalized PCFGs for few-shot text classification. In AAAI’22: Thirty-Sixth AAAI Conference on Artificial Intelligence, pages 10894–10902. AAAI Press.
  19. Submodular optimization-based diverse paraphrasing and its effectiveness in data augmentation. In NAACL-HLT’19: Proc. of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3609–3619. Association for Computational Linguistics.
  20. Data augmentation using pre-trained transformer models. CoRR, abs/2003.02245.
  21. BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In ACL’20: Proc. of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7871–7880. Association for Computational Linguistics.
  22. PAQ: 65 million probably-asked questions and what you can do with them. Trans. Assoc. Comput. Linguistics, 9:1098–1115.
  23. TextBugger: Generating adversarial text against real-world applications. In NDSS’19: 26th Annual Network and Distributed System Security Symposium. The Internet Society.
  24. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.
  25. Don’t miss the labels: Label-semantic augmented meta-learner for few-shot text classification. In ACL/IJCNLP’21: Findings of the Association for Computational Linguistics, pages 2773–2782. Association for Computational Linguistics.
  26. Rotom: A meta-learned data augmentation framework for entity matching, data cleaning, text classification, and beyond. In SIGMOD’21: International Conference on Management of Data, pages 1303–1316. ACM.
  27. Nikolaos Mittas and Lefteris Angelis. 2013. Ranking and clustering software cost estimation models through a multiple comparisons algorithm. IEEE Trans. Software Eng., 39(4):537–551.
  28. SSMBA: self-supervised manifold based data augmentation for improving out-of-domain robustness. In EMNLP’20: Proc. of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 1268–1283. Association for Computational Linguistics.
  29. Tong Niu and Mohit Bansal. 2018. Adversarial over-sensitivity and over-stability strategies for dialogue models. In CoNLL’18: Proc. of the 22nd Conference on Computational Natural Language Learning, pages 486–496. Association for Computational Linguistics.
  30. Minh Hieu Phan and Philip O. Ogunbona. 2020. Modelling context and syntactical features for aspect-based sentiment analysis. In ACL’20: Proc. of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3211–3220. Association for Computational Linguistics.
  31. SemEval-2016 task 5: Aspect based sentiment analysis. In NAACL-HLT’16: Proc. of the 10th International Workshop on Semantic Evaluation, pages 19–30. Association for Computational Linguistics.
  32. SemEval-2015 task 12: Aspect based sentiment analysis. In NAACL-HLT’15: Proc. of the 9th International Workshop on Semantic Evaluation, pages 486–495. Association for Computational Linguistics.
  33. SemEval-2014 task 4: Aspect based sentiment analysis. In COLING’14: Proc. of the 8th International Workshop on Semantic Evaluation, pages 27–35. Association for Computational Linguistics.
  34. Text AutoAugment: Learning compositional augmentation policy for text classification. In EMNLP’21: Proc. of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9029–9043. Association for Computational Linguistics.
  35. Rico Sennrich. 2012. Perplexity minimization for translation model domain adaptation in statistical machine translation. In EACL’12: 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 539–549. Association for Computational Linguistics.
  36. Improving neural machine translation models with monolingual data. In ACL’16: Proc. of the 54th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
  37. Better robustness by more coverage: Adversarial and mixup data augmentation for robust finetuning. In ACL/IJCNLP’21: Findings of the Association for Computational Linguistics, pages 1569–1576. Association for Computational Linguistics.
  38. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP’13: Proc. of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642. ACL.
  39. Logic-driven context extension and data augmentation for logical reasoning of text. In ACL’22: Findings of the Association for Computational Linguistics, pages 1619–1629. Association for Computational Linguistics.
  40. Attention-based LSTM for aspect-level sentiment classification. In EMNLP’16: Proc. of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606–615. The Association for Computational Linguistics.
  41. PromDA: Prompt-based data augmentation for low-resource NLU tasks. In ACL’22: Proc. of the 60th Annual Meeting of the Association for Computational Linguistics, pages 4242–4255. Association for Computational Linguistics.
  42. Jason W. Wei and Kai Zou. 2019. EDA: easy data augmentation techniques for boosting performance on text classification tasks. In EMNLP-IJCNLP’19: Proc. of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 6381–6387. Association for Computational Linguistics.
  43. A broad-coverage challenge corpus for sentence understanding through inference. In NAACL’18: Proc. of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics, pages 1112–1122. Association for Computational Linguistics.
  44. Text smoothing: Enhance various data augmentation methods on text classification tasks. In ACL’22: Proc. of the 60th Annual Meeting of the Association for Computational Linguistics, pages 871–875. Association for Computational Linguistics.
  45. Unsupervised data augmentation for consistency training. In NeurIPS’20: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems.
  46. Addressing resource and privacy constraints in semantic parsing through data augmentation. In ACL’22: Findings of the Association for Computational Linguistics, pages 3685–3695. Association for Computational Linguistics.
  47. GPT3Mix: Leveraging large-scale language models for text augmentation. In EMNLP’21: Findings of the Association for Computational Linguistics, pages 2225–2239. Association for Computational Linguistics.
  48. Improving Chinese grammatical error detection via data augmentation by conditional error generation. In ACL’22: Findings of the Association for Computational Linguistics, pages 2966–2975. Association for Computational Linguistics.
  49. Aspect-based sentiment classification with aspect-specific graph convolutional networks. In EMNLP-IJCNLP’19: Proc. of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pages 4567–4577. Association for Computational Linguistics.
  50. Character-level convolutional networks for text classification. In NeurIPS’15: Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems, pages 649–657.
  51. FlipDA: Effective and robust data augmentation for few-shot learning. In ACL’22: Proc. of the 60th Annual Meeting of the Association for Computational Linguistics, pages 8646–8665. Association for Computational Linguistics.
  52. MELM: data augmentation with masked entity language modeling for low-resource NER. In ACL’22: Proc. of the 60th Annual Meeting of the Association for Computational Linguistics, pages 2251–2262. Association for Computational Linguistics.
Authors (2)
  1. Heng Yang (72 papers)
  2. Ke Li (722 papers)
Citations (4)