
Investigating Masking-based Data Generation in Language Models (2307.00008v1)

Published 16 Jun 2023 in cs.CL and cs.AI

Abstract: The current era of NLP has been defined by the prominence of pre-trained language models since the advent of BERT. A feature of BERT and models with similar architecture is the objective of masked language modeling, in which part of the input is intentionally masked and the model is trained to predict the masked information. Data augmentation is a data-driven technique widely used in machine learning, including research areas like computer vision and natural language processing, to improve model performance by artificially enlarging the training data set with designated techniques. Masked language modeling (MLM), an essential training feature of BERT, has introduced a novel approach to performing effective pre-training of Transformer-based models for natural language processing tasks. Recent studies have utilized masked language models to generate artificially augmented data for NLP downstream tasks. The experimental results show that masking-based data augmentation provides a simple but efficient approach to improving model performance. In this paper, we explore and discuss the broader utilization of these data augmentation methods based on MLM.

Investigating Masking-based Data Generation in Language Models

The preeminent role of pre-trained language models (PLMs) in NLP is undeniable, with BERT-based architectures significantly altering the landscape. Central to these architectures is masked language modeling (MLM), an objective that trains models to predict intentionally masked portions of input sequences. "Investigating Masking-based Data Generation in Language Models" examines the utility of MLM for data augmentation in downstream NLP tasks, a practice growing in popularity for its ability to improve model performance with artificially generated training data.
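
To make the objective concrete, the minimal sketch below queries a pre-trained masked language model for the token hidden behind a mask. The Hugging Face pipeline, the bert-base-uncased checkpoint, and the example sentence are illustrative choices, not taken from the paper.

```python
# Minimal illustration of the MLM objective: predict the token behind [MASK]
# from bidirectional context. Model choice and example text are assumptions.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The service at the restaurant was [MASK]."):
    # Each prediction carries the proposed token and its probability score.
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")
```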

Overview and Context

The paper acknowledges the increasing reliance on PLMs such as BERT, RoBERTa, XLNet, BART, and T5, and reviews the MLM principles underlying them. The bidirectional context understanding inherent to these models enables a richer grasp of language nuances, facilitating strong performance across NLP tasks. The authors emphasize that high-quality annotated data is paramount for achieving notable results in machine learning, reinforcing the idea that the patterns and contextual cues present in the data directly influence model training outcomes.

However, obtaining volumes of annotated data remains a cost-intensive challenge, propelling explorations into cost-effective augmentation methods. Data augmentation, including rule-based and model-assisted techniques, aims to enrich training datasets, producing linguistically valid and sufficiently diverse instances to enhance model generalization and performance.

Masking and Data Augmentation

The focus on masking-based data augmentation follows directly from the MLM objective itself. The authors categorize data augmentation techniques into paraphrasing, noising, and sampling methods, evaluating each for its semantic fidelity, diversity, and efficacy when used with PLMs.
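
As a point of contrast with masking-based methods, the sketch below shows two simple noising operations (random swap and random deletion) in the spirit of EDA-style augmentation; the specific operations and rates are generic illustrations, not the configurations studied in the paper.

```python
# Illustrative "noising" augmentations (EDA-style); rates and operations are
# generic examples rather than the paper's exact setup.
import random

random.seed(0)  # keep the sketch reproducible

def random_swap(tokens, n_swaps=1):
    """Swap two randomly chosen token positions n_swaps times."""
    tokens = tokens.copy()
    for _ in range(n_swaps):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_delete(tokens, p=0.1):
    """Drop each token with probability p, keeping at least one token."""
    kept = [t for t in tokens if random.random() > p]
    return kept if kept else [random.choice(tokens)]

sentence = "the plot was thin but the performances were strong".split()
print(" ".join(random_swap(sentence)))
print(" ".join(random_delete(sentence)))
```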

Masking-based data augmentation harnesses pre-trained masked language models to exert fine-grained control over how training examples are altered. Unlike static transformations or paraphrase generation, which may introduce artifacts unrepresentative of the natural language distribution, masking-based augmentation replaces tokens with context-conditioned predictions, so the alterations stay within the distribution that a model such as BERT has learned.
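
A hedged sketch of this idea follows: tokens are masked at random and refilled with the MLM's top in-context prediction to produce label-preserving variants. The masking rate, the one-token-at-a-time strategy, and the model checkpoint are assumptions made for illustration, not the paper's prescribed procedure.

```python
# Sketch of masking-based augmentation: mask random tokens and let a pre-trained
# MLM refill them from context. Rate, sampling, and model are assumptions.
import random
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
MASK = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT

def augment(sentence: str, mask_rate: float = 0.15, seed: int = 0) -> str:
    """Return one augmented variant of `sentence` by masking and refilling words."""
    rng = random.Random(seed)
    tokens = sentence.split()
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            masked = tokens.copy()
            masked[i] = MASK
            # Use the MLM's top prediction for the masked slot as the replacement.
            tokens[i] = fill_mask(" ".join(masked))[0]["token_str"]
    return " ".join(tokens)

print(augment("the movie was surprisingly good and the cast was excellent"))
```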

Implications and Forward-Looking Perspective

Analyzing results from existing methodologies, the authors conclude that mask-based augmentation furnishes a straightforward, efficient strategy for improving the robustness and versatility of NLP models. The practical gains observed in dialog act tagging and sentiment analysis point to potential extensions into more complex NLP scenarios.

Looking toward broader implementations, incorporating mask-based augmentation into large generative frameworks such as GPT-3 would substantially increase computational cost but promises more sophisticated generation capabilities. Adapting current paradigms to these advanced frameworks could bridge the gap between supervised learning dependencies and the benefits of unsupervised, data-driven augmentation.

Furthermore, emerging PLM architectures that diverge from the traditional MLM objective signal a trajectory toward multifunctional, adaptive models. Applying mask-based strategies in these newer architectures would likely yield more diverse training signals, ultimately carrying through to better handling of task variation and stronger context-aware generation.

Conclusion

The investigation conducted in this paper delineates a pathway for utilizing masked language models in data augmentation scenarios, offering insights and methodologies useful to researchers and practitioners building the next generation of NLP systems. By pairing masking strategies with robust model architectures, the field can extend the practicality of these models in real-world applications, progressively driving toward more capable language understanding systems.

Authors

  1. Ed S. Ma