Data Augmentation using Large Language Models: Data Perspectives, Learning Paradigms and Challenges (2403.02990v4)

Published 5 Mar 2024 in cs.CL and cs.AI

Abstract: In the rapidly evolving field of LLMs, data augmentation (DA) has emerged as a pivotal technique for enhancing model performance by diversifying training examples without additional data collection. This survey explores the transformative impact of LLMs on DA, addressing the unique challenges and opportunities they present in NLP and beyond. From both data and learning perspectives, we examine strategies that utilize LLMs for data augmentation, including a novel exploration of learning paradigms in which LLM-generated data is used for diverse forms of further training. The paper also discusses the primary open challenges in this domain, ranging from controllable data augmentation to multi-modal data augmentation. By charting the paradigm shift that LLMs introduce to DA, this survey aims to serve as a comprehensive guide for researchers and practitioners.
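The core DA pattern the survey examines, prompting an LLM to rewrite existing training examples and reusing the original labels, can be sketched as follows. This is a minimal illustration, not the paper's method: `llm_paraphrase` is a hypothetical stand-in for a real LLM call (e.g., a prompt like "Rewrite this sentence, preserving its meaning"), implemented here as a trivial synonym substitution so the sketch runs offline.

```python
def llm_paraphrase(text: str) -> str:
    # Stand-in for an LLM paraphrasing call; a real system would send a
    # rewrite prompt to a model and return its generation.
    synonyms = {"movie": "film", "great": "excellent", "bad": "terrible"}
    return " ".join(synonyms.get(w, w) for w in text.split())

def augment(dataset, n_aug=1):
    """Return the original (text, label) pairs plus n_aug paraphrases
    per example, each paraphrase inheriting the original label."""
    augmented = list(dataset)
    for text, label in dataset:
        for _ in range(n_aug):
            augmented.append((llm_paraphrase(text), label))
    return augmented

train = [("a great movie", "pos"), ("a bad movie", "neg")]
aug = augment(train)  # originals first, then one paraphrase per example
```

The label-inheritance step is what distinguishes this from unconstrained generation: the augmented set stays usable for supervised training, while controllability of the generated text (one of the open challenges the survey raises) determines whether the inherited labels remain valid.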

Authors (10)
  1. Bosheng Ding
  2. Chengwei Qin
  3. Ruochen Zhao
  4. Tianze Luo
  5. Xinze Li
  6. Guizhen Chen
  7. Wenhan Xia
  8. Junjie Hu
  9. Anh Tuan Luu
  10. Shafiq Joty