Effects of diversity incentives on sample diversity and downstream model performance in LLM-based text augmentation (2401.06643v3)
Abstract: The latest generative LLMs have found application in data augmentation tasks, where small numbers of text samples are LLM-paraphrased and then used to fine-tune downstream models. However, more research is needed to assess how different prompts, seed data selection strategies, filtering methods, and model settings affect the quality of the paraphrased data (and of the downstream models). In this study, we investigate three text diversity incentive methods well established in crowdsourcing: taboo words, hints by previous outlier solutions, and chaining on previous outlier solutions. Using these incentive methods as part of instructions to LLMs augmenting text datasets, we measure their effects on the lexical diversity of the generated texts and on downstream model performance. We compare the effects across 5 different LLMs, 6 datasets, and 2 downstream models. We show that diversity is increased most by taboo words, while downstream model performance is highest with hints.
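To make the three incentive methods concrete, below is a minimal Python sketch of how they could be phrased as paraphrasing instructions to an LLM. The prompt wording, function names, and the example utterance are illustrative assumptions; the abstract does not give the paper's exact instruction templates.

```python
# A minimal sketch, assuming prompt phrasings of our own (the paper's exact
# instruction templates are not reproduced in the abstract). Each function
# builds one of the three diversity-incentive prompts for an LLM paraphraser.

def taboo_prompt(seed: str, taboo_words: list[str]) -> str:
    """Taboo words: forbid reusing salient words from earlier paraphrases."""
    return (f"Paraphrase the following sentence: '{seed}'. "
            f"Do not use the words: {', '.join(taboo_words)}.")

def hints_prompt(seed: str, outliers: list[str]) -> str:
    """Hints: show previous outlier (most dissimilar) paraphrases as inspiration."""
    examples = "; ".join(f"'{o}'" for o in outliers)
    return (f"Paraphrase the following sentence: '{seed}'. "
            f"Take inspiration from these diverse examples: {examples}.")

def chaining_prompt(previous_outlier: str) -> str:
    """Chaining: paraphrase the previous outlier solution instead of the seed."""
    return f"Paraphrase the following sentence: '{previous_outlier}'."

# Example usage on a hypothetical intent-classification utterance:
seed = "Book me a flight to Boston tomorrow"
print(taboo_prompt(seed, ["flight", "book"]))
print(hints_prompt(seed, ["Reserve a seat to Boston for me",
                          "I need to fly to Boston tomorrow"]))
print(chaining_prompt("Reserve a seat to Boston for me"))
```

The key design difference among the three: taboo words and hints modify the instruction around the same seed text, whereas chaining replaces the seed itself with a previous outlier output, so diversity compounds across rounds.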
Authors: Jan Cegin, Branislav Pecher, Jakub Simko, Ivan Srba, Maria Bielikova, Peter Brusilovsky