Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias (2306.15895v2)
Abstract: Large language models (LLMs) have recently been leveraged as training data generators for various NLP tasks. While previous research has explored different approaches to training models on generated data, it generally relies on simple class-conditional prompts, which may limit the diversity of the generated data and inherit the systematic biases of the LLM. We therefore investigate training data generation with diversely attributed prompts (e.g., prompts that specify attributes such as length and style), which have the potential to yield diverse and attributed generated data. Our investigation focuses on datasets with high cardinality and diverse domains, where we show that attributed prompts outperform simple class-conditional prompts in terms of the resulting model's performance. We also present a comprehensive empirical study of data generation covering vital aspects such as bias, diversity, and efficiency, and highlight three key observations: first, synthetic datasets generated by simple prompts exhibit significant biases, such as regional bias; second, attribute diversity plays a pivotal role in enhancing model performance; and third, attributed prompts match the performance of simple class-conditional prompts while using only 5\% of the ChatGPT querying cost of the latter. The data and code are available at \url{https://github.com/yueyu1030/AttrPrompt}.
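The sketch below illustrates the distinction the abstract draws between a simple class-conditional prompt and a diversely attributed prompt. It is a minimal illustration rather than the paper's released code: the class name, attribute names, and attribute values are hypothetical, and the LLM call assumes the OpenAI Python SDK's chat-completions interface (exact fields may differ across SDK versions).

```python
# Minimal sketch: class-conditional vs. attributed prompting for data generation.
# Assumptions: OpenAI Python SDK (>=1.0), OPENAI_API_KEY set in the environment,
# and illustrative attribute values (not the paper's actual attribute set).
import random
from openai import OpenAI

client = OpenAI()

CLASS_NAME = "sports"  # hypothetical target class for a news-topic dataset


def simple_prompt(class_name: str) -> str:
    # Class-conditional prompt: only the label conditions the generation.
    return f"Write a news article about {class_name}."


def attributed_prompt(class_name: str) -> str:
    # Attributed prompt: sample values for data attributes (e.g., length,
    # style, subtopic) so that repeated queries yield more diverse texts.
    length = random.choice(["short (~50 words)", "medium (~150 words)", "long (~300 words)"])
    style = random.choice(["formal report", "casual blog post", "interview transcript"])
    subtopic = random.choice(["tennis", "basketball", "marathon running"])
    return (
        f"Write a {length} news article about {class_name}, "
        f"focusing on {subtopic}, written as a {style}."
    )


def generate(prompt: str) -> str:
    # One synthetic training example per query.
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(generate(attributed_prompt(CLASS_NAME)))
```

Because the attribute values are resampled per query, each generated example is conditioned on a different combination of attributes, which is the mechanism the abstract credits for improved diversity at a fraction of the querying cost.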