Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias (2306.15895v2)

Published 28 Jun 2023 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs have been recently leveraged as training data generators for various NLP tasks. While previous research has explored different approaches to training models using generated data, they generally rely on simple class-conditional prompts, which may limit the diversity of the generated data and inherit systematic biases of LLM. Thus, we investigate training data generation with diversely attributed prompts (e.g., specifying attributes like length and style), which have the potential to yield diverse and attributed generated data. Our investigation focuses on datasets with high cardinality and diverse domains, wherein we demonstrate that attributed prompts outperform simple class-conditional prompts in terms of the resulting model's performance. Additionally, we present a comprehensive empirical study on data generation encompassing vital aspects like bias, diversity, and efficiency, and highlight three key observations: firstly, synthetic datasets generated by simple prompts exhibit significant biases, such as regional bias; secondly, attribute diversity plays a pivotal role in enhancing model performance; lastly, attributed prompts achieve the performance of simple class-conditional prompts while utilizing only 5% of the querying cost of ChatGPT associated with the latter. The data and code are available at https://github.com/yueyu1030/AttrPrompt.


Summary

  • The paper introduces AttrPrompt, an attributed prompting method that markedly increases data diversity and reduces regional bias relative to conventional class-conditional prompting.
  • Empirical validation on high-cardinality datasets shows that AttrPrompt matches or exceeds the performance of the simple-prompt baseline at only 5% of its querying cost.
  • Augmenting existing datasets with attributed synthetic data yields consistent gains on long-tail and multi-label classification while keeping the generation budget low.

Insights into "Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias"

The paper "LLM as Attributed Training Data Generator: A Tale of Diversity and Bias" offers an in-depth investigation into the generation of synthetic training data using LLMs with attributed prompts. The work addresses a significant issue in the current methodology of using LLM-generated data, which primarily relies on simple class-conditional prompts. These methods often result in a lack of diversity and the perpetuation of biases inherent to the LLMs. The authors propose a solution involving diversely attributed prompts, demonstrating that this approach can yield results that surpass class-conditional prompts in multiple facets.

Key Contributions and Findings:

  1. Attributed vs. Class-Conditional Prompts: The authors challenge the conventional use of class-conditional prompts (referred to as SimPrompt), which produce data with significant regional bias and limited diversity. They introduce AttrPrompt, which builds prompts from attributes such as length and style, tailored to each class. This both increases the diversity of the generated dataset and markedly reduces bias.
  2. Empirical Validation: Comprehensive experiments across high-cardinality datasets and diverse domains show that AttrPrompt outperforms SimPrompt. Notably, generating data with attributed prompts requires only 5% of the querying cost of SimPrompt to reach equivalent or superior downstream performance.
  3. Bias and Diversity Analysis: A pivotal part of the paper examines dataset bias and diversity. Datasets generated with SimPrompt exhibited pronounced biases toward certain regions, whereas AttrPrompt mitigated these biases and produced a more balanced attribute distribution, as validated through both manual annotation and trained attribute classifiers (a sketch of one common diversity measure appears after this list).
  4. Performance Implications: Models trained on AttrPrompt-generated data outperform those trained on SimPrompt-generated data, especially on long-tail classes. Augmenting existing datasets with attributed generated data also yields consistent performance gains.
  5. Cost Efficiency: AttrPrompt is markedly more budget-efficient: it reaches the same downstream performance with far fewer queries, without compromising data quality. This matters for practical settings where the querying budget is a binding constraint.
  6. Extension to Multi-Label Classification: The paper is an early attempt to leverage LLM-generated training data for multi-label classification. AttrPrompt again outperforms its counterparts across multi-label evaluation metrics, laying groundwork for future research in this direction (a generic multi-label fine-tuning sketch appears after this list).
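
As referenced in item 3, one common way to quantify the diversity of a generated corpus (not necessarily the paper's exact metric) is the average pairwise cosine similarity of sentence embeddings: the lower the average similarity, the more diverse the text. A minimal sketch, assuming the sentence-transformers package and an off-the-shelf embedding model:

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency

def avg_pairwise_cosine(texts: list[str]) -> float:
    """Mean cosine similarity over all distinct pairs; lower = more diverse."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
    emb = model.encode(texts, normalize_embeddings=True)  # unit-norm rows
    sims = emb @ emb.T                                    # cosine similarity matrix
    n = len(texts)
    # Average the strict upper triangle, excluding self-similarity on the diagonal.
    return float(sims[np.triu_indices(n, k=1)].mean())

# Under the paper's findings, a SimPrompt-generated sample would be expected
# to score noticeably higher (less diverse) than an AttrPrompt-generated one.
```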
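For the multi-label extension in item 6, the summary does not specify the exact training recipe; a standard approach is to fine-tune an encoder with one sigmoid output per class and binary cross-entropy over multi-hot label vectors. A hedged sketch using Hugging Face Transformers, with the backbone, label count, and examples as placeholders:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NUM_LABELS = 5  # placeholder; set to the task's actual label count

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=NUM_LABELS,
    problem_type="multi_label_classification",  # selects BCE-with-logits loss
)

# Toy generated examples paired with multi-hot label vectors (placeholders).
texts = ["sample generated document one", "sample generated document two"]
labels = torch.tensor([[1, 0, 1, 0, 0], [0, 1, 0, 0, 1]], dtype=torch.float)

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)  # loss here is BCEWithLogitsLoss
outputs.loss.backward()                  # one illustrative training step
```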

Implications and Future Directions:

The authors' treatment of attributed data generation suggests several directions for refining synthetic data generation techniques. The implications are multifaceted:

  • Increased Accessibility: By decreasing the costs associated with synthetic data generation, AttrPrompt may democratize access to quality training datasets, especially in resource-constrained environments.
  • Bias Reduction: The framework offers a promising avenue for tackling embedded biases in AI systems, a critical concern in the deployment of fair and reliable machine learning applications.
  • Diverse Application Potential: While focused on text classification, the concept of attributed prompts holds potential for broader application across different modalities and tasks, encouraging future exploration in domains such as image and audio processing.

In conclusion, "Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias" presents a methodologically sound, empirically validated approach to training data generation, one that improves model performance while addressing practical concerns around bias and cost efficiency. As LLMs evolve, techniques such as AttrPrompt are well positioned to shape how training data is produced for both research and deployment.