Under the Surface: Tracking the Artifactuality of LLM-Generated Data (2401.14698v2)

Published 26 Jan 2024 in cs.CL and cs.AI

Abstract: This work examines the expanding role of LLMs in generating artificial data. LLMs are increasingly employed to create a variety of outputs, including annotations, preferences, instruction prompts, simulated dialogues, and free text. Because these forms of LLM-generated data often intersect in their applications, they exert mutual influence on one another, raising significant concerns about the quality and diversity of the artificial data incorporated into training cycles and giving rise to an artificial data ecosystem. To the best of our knowledge, this is the first study to aggregate various types of LLM-generated text data, from tightly constrained data such as "task labels" to lightly constrained "free-form text". We then stress-test the quality and implications of LLM-generated artificial data, comparing it with human data across various existing benchmarks. Although artificial data can match human performance in aggregate, this paper reveals significant hidden disparities, especially on complex tasks where LLMs often miss the nuanced understanding intrinsic to human-generated content. This study critically examines diverse LLM-generated data and emphasizes the need for ethical practices both in creating data and in using LLMs. It highlights the LLMs' shortcomings in replicating human traits and behaviors, underscoring the importance of addressing the biases and artifacts in LLM-generated content in future research and development. All data and code are available on our project page.
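The abstract's central comparison, LLM-generated labels versus human annotations on shared benchmarks, can be illustrated with a small sketch. The snippet below is not the paper's code: the labels and data are hypothetical placeholders, and it only shows how raw agreement can look strong while a chance-corrected measure (Cohen's kappa) exposes the kind of hidden disparity the abstract describes.

# Minimal sketch (hypothetical data, not the paper's method): comparing
# LLM-generated "task labels" against human annotations on shared items.
from sklearn.metrics import accuracy_score, cohen_kappa_score

# Placeholder gold labels from human annotators and labels produced by an LLM
human_labels = ["pos", "neg", "neg", "pos", "neu", "pos", "neg", "neu"]
llm_labels   = ["pos", "neg", "pos", "pos", "neu", "pos", "neu", "neu"]

# Raw agreement can look strong ...
print(f"accuracy: {accuracy_score(human_labels, llm_labels):.2f}")

# ... while chance-corrected agreement is noticeably weaker, the sort of
# gap the paper reports on subjective or complex tasks.
print(f"kappa:    {cohen_kappa_score(human_labels, llm_labels):.2f}")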

Authors (17)
  1. Debarati Das (25 papers)
  2. Karin De Langis (10 papers)
  3. Jaehyung Kim (44 papers)
  4. Minhwa Lee (7 papers)
  5. Zae Myung Kim (15 papers)
  6. Risako Owan (3 papers)
  7. Bin Hu (217 papers)
  8. Ritik Parkar (1 paper)
  9. Ryan Koo (6 papers)
  10. Jonginn Park (1 paper)
  11. Aahan Tyagi (2 papers)
  12. Libby Ferland (2 papers)
  13. Sanjali Roy (1 paper)
  14. Vincent Liu (33 papers)
  15. Dongyeop Kang (72 papers)
  16. Anna Martin-Boyle (3 papers)
  17. Shirley Anugrah Hayati (13 papers)
Citations (12)