Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks
Abstract: LLMs are remarkable data annotators. They can be used to generate high-fidelity supervised training data, as well as survey and experimental data. With the widespread adoption of LLMs, human gold-standard annotations are key to understanding the capabilities of LLMs and the validity of their results. However, crowdsourcing, an important and inexpensive way to obtain human annotations, may itself be affected by LLMs, as crowd workers have financial incentives to use LLMs to increase their productivity and income. To investigate this concern, we conducted a case study on the prevalence of LLM usage by crowd workers. We reran an abstract summarization task from the literature on Amazon Mechanical Turk and, through a combination of keystroke detection and synthetic-text classification, estimated that 33-46% of crowd workers used LLMs when completing the task. Although generalization to other, less LLM-friendly tasks is unclear, our results call for platforms, researchers, and crowd workers to find new ways to ensure that human data remain human, perhaps using the methodology proposed here as a stepping stone. Code/data: https://github.com/epfl-dlab/GPTurk
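The two signals the abstract mentions can be combined into a simple per-submission decision rule. The sketch below is a minimal, hypothetical illustration (the function name, thresholds, and inputs are assumptions, not the paper's actual implementation): a submission is flagged if its keystroke count is too low to have produced the text by typing (suggesting a paste), or if a synthetic-text classifier assigns it a high probability of being machine-generated.

```python
# Hypothetical sketch of the flagging logic described in the abstract.
# `keystrokes`: number of recorded keypresses while composing the answer.
# `clf_score`: probability from some synthetic-text classifier (assumed given).
# Thresholds are illustrative defaults, not values from the paper.

def likely_llm_use(text: str, keystrokes: int, clf_score: float,
                   paste_ratio: float = 0.5, clf_threshold: float = 0.5) -> bool:
    # Far fewer keypresses than characters suggests the text was pasted.
    pasted = keystrokes < paste_ratio * len(text)
    # Classifier independently flags text that looks machine-generated.
    synthetic = clf_score >= clf_threshold
    return pasted or synthetic

# Toy data: (submission text, keystroke count, classifier score)
submissions = [
    ("A long pasted summary ...", 3, 0.91),    # pasted and classifier fires
    ("typed summary word by word", 30, 0.12),  # typed, classifier score low
]
flags = [likely_llm_use(t, k, s) for t, k, s in submissions]
prevalence = sum(flags) / len(flags)  # fraction of flagged submissions
```

In the toy run above the first submission is flagged and the second is not, giving a prevalence of 0.5; aggregating such per-submission flags over all workers is one simple way to arrive at a prevalence estimate like the 33-46% range reported.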