
Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks (2306.07899v1)

Published 13 Jun 2023 in cs.CL and cs.CY

Abstract: LLMs are remarkable data annotators. They can be used to generate high-fidelity supervised training data, as well as survey and experimental data. With the widespread adoption of LLMs, human gold-standard annotations are key to understanding the capabilities of LLMs and the validity of their results. However, crowdsourcing, an important, inexpensive way to obtain human annotations, may itself be impacted by LLMs, as crowd workers have financial incentives to use LLMs to increase their productivity and income. To investigate this concern, we conducted a case study on the prevalence of LLM usage by crowd workers. We reran an abstract summarization task from the literature on Amazon Mechanical Turk and, through a combination of keystroke detection and synthetic text classification, estimate that 33-46% of crowd workers used LLMs when completing the task. Although generalization to other, less LLM-friendly tasks is unclear, our results call for platforms, researchers, and crowd workers to find new ways to ensure that human data remain human, perhaps using the methodology proposed here as a stepping stone. Code/data: https://github.com/epfl-dlab/GPTurk

An Examination of LLM Utilization in Crowdsourcing Environments: Analysis and Implications

The paper "Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use LLMs for Text Production Tasks" rigorously investigates the increasing reliance on LLMs by crowd workers engaged in text production tasks, particularly on platforms such as Amazon Mechanical Turk (MTurk). Through a detailed case study, it reveals a significant prevalence of LLM usage among crowd workers, raising concerns about the integrity and authenticity of crowdsourced data that are intended to serve as a human gold standard.

Core Investigation and Methods

The authors conducted a case study centered on an abstract summarization task drawn from prior literature, in which workers summarize medical research abstracts. Using a combination of keystroke detection and synthetic text classification, the study quantitatively estimated that 33% to 46% of crowd worker submissions were generated with the help of LLMs.
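The combined estimate can be illustrated with a minimal sketch (not the authors' code): a submission is flagged as LLM-assisted when its keystroke log shows the text was pasted rather than typed, or when a synthetic-text detector's score exceeds a threshold. The field names and toy records below are hypothetical.

```python
# Sketch of prevalence estimation from two hypothetical signals:
# a keystroke-based paste flag and a synthetic-text detector score.

def estimate_llm_usage(submissions, threshold=0.5):
    """Return the fraction of submissions flagged as LLM-assisted:
    either pasted wholesale, or scored above the detector threshold."""
    flagged = sum(
        1 for s in submissions
        if s["pasted"] or s["detector_score"] > threshold
    )
    return flagged / len(submissions)

# Invented toy data for demonstration.
submissions = [
    {"pasted": True,  "detector_score": 0.91},
    {"pasted": False, "detector_score": 0.12},
    {"pasted": False, "detector_score": 0.78},
    {"pasted": False, "detector_score": 0.05},
]
print(estimate_llm_usage(submissions))  # 0.5 on this toy sample
```

In the paper, the two signals are analyzed jointly rather than naively OR-ed, which is why the reported prevalence is a range (33-46%) rather than a point estimate.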

Critical to their methodology was the development and fine-tuning of a bespoke model capable of distinguishing human-written from LLM-generated text. The model, trained on human text collected via MTurk and synthetic samples produced by ChatGPT, achieved high accuracy at both the summary level and the abstract level. This methodological rigor underscores the potential of LLM detectors tailored to a specific task, which may offer more accurate results than general-purpose detection tools.
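The idea of a task-specific detector can be sketched in miniature. The paper fine-tunes a neural classifier on real MTurk and ChatGPT data; the stand-in below is a tiny word-level naive Bayes scorer with invented training texts, meant only to show the train-then-score workflow, not the actual model.

```python
# Illustrative task-specific detector: word-level naive Bayes with
# add-one smoothing. Labels: 0 = human-written, 1 = synthetic.
import math
from collections import Counter

def train(texts, labels):
    counts = {0: Counter(), 1: Counter()}
    for text, label in zip(texts, labels):
        counts[label].update(text.lower().split())
    return counts

def score_synthetic(counts, text):
    """Log-odds that `text` is synthetic: > 0 leans synthetic, < 0 human."""
    vocab = set(counts[0]) | set(counts[1])
    totals = {c: sum(counts[c].values()) for c in (0, 1)}
    log_odds = 0.0
    for word in text.lower().split():
        p1 = (counts[1][word] + 1) / (totals[1] + len(vocab))
        p0 = (counts[0][word] + 1) / (totals[0] + len(vocab))
        log_odds += math.log(p1 / p0)
    return log_odds

# Invented toy training data: terse human notes vs. fluent LLM-style prose.
human = ["hard to read mixes results with methods",
         "small sample unclear conclusion imho"]
synthetic = ["this study presents a comprehensive analysis of outcomes",
             "the authors provide a comprehensive investigation of treatment"]
counts = train(human + synthetic, [0, 0, 1, 1])
print(score_synthetic(counts, "a comprehensive analysis of treatment outcomes") > 0)
```

The same contrast the paper exploits shows up even here: fluent, formulaic phrasing is distributionally distinctive once the detector is trained on text from the exact task being monitored.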

Results and Implications

The findings from this investigation are significant. They underscore the scope of LLM usage by workers on platforms like MTurk, which may compromise the intended human-centric nature of crowdsourced data. This has far-reaching implications for the validity of data used in research contexts, particularly when human judgment and interpretation are critical. Given the findings, the paper calls for new methodologies and systems to ensure the human origin of data, essential for various scientific and industrial applications.

In terms of broader implications, the paper also raises awareness about future trends as LLM use becomes increasingly normalized. The challenges posed by machine-generated data in educational and information ecosystems need to be addressed, with the potential degradation of LLMs trained recursively on machine-generated data highlighted as a noteworthy concern.

Potential Future Directions

This work opens multiple avenues for future research. One critical aspect is evaluating whether the findings related to text summarization extend to other task types, particularly those intrinsically resistant to LLM synthesis due to complexity or context specificity. Additionally, exploring the evolving interplay between human annotators and LLMs would offer valuable insights into optimizing collaborative data production processes.

In conclusion, this paper provides a comprehensive evaluation of a pressing issue within the space of AI-driven text production, with a robust methodological framework to support its claims. The implications of these findings are crucial for researchers relying on crowdsourced data and signal the necessity for evolving methodologies to adapt to the changing landscape of human and machine collaboration.

Authors (3)
  1. Veniamin Veselovsky (17 papers)
  2. Manoel Horta Ribeiro (44 papers)
  3. Robert West (154 papers)
Citations (110)