
Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks (2306.07899v1)

Published 13 Jun 2023 in cs.CL and cs.CY

Abstract: LLMs are remarkable data annotators. They can be used to generate high-fidelity supervised training data, as well as survey and experimental data. With the widespread adoption of LLMs, human gold-standard annotations are key to understanding the capabilities of LLMs and the validity of their results. However, crowdsourcing, an important, inexpensive way to obtain human annotations, may itself be impacted by LLMs, as crowd workers have financial incentives to use LLMs to increase their productivity and income. To investigate this concern, we conducted a case study on the prevalence of LLM usage by crowd workers. We reran an abstract summarization task from the literature on Amazon Mechanical Turk and, through a combination of keystroke detection and synthetic text classification, estimate that 33-46% of crowd workers used LLMs when completing the task. Although generalization to other, less LLM-friendly tasks is unclear, our results call for platforms, researchers, and crowd workers to find new ways to ensure that human data remain human, perhaps using the methodology proposed here as a stepping stone. Code/data: https://github.com/epfl-dlab/GPTurk

An Examination of LLM Utilization in Crowdsourcing Environments: Analysis and Implications

The paper "Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use LLMs for Text Production Tasks" rigorously investigates the increasing reliance on LLMs by crowd workers engaged in text production tasks, particularly on platforms such as Amazon Mechanical Turk (MTurk). Through a detailed case study, it reveals a significant prevalence of LLM usage among crowd workers, raising concerns about the integrity and authenticity of crowdsourced data that are intended to serve as a human gold standard.

Core Investigation and Methods

The authors conducted a case study centered on an abstract summarization task drawn from prior literature, in which workers summarize medical research abstracts. Using a combination of keystroke detection and synthetic text classification, the study quantitatively estimated that 33% to 46% of crowd worker submissions were generated with the help of LLMs.
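The combined estimate can be illustrated with a minimal sketch (not the authors' code): a submission is flagged as LLM-assisted when its keystroke log shows the text was pasted rather than typed, or when a synthetic-text detector's score exceeds a threshold. The field names and toy records below are hypothetical.

```python
# Sketch of prevalence estimation from two hypothetical signals:
# a keystroke-based paste flag and a synthetic-text detector score.

def estimate_llm_usage(submissions, threshold=0.5):
    """Return the fraction of submissions flagged as LLM-assisted:
    either pasted wholesale, or scored above the detector threshold."""
    flagged = sum(
        1 for s in submissions
        if s["pasted"] or s["detector_score"] > threshold
    )
    return flagged / len(submissions)

# Invented toy data for demonstration.
submissions = [
    {"pasted": True,  "detector_score": 0.91},
    {"pasted": False, "detector_score": 0.12},
    {"pasted": False, "detector_score": 0.78},
    {"pasted": False, "detector_score": 0.05},
]
print(estimate_llm_usage(submissions))  # 0.5 on this toy sample
```

In the paper, the two signals are analyzed jointly rather than naively OR-ed, which is why the reported prevalence is a range (33-46%) rather than a point estimate.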

Critical to their methodology was the development and fine-tuning of a bespoke model capable of distinguishing human-written from LLM-generated text. The model, trained on human text collected via MTurk and synthetic samples produced by ChatGPT, achieved high accuracy at both the summary level and the abstract level. This methodological rigor underscores the potential of LLM detectors tailored to a specific task, which may offer more accurate results than general-purpose detection tools.
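The idea of a task-specific detector can be sketched in miniature. The paper fine-tunes a neural classifier on real MTurk and ChatGPT data; the stand-in below is a tiny word-level naive Bayes scorer with invented training texts, meant only to show the train-then-score workflow, not the actual model.

```python
# Illustrative task-specific detector: word-level naive Bayes with
# add-one smoothing. Labels: 0 = human-written, 1 = synthetic.
import math
from collections import Counter

def train(texts, labels):
    counts = {0: Counter(), 1: Counter()}
    for text, label in zip(texts, labels):
        counts[label].update(text.lower().split())
    return counts

def score_synthetic(counts, text):
    """Log-odds that `text` is synthetic: > 0 leans synthetic, < 0 human."""
    vocab = set(counts[0]) | set(counts[1])
    totals = {c: sum(counts[c].values()) for c in (0, 1)}
    log_odds = 0.0
    for word in text.lower().split():
        p1 = (counts[1][word] + 1) / (totals[1] + len(vocab))
        p0 = (counts[0][word] + 1) / (totals[0] + len(vocab))
        log_odds += math.log(p1 / p0)
    return log_odds

# Invented toy training data: terse human notes vs. fluent LLM-style prose.
human = ["hard to read mixes results with methods",
         "small sample unclear conclusion imho"]
synthetic = ["this study presents a comprehensive analysis of outcomes",
             "the authors provide a comprehensive investigation of treatment"]
counts = train(human + synthetic, [0, 0, 1, 1])
print(score_synthetic(counts, "a comprehensive analysis of treatment outcomes") > 0)
```

The same contrast the paper exploits shows up even here: fluent, formulaic phrasing is distributionally distinctive once the detector is trained on text from the exact task being monitored.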

Results and Implications

The findings from this investigation are significant. They underscore the scope of LLM usage by workers on platforms like MTurk, which may compromise the intended human-centric nature of crowdsourced data. This has far-reaching implications for the validity of data used in research contexts, particularly when human judgment and interpretation are critical. Given the findings, the paper calls for new methodologies and systems to ensure the human origin of data, essential for various scientific and industrial applications.

In terms of broader implications, the paper also raises awareness about future trends as LLM use becomes increasingly normalized. The challenges posed by machine-generated data in educational and information ecosystems need to be addressed, with the potential degradation of LLMs trained recursively on machine-generated data highlighted as a noteworthy concern.

Potential Future Directions

This work opens multiple avenues for future research. One critical aspect is evaluating whether the findings related to text summarization extend to other task types, particularly those intrinsically resistant to LLM synthesis due to complexity or context specificity. Additionally, exploring the evolving interplay between human annotators and LLMs would offer valuable insights into optimizing collaborative data production processes.

In conclusion, this paper provides a comprehensive evaluation of a pressing issue within the space of AI-driven text production, with a robust methodological framework to support its claims. The implications of these findings are crucial for researchers relying on crowdsourced data and signal the necessity for evolving methodologies to adapt to the changing landscape of human and machine collaboration.

Authors (3)
  1. Veniamin Veselovsky (17 papers)
  2. Manoel Horta Ribeiro (44 papers)
  3. Robert West (154 papers)
Citations (110)