If in a Crowdsourced Data Annotation Pipeline, a GPT-4 (2402.16795v2)

Published 26 Feb 2024 in cs.HC, cs.AI, cs.CL, and cs.LG

Abstract: Recent studies indicated GPT-4 outperforms online crowd workers in data labeling accuracy, notably workers from Amazon Mechanical Turk (MTurk). However, these studies were criticized for deviating from standard crowdsourcing practices and emphasizing individual workers' performances over the whole data-annotation process. This paper compared GPT-4 and an ethical and well-executed MTurk pipeline, with 415 workers labeling 3,177 sentence segments from 200 scholarly articles using the CODA-19 scheme. Two worker interfaces yielded 127,080 labels, which were then used to infer the final labels through eight label-aggregation algorithms. Our evaluation showed that despite best practices, the MTurk pipeline's highest accuracy was 81.5%, whereas GPT-4 achieved 83.6%. Interestingly, when combining GPT-4's labels with crowd labels collected via an advanced worker interface for aggregation, 2 out of the 8 algorithms achieved an even higher accuracy (87.5%, 87.0%). Further analysis suggested that, when the crowd's and GPT-4's labeling strengths are complementary, aggregating them could increase labeling accuracy.

An Evaluation of GPT-4 and Crowdsourced Data Annotation Practices

The paper "If in a Crowdsourced Data Annotation Pipeline, a GPT-4" presents a comprehensive paper comparing the performances of GPT-4 and crowdsourcing pipelines in data labeling tasks. The specific focus of the paper was on data labeled using the CODA-19 scheme, categorizing sentence segments in scholarly articles into predefined classes such as Background, Purpose, Method, Finding/Contribution, and Other.

This research adds significant value to the existing literature by assessing the robustness of GPT-4 against crowd workers recruited through the Amazon Mechanical Turk (MTurk) platform. Drawing on 415 MTurk workers and both a basic and an advanced worker interface, the paper examines the efficiency and quality of human crowd annotation against GPT-4's zero-shot labeling.

The numerical results illuminate several essential points. Despite a well-designed crowdsourcing pipeline, the MTurk workers reached a highest annotation accuracy of 81.5%, compared with 83.6% for GPT-4 on its own, indicating GPT-4's stronger standalone labeling accuracy. An intriguing observation emerged, however, when GPT-4's outputs were aggregated with crowd labels using algorithms such as One-Coin Dawid-Skene: accuracy improved to 87.5%, suggesting that a hybrid model harnessing both human and GPT-4 annotations can surpass either on its own.
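In practice, this hybrid configuration can be approximated by treating GPT-4 as one additional annotator and pooling its labels with the crowd's before aggregation. Below is a minimal sketch assuming the open-source crowd-kit toolkit's DawidSkene aggregator and its task/worker/label input schema; the data values, worker ids, and iteration count are illustrative, and the paper itself evaluates eight aggregation algorithms rather than this one alone.

```python
# Sketch: aggregate crowd labels together with GPT-4's labels by treating
# GPT-4 as one extra worker. Assumes the open-source crowd-kit toolkit;
# the rows below are placeholders, not the paper's data.
import pandas as pd
from crowdkit.aggregation import DawidSkene

# Long-format label table: one row per (task, worker, label) judgment.
crowd_labels = pd.DataFrame(
    [
        {"task": "seg_001", "worker": "w_17", "label": "Background"},
        {"task": "seg_001", "worker": "w_42", "label": "Purpose"},
        {"task": "seg_001", "worker": "w_98", "label": "Background"},
        # ... many more rows in a real pipeline
    ]
)

gpt4_labels = pd.DataFrame(
    [
        {"task": "seg_001", "worker": "gpt-4", "label": "Background"},
        # ... one GPT-4 label per segment
    ]
)

# Pool the two sources and infer one final label per segment with
# Dawid-Skene EM (the paper also evaluates a one-coin variant).
pooled = pd.concat([crowd_labels, gpt4_labels], ignore_index=True)
final_labels = DawidSkene(n_iter=100).fit_predict(pooled)
print(final_labels.head())
```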

The paper's findings underscore the strengths and weaknesses of the different approaches to data annotation. While GPT-4 outperformed human annotators in most instances, effectively aggregated crowd labels complemented GPT-4's outputs, especially for nuanced classes such as Finding/Contribution, which GPT-4 found challenging.
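A simple way to check whether two label sources are complementary is to compare their per-class performance against a gold-labeled subset; classes where one source is weak and the other strong are where aggregation can pay off. The sketch below uses scikit-learn's classification_report; the label lists are placeholders, not the paper's data.

```python
# Sketch: compare per-class performance of crowd-aggregated labels and
# GPT-4 labels against expert ("gold") labels to spot complementary strengths.
from sklearn.metrics import classification_report

gold = ["Background", "Method", "Finding/Contribution", "Purpose"]
crowd_final = ["Background", "Method", "Finding/Contribution", "Other"]
gpt4_pred = ["Background", "Method", "Method", "Purpose"]

print("Crowd pipeline vs. gold:")
print(classification_report(gold, crowd_final, zero_division=0))

print("GPT-4 vs. gold:")
print(classification_report(gold, gpt4_pred, zero_division=0))
```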

The implications of these results extend into broader discussions about the future of data annotation practices. The paper suggests that although LLMs like GPT-4 can handle many annotation tasks, human annotation retains irreplaceable value, especially for complex, nuanced tasks and context-dependent interpretations of text. A collaborative approach that leverages both GPT-4 and human input could become a best practice for designing efficient and accurate data-annotation pipelines.

The paper points to areas for future exploration, including optimizing prompts for specific labeling tasks and testing applicability in domains beyond biomedical literature. It also raises questions about how LLM outputs should be integrated into larger human-computation frameworks and about the evolving role of humans in refining, contextualizing, and enhancing AI-generated outputs.

In conclusion, the paper provides a detailed and insightful exploration of GPT-4's capabilities relative to traditional crowdsourcing practices. It opens pathways for developing smarter, hybrid approaches that exploit AI's efficiency while preserving the value of human judgment in the data-labeling ecosystem, and it invites further research into collaborative strategies, interface design, and better use of the complementary strengths of humans and AI in data annotation.

Authors (5)
  1. Zeyu He (11 papers)
  2. Chieh-Yang Huang (24 papers)
  3. Chien-Kuang Cornelia Ding (2 papers)
  4. Shaurya Rohatgi (10 papers)
  5. Ting-Hao 'Kenneth' Huang (42 papers)