If in a Crowdsourced Data Annotation Pipeline, a GPT-4 (2402.16795v2)

Published 26 Feb 2024 in cs.HC, cs.AI, cs.CL, and cs.LG

Abstract: Recent studies indicated GPT-4 outperforms online crowd workers in data labeling accuracy, notably workers from Amazon Mechanical Turk (MTurk). However, these studies were criticized for deviating from standard crowdsourcing practices and emphasizing individual workers' performances over the whole data-annotation process. This paper compared GPT-4 and an ethical and well-executed MTurk pipeline, with 415 workers labeling 3,177 sentence segments from 200 scholarly articles using the CODA-19 scheme. Two worker interfaces yielded 127,080 labels, which were then used to infer the final labels through eight label-aggregation algorithms. Our evaluation showed that despite best practices, the MTurk pipeline's highest accuracy was 81.5%, whereas GPT-4 achieved 83.6%. Interestingly, when combining GPT-4's labels with crowd labels collected via an advanced worker interface for aggregation, 2 out of the 8 algorithms achieved an even higher accuracy (87.5%, 87.0%). Further analysis suggested that, when the crowd's and GPT-4's labeling strengths are complementary, aggregating them could increase labeling accuracy.

An Evaluation of GPT-4 and Crowdsourced Data Annotation Practices

The paper "If in a Crowdsourced Data Annotation Pipeline, a GPT-4" presents a comprehensive paper comparing the performances of GPT-4 and crowdsourcing pipelines in data labeling tasks. The specific focus of the paper was on data labeled using the CODA-19 scheme, categorizing sentence segments in scholarly articles into predefined classes such as Background, Purpose, Method, Finding/Contribution, and Other.

This research adds significant value to the existing literature by assessing the robustness of GPT-4 against crowd workers recruited through the Amazon Mechanical Turk (MTurk) platform. Drawing on 415 MTurk workers and both a basic and an advanced worker interface, the paper examines the efficiency and quality of human crowd annotation against GPT-4's zero-shot labeling.

The numerical results illuminate several essential points. Despite a well-designed crowdsourcing pipeline, the MTurk workers reached a highest annotation accuracy of 81.5%, compared with 83.6% for GPT-4 on its own, indicating GPT-4's stronger standalone labeling accuracy. An intriguing observation emerged, however, when GPT-4's outputs were aggregated with crowd labels using algorithms such as One-Coin Dawid-Skene: accuracy improved to 87.5%, suggesting that a hybrid model harnessing both human and GPT-4 annotations can surpass either on its own.
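In practice, this hybrid configuration can be approximated by treating GPT-4 as one additional annotator and pooling its labels with the crowd's before aggregation. Below is a minimal sketch assuming the open-source crowd-kit toolkit's DawidSkene aggregator and its task/worker/label input schema; the data values, worker ids, and iteration count are illustrative, and the paper itself evaluates eight aggregation algorithms rather than this one alone.

```python
# Sketch: aggregate crowd labels together with GPT-4's labels by treating
# GPT-4 as one extra worker. Assumes the open-source crowd-kit toolkit;
# the rows below are placeholders, not the paper's data.
import pandas as pd
from crowdkit.aggregation import DawidSkene

# Long-format label table: one row per (task, worker, label) judgment.
crowd_labels = pd.DataFrame(
    [
        {"task": "seg_001", "worker": "w_17", "label": "Background"},
        {"task": "seg_001", "worker": "w_42", "label": "Purpose"},
        {"task": "seg_001", "worker": "w_98", "label": "Background"},
        # ... many more rows in a real pipeline
    ]
)

gpt4_labels = pd.DataFrame(
    [
        {"task": "seg_001", "worker": "gpt-4", "label": "Background"},
        # ... one GPT-4 label per segment
    ]
)

# Pool the two sources and infer one final label per segment with
# Dawid-Skene EM (the paper also evaluates a one-coin variant).
pooled = pd.concat([crowd_labels, gpt4_labels], ignore_index=True)
final_labels = DawidSkene(n_iter=100).fit_predict(pooled)
print(final_labels.head())
```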

The paper's findings underscore the strengths and weaknesses of the different approaches to data annotation. While GPT-4 outperformed human annotators in most instances, effectively aggregated crowd labels complemented GPT-4's outputs, especially for nuanced classes such as Finding/Contribution, which GPT-4 found challenging.
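A simple way to check whether two label sources are complementary is to compare their per-class performance against a gold-labeled subset; classes where one source is weak and the other strong are where aggregation can pay off. The sketch below uses scikit-learn's classification_report; the label lists are placeholders, not the paper's data.

```python
# Sketch: compare per-class performance of crowd-aggregated labels and
# GPT-4 labels against expert ("gold") labels to spot complementary strengths.
from sklearn.metrics import classification_report

gold = ["Background", "Method", "Finding/Contribution", "Purpose"]
crowd_final = ["Background", "Method", "Finding/Contribution", "Other"]
gpt4_pred = ["Background", "Method", "Method", "Purpose"]

print("Crowd pipeline vs. gold:")
print(classification_report(gold, crowd_final, zero_division=0))

print("GPT-4 vs. gold:")
print(classification_report(gold, gpt4_pred, zero_division=0))
```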

The implications of these results extend into broader discussions about the future of data annotation practices. The paper suggests that although LLMs like GPT-4 can handle many annotation tasks, human annotation retains irreplaceable value, especially for complex, nuanced tasks and context-dependent interpretations of text. A collaborative approach that leverages both GPT-4 and human input could become a best practice for designing efficient and accurate data-annotation pipelines.

The paper points to areas for future exploration, including optimizing prompts for specific labeling tasks and testing applicability in domains beyond biomedical literature. It also raises questions about how LLM outputs should be integrated into larger human-computation frameworks and about the evolving role of humans in refining, contextualizing, and enhancing AI-generated outputs.

In conclusion, the paper provides a detailed and insightful exploration of GPT-4's capabilities relative to traditional crowdsourcing practices. It opens pathways for developing smarter, hybrid approaches that exploit AI's efficiency while preserving the value of human judgment in the data-labeling ecosystem, and it invites further research into collaborative strategies, interface design, and better use of the complementary strengths of humans and AI in data annotation.

Authors (5)
  1. Zeyu He (11 papers)
  2. Chieh-Yang Huang (24 papers)
  3. Chien-Kuang Cornelia Ding (2 papers)
  4. Shaurya Rohatgi (10 papers)
  5. Ting-Hao 'Kenneth' Huang (42 papers)