A Comparative Study on Annotation Quality of Crowdsourcing and LLM via Label Aggregation (2401.09760v1)
Abstract: Whether LLMs can outperform crowdsourcing on data annotation tasks has recently attracted interest. Some studies have examined this question by collecting new datasets and comparing the average performance of individual crowd workers and LLM workers on specific NLP tasks. However, on the one hand, existing datasets created for studying annotation quality in crowdsourcing have not yet been used in such evaluations, even though they could provide reliable evaluations from a different viewpoint. On the other hand, the quality of aggregated labels is crucial: when crowdsourcing is used in practice, the labels that are finally collected are those estimated by aggregating multiple crowd labels for the same instances. Therefore, in this paper, we first investigate which existing crowdsourcing datasets can be used for a comparative study and create a benchmark. We then compare the quality of individual crowd labels and LLM labels, and evaluate the aggregated labels. In addition, we propose a Crowd-LLM hybrid label aggregation method and verify its performance. We find that adding labels from good LLMs to existing crowdsourcing datasets enhances the quality of the datasets' aggregated labels, which also exceeds the quality of the LLM labels themselves.
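The abstract does not spell out the aggregation model used in the Crowd-LLM hybrid method, so the sketch below only illustrates the general idea under a simplifying assumption: plain majority voting over a label matrix, with the LLM treated as one additional worker whose labels are appended to the existing crowd labels. The function and variable names (`aggregate_majority`, `crowd_labels`, `llm_labels`) and the toy data are illustrative assumptions, not the paper's actual method or results.

```python
from collections import Counter

def aggregate_majority(labels_for_instance):
    """Return the majority-vote label for one instance (assumed aggregator)."""
    return Counter(labels_for_instance).most_common(1)[0][0]

# Hypothetical toy data: three crowd workers label four instances.
crowd_labels = [
    ["pos", "neg", "pos"],  # instance 0
    ["neg", "neg", "pos"],  # instance 1
    ["pos", "pos", "pos"],  # instance 2
    ["neg", "pos", "neg"],  # instance 3
]

# Hypothetical LLM labels for the same four instances.
llm_labels = ["pos", "neg", "pos", "neg"]

# Crowd-only aggregation: aggregate each instance's crowd labels.
crowd_only = [aggregate_majority(ls) for ls in crowd_labels]

# Crowd-LLM hybrid aggregation (sketch): append the LLM label as one more
# "worker" before aggregating, following the paper's idea of adding LLM
# labels to an existing crowdsourcing dataset.
hybrid = [aggregate_majority(ls + [llm]) for ls, llm in zip(crowd_labels, llm_labels)]

print(crowd_only)  # ['pos', 'neg', 'pos', 'neg']
print(hybrid)      # ['pos', 'neg', 'pos', 'neg']
```

With simple majority voting and an odd number of crowd workers, a single extra LLM label mainly matters for tied or noisy instances; the paper's reported quality gains concern its own hybrid aggregation method evaluated on the constructed benchmark, which this toy sketch does not reproduce.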