The Effectiveness of LLMs as Annotators: A Comparative Overview and Empirical Analysis of Direct Representation (2405.01299v1)

Published 2 May 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs have emerged as powerful support tools across various natural language tasks and a range of application domains. Recent studies explore their capabilities for data annotation. This paper provides a comparative overview of twelve studies investigating the potential of LLMs for labelling data. While the models offer promising cost and time savings, they exhibit considerable limitations, such as limited representativeness, bias, sensitivity to prompt variations, and a preference for English. Leveraging insights from these studies, our empirical analysis further examines the alignment between human and GPT-generated opinion distributions across four subjective datasets. In contrast to studies that examine representation indirectly, our methodology obtains the opinion distribution directly from GPT. Our analysis thereby supports the minority of studies that consider diverse perspectives when evaluating data annotation tasks and highlights the need for further research in this direction.
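To make the distribution comparison concrete, here is a minimal sketch of the general idea: build an empirical opinion distribution from per-annotator human labels, elicit a distribution from the model directly, and measure how far apart the two are. The total variation distance, the prompt framing described in the comments, and the example data are illustrative assumptions, not the paper's exact metric or experimental setup.

```python
import numpy as np

def human_distribution(labels, n_classes):
    """Empirical opinion distribution from per-annotator labels."""
    counts = np.bincount(labels, minlength=n_classes)
    return counts / counts.sum()

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()

# Hypothetical example: five annotators label one item as offensive (1)
# or not (0), and the model is prompted to report a distribution directly,
# e.g. "What percentage of people would label this text as offensive?"
human = human_distribution(np.array([1, 1, 0, 1, 0]), n_classes=2)  # [0.4, 0.6]
gpt = np.array([0.3, 0.7])  # distribution elicited from the model (illustrative)

print(f"TV distance: {total_variation(human, gpt):.2f}")  # 0.10
```

A distance of 0 would mean the model's reported distribution exactly matches the spread of human opinions; larger values indicate that the model under- or over-represents some annotator viewpoints.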

Authors (2)
  1. Maja Pavlovic (3 papers)
  2. Massimo Poesio (27 papers)
Citations (11)