The Perspectivist Paradigm Shift: Assumptions and Challenges of Capturing Human Labels
Abstract: Longstanding data labeling practices in machine learning involve collecting and aggregating labels from multiple annotators. But what should we do when annotators disagree? Though annotator disagreement has long been seen as a problem to minimize, new perspectivist approaches challenge this assumption by treating disagreement as a valuable source of information. In this position paper, we examine practices and assumptions surrounding the causes of disagreement, some challenged by perspectivist approaches and some that remain to be addressed, as well as practical and normative challenges for work operating under these assumptions. We conclude with recommendations for the data labeling pipeline and avenues for future research engaging with subjectivity and disagreement.
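To make the contrast in the abstract concrete, here is a minimal Python sketch of the two treatments of annotator labels (the data, item names, and helper functions are hypothetical, not from the paper): conventional majority-vote aggregation collapses disagreement into a single "ground truth" label, while a perspectivist soft label keeps the full distribution of annotator judgments.

```python
from collections import Counter

# Hypothetical binary toxicity labels from five annotators per item
# (1 = toxic, 0 = not toxic).
annotations = {
    "item_1": [1, 1, 1, 0, 0],  # majority says toxic, but with dissent
    "item_2": [1, 0, 1, 0, 0],  # near-even split
}

def majority_vote(labels):
    """Conventional aggregation: collapse disagreement to one label."""
    return Counter(labels).most_common(1)[0][0]

def soft_label(labels):
    """Perspectivist alternative: keep the label distribution as the target."""
    counts = Counter(labels)
    return {label: count / len(labels) for label, count in counts.items()}

for item, labels in annotations.items():
    print(item, "hard:", majority_vote(labels), "soft:", soft_label(labels))
# item_1 hard: 1 soft: {1: 0.6, 0: 0.4}
# item_2 hard: 0 soft: {1: 0.4, 0: 0.6}
```

A model can then be trained against the soft distribution (for example, with a cross-entropy loss over the distribution rather than a one-hot target), so dissenting votes are preserved as signal instead of being discarded as noise.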