
The Ecological Fallacy in Annotation: Modelling Human Label Variation goes beyond Sociodemographics (2306.11559v1)

Published 20 Jun 2023 in cs.CL

Abstract: Many NLP tasks exhibit human label variation, where different annotators give different labels to the same texts. This variation is known to depend, at least in part, on the sociodemographics of annotators. Recent research aims to model individual annotator behaviour rather than predicting aggregated labels, and we would expect that sociodemographic information is useful for these models. On the other hand, the ecological fallacy states that aggregate group behaviour, such as the behaviour of the average female annotator, does not necessarily explain individual behaviour. To account for sociodemographics in models of individual annotator behaviour, we introduce group-specific layers to multi-annotator models. In a series of experiments for toxic content detection, we find that explicitly accounting for sociodemographic attributes in this way does not significantly improve model performance. This result shows that individual annotation behaviour depends on much more than just sociodemographics.
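The architecture described in the abstract can be sketched minimally: a shared text encoder feeds per-annotator classification heads, and a group-specific layer, selected by the annotator's sociodemographic group, is inserted between the two. The sketch below is illustrative only; the dimensions, the random embedding standing in for a pretrained encoder, the two-group mapping, and all function names are assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16          # encoder output dimension (hypothetical)
H = 8           # hidden size of the group-specific layer (hypothetical)
N_GROUPS = 2    # e.g. two sociodemographic groups (illustrative)
N_ANNOTATORS = 5

# Stand-in for the shared encoder's output for one text. In the paper
# this would come from a pretrained transformer; here it is random.
text_embedding = rng.normal(size=D)

# Group-specific layers: one weight matrix per sociodemographic group.
W_group = rng.normal(size=(N_GROUPS, H, D))
# Annotator-specific heads for a binary label (toxic / not toxic).
W_annotator = rng.normal(size=(N_ANNOTATORS, 2, H))

# Illustrative mapping from annotator id to sociodemographic group.
annotator_to_group = np.array([0, 0, 1, 1, 0])

def predict(annotator_id: int, x: np.ndarray) -> np.ndarray:
    """Route the shared embedding through the annotator's group layer,
    then through that annotator's own head; return label probabilities."""
    g = annotator_to_group[annotator_id]
    h = np.tanh(W_group[g] @ x)             # group-specific layer
    logits = W_annotator[annotator_id] @ h  # annotator-specific head
    exp = np.exp(logits - logits.max())     # stable softmax
    return exp / exp.sum()

probs = predict(0, text_embedding)
print(probs)
```

The paper's finding is that conditioning on the group in this way (the `W_group` lookup) does not significantly improve over purely annotator-specific modelling, since individual labelling behaviour is not well explained by group membership alone.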

Authors (4)
  1. Matthias Orlikowski (3 papers)
  2. Paul Röttger (37 papers)
  3. Philipp Cimiano (25 papers)
  4. Dirk Hovy (57 papers)
Citations (24)