Corpus Considerations for Annotator Modeling and Scaling (2404.02340v2)

Published 2 Apr 2024 in cs.CL

Abstract: Recent trends in natural language processing research and annotation tasks affirm a paradigm shift from the traditional reliance on a single ground truth to a focus on individual perspectives, particularly in subjective tasks. In scenarios where annotation tasks are meant to encompass diversity, models that solely rely on the majority class labels may inadvertently disregard valuable minority perspectives. This oversight could result in the omission of crucial information and, in a broader context, risk disrupting the balance within larger ecosystems. As the landscape of annotator modeling unfolds with diverse representation techniques, it becomes imperative to investigate their effectiveness with the fine-grained features of the datasets in view. This study systematically explores various annotator modeling techniques and compares their performance across seven corpora. Our findings show that the commonly used user token model consistently outperforms more complex models. We introduce a composite embedding approach and show distinct differences in which model performs best as a function of annotator agreement within a given dataset. Our findings shed light on the relationship between corpus statistics and annotator modeling performance, which informs future work on corpus construction and perspectivist NLP.
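The "user token" annotator model highlighted in the abstract is commonly implemented by giving each annotator a dedicated token that is prepended to the input text, so a standard encoder can condition its prediction on who is labeling. The sketch below illustrates this idea with Hugging Face Transformers; it is a minimal, assumed implementation, not the authors' code, and the backbone name, annotator IDs, and the encode helper are illustrative choices.

```python
# Minimal sketch of a user-token annotator model (assumed implementation).
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "bert-base-uncased"             # assumed BERT-style backbone
ANNOTATOR_IDS = ["ann_0", "ann_1", "ann_2"]  # hypothetical annotator identifiers

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Register one special token per annotator and grow the embedding matrix to match.
user_tokens = [f"[{a}]" for a in ANNOTATOR_IDS]
tokenizer.add_tokens(user_tokens, special_tokens=True)
model.resize_token_embeddings(len(tokenizer))

def encode(text: str, annotator: str):
    """Prepend the annotator's token so the model predicts that annotator's label."""
    return tokenizer(f"[{annotator}] {text}", return_tensors="pt", truncation=True)

# Example: the same text can receive different predictions for different annotators.
inputs = encode("This comment is borderline offensive.", "ann_1")
logits = model(**inputs).logits
```

A composite embedding approach, as named in the abstract, would instead combine annotator-specific vectors with the text representation (e.g., concatenating a learned annotator embedding with the sentence embedding before classification); the exact construction used in the paper is not reproduced here.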

Authors (6)
  1. Olufunke O. Sarumi
  2. Béla Neuendorf
  3. Joan Plepi
  4. Lucie Flek
  5. Jörg Schlötterer
  6. Charles Welch
