Annotation Sensitivity: Training Data Collection Methods Affect Model Performance (2311.14212v3)

Published 23 Nov 2023 in stat.ML, cs.CL, cs.LG, and stat.ME

Abstract: When training data are collected from human annotators, the design of the annotation instrument, the instructions given to annotators, the characteristics of the annotators, and their interactions can impact training data. This study demonstrates that design choices made when creating an annotation instrument also impact the models trained on the resulting annotations. We introduce the term annotation sensitivity to refer to the impact of annotation data collection methods on the annotations themselves and on downstream model performance and predictions. We collect annotations of hate speech and offensive language in five experimental conditions of an annotation instrument, randomly assigning annotators to conditions. We then fine-tune BERT models on each of the five resulting datasets and evaluate model performance on a holdout portion of each condition. We find considerable differences between the conditions for 1) the share of hate speech/offensive language annotations, 2) model performance, 3) model predictions, and 4) model learning curves. Our results emphasize the crucial role played by the annotation instrument, which has received little attention in the machine learning literature. We call for additional research into how and why the instrument impacts the annotations to inform the development of best practices in instrument design.
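The experimental design in the abstract amounts to a cross-condition evaluation: one BERT classifier fine-tuned per annotation condition, with each model scored on the holdout split of every condition. The sketch below illustrates that structure using Hugging Face Transformers and Datasets. It is not the authors' released code; the file names, three-way label coding, and hyperparameters are illustrative assumptions, and only the five-condition train/holdout cross-evaluation comes from the abstract.

```python
# Minimal sketch (assumptions noted inline) of the cross-condition design:
# fine-tune one BERT classifier per annotation condition, then evaluate every
# model on the holdout split of every condition.
import numpy as np
import pandas as pd
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

CONDITIONS = ["A", "B", "C", "D", "E"]   # five instrument designs (placeholder names)
NUM_LABELS = 3                           # assumed coding: 0 = hate speech, 1 = offensive, 2 = neither

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

def load_condition(name):
    """Assumed file layout: one CSV per condition with 'text', integer 'label',
    and a 'split' column marking 'train' vs. 'holdout' rows."""
    df = pd.read_csv(f"annotations_condition_{name}.csv")
    train = Dataset.from_pandas(df[df["split"] == "train"].reset_index(drop=True))
    holdout = Dataset.from_pandas(df[df["split"] == "holdout"].reset_index(drop=True))
    return train.map(tokenize, batched=True), holdout.map(tokenize, batched=True)

# Rows index the training condition, columns the evaluation condition.
accuracy = np.zeros((len(CONDITIONS), len(CONDITIONS)))

for i, train_cond in enumerate(CONDITIONS):
    train_ds, _ = load_condition(train_cond)
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=NUM_LABELS
    )
    args = TrainingArguments(
        output_dir=f"bert-cond-{train_cond}",
        num_train_epochs=3,                  # assumed; the paper also inspects learning curves
        per_device_train_batch_size=32,
        seed=42,
    )
    trainer = Trainer(model=model, args=args, train_dataset=train_ds)
    trainer.train()

    for j, eval_cond in enumerate(CONDITIONS):
        _, holdout_ds = load_condition(eval_cond)
        preds = trainer.predict(holdout_ds)
        accuracy[i, j] = (preds.predictions.argmax(-1) == preds.label_ids).mean()

print(accuracy)  # 5x5 accuracy matrix; spread across rows and columns reflects annotation sensitivity
```

Under this setup, comparing the diagonal of the matrix (train and evaluate within a condition) to the off-diagonal cells is one way to quantify how much the annotation instrument, rather than the texts themselves, drives downstream performance.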
