Position: Insights from Survey Methodology can Improve Training Data (2403.01208v2)
Abstract: Whether future AI models are fair, trustworthy, and aligned with the public's interests rests in part on our ability to collect accurate data about what we want the models to do. However, collecting high-quality data is difficult, and few AI/ML researchers are trained in data collection methods. Recent research in data-centric AI has shown that higher-quality training data leads to better-performing models, making this the right moment to introduce AI/ML researchers to the field of survey methodology, the science of data collection. We summarize insights from the survey methodology literature and discuss how they can improve the quality of training and feedback data. We also suggest collaborative research directions on how biases in data collection can be mitigated, making models more accurate and human-centric.