
Resolving the Human Subjects Status of Machine Learning's Crowdworkers (2206.04039v2)

Published 8 Jun 2022 in cs.CY, cs.AI, cs.CL, cs.LG, and stat.ML

Abstract: In recent years, machine learning (ML) has relied heavily on crowdworkers both for building datasets and for addressing research questions requiring human interaction or judgment. The diversity of the tasks performed and of the uses of the resulting data makes it difficult to determine when crowdworkers are best thought of as workers (versus human subjects). These difficulties are compounded by conflicting policies, with some institutions and researchers regarding all ML crowdworkers as human subjects and others holding that they rarely constitute human subjects. Notably, few ML papers involving crowdwork mention IRB oversight, raising the prospect of non-compliance with ethical and regulatory requirements. We investigate the appropriate designation of ML crowdsourcing studies, focusing our inquiry on natural language processing to expose unique challenges for research oversight. Crucially, under the U.S. Common Rule, these judgments hinge on determinations of aboutness, concerning both whom (or what) the collected data is about and whom (or what) the analysis is about. We highlight two challenges posed by ML: the same set of workers can serve multiple roles and provide many sorts of information; and ML research tends to embrace a dynamic workflow, where research questions are seldom stated ex ante and data sharing opens the door for future studies to aim questions at different targets. Our analysis exposes a potential loophole in the Common Rule, whereby researchers can elude research ethics oversight by splitting data collection and analysis into distinct studies. Finally, we offer several policy recommendations to address these concerns.

Authors (3)
  1. Divyansh Kaushik (8 papers)
  2. Alex John London (2 papers)
  3. Zachary C. Lipton (137 papers)
Citations (2)
