LabelAId: Just-in-time AI Interventions for Improving Human Labeling Quality and Domain Knowledge in Crowdsourcing Systems (2403.09810v1)
Abstract: Crowdsourcing platforms have transformed distributed problem-solving, yet quality control remains a persistent challenge. Traditional quality control measures, such as prescreening workers and refining instructions, often focus solely on optimizing economic output. This paper explores just-in-time AI interventions that enhance both labeling quality and domain-specific knowledge among crowdworkers. We introduce LabelAId, an inference model that combines Programmatic Weak Supervision (PWS) with FT-Transformers to infer label correctness from user behavior and domain knowledge. Our technical evaluation shows that the LabelAId pipeline consistently outperforms state-of-the-art ML baselines, improving mistake inference accuracy by 36.7% with 50 downstream samples. We then integrated LabelAId into Project Sidewalk, an open-source crowdsourcing platform for urban accessibility. A between-subjects study with 34 participants demonstrates that LabelAId significantly enhances label precision without compromising efficiency, while also increasing labeler confidence. We discuss LabelAId's success factors, limitations, and its generalizability to other crowdsourced science domains.
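To make the PWS component of the abstract concrete, here is a minimal sketch of how programmatic weak supervision can infer label correctness from behavioral signals: heuristic labeling functions vote on whether a crowdsourced label is likely a mistake, and their votes are aggregated into weak training labels for a downstream classifier (an FT-Transformer in LabelAId's case). The feature names (`time_on_label`, `zoom_level`, `severity`) and the simple majority-vote aggregator are illustrative assumptions, not the paper's actual labeling functions or label model.

```python
# Hedged sketch of Programmatic Weak Supervision (PWS) over crowdworker
# behavior. All feature names and thresholds below are hypothetical.

ABSTAIN, CORRECT, MISTAKE = 0, 1, -1

def lf_fast_label(x):
    # Very quick labeling often signals a careless mistake.
    return MISTAKE if x["time_on_label"] < 1.0 else ABSTAIN

def lf_zoomed_in(x):
    # Workers who zoom in tend to inspect the scene and label accurately.
    return CORRECT if x["zoom_level"] >= 2 else ABSTAIN

def lf_no_severity(x):
    # Domain rule: a missing severity rating suggests an incomplete label.
    return MISTAKE if x["severity"] is None else ABSTAIN

LABELING_FUNCTIONS = [lf_fast_label, lf_zoomed_in, lf_no_severity]

def weak_label(x):
    """Aggregate labeling-function votes by simple majority, ignoring abstains."""
    votes = [lf(x) for lf in LABELING_FUNCTIONS if lf(x) != ABSTAIN]
    if not votes:
        return ABSTAIN
    score = sum(votes)
    if score > 0:
        return CORRECT
    if score < 0:
        return MISTAKE
    return ABSTAIN  # tie between conflicting functions

# Illustrative interaction traces for three crowdsourced labels.
examples = [
    {"time_on_label": 0.5, "zoom_level": 0, "severity": None},  # two MISTAKE votes
    {"time_on_label": 4.2, "zoom_level": 3, "severity": 3},     # one CORRECT vote
    {"time_on_label": 0.8, "zoom_level": 2, "severity": 2},     # conflicting votes
]

print([weak_label(x) for x in examples])  # → [-1, 1, 0]
```

In a full PWS pipeline (e.g., Snorkel-style data programming), the naive majority vote would be replaced by a generative label model that weights labeling functions by estimated accuracy; the resulting probabilistic labels then train the downstream classifier.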