Wikibench: Community-Driven Data Curation for AI Evaluation on Wikipedia (2402.14147v1)
Abstract: AI tools are increasingly deployed in community contexts. However, datasets used to evaluate AI are typically created by developers and annotators outside a given community, which can yield misleading conclusions about AI performance. How might we empower communities to drive the intentional design and curation of evaluation datasets for AI that impacts them? We investigate this question on Wikipedia, an online community with multiple AI-based content moderation tools deployed. We introduce Wikibench, a system that enables communities to collaboratively curate AI evaluation datasets, while navigating ambiguities and differences in perspective through discussion. A field study on Wikipedia shows that datasets curated using Wikibench can effectively capture community consensus, disagreement, and uncertainty. Furthermore, study participants used Wikibench to shape the overall data curation process, including refining label definitions, determining data inclusion criteria, and authoring data statements. Based on our findings, we propose future directions for systems that support community-driven data curation.