
Wikibench: Community-Driven Data Curation for AI Evaluation on Wikipedia (2402.14147v1)

Published 21 Feb 2024 in cs.HC and cs.AI

Abstract: AI tools are increasingly deployed in community contexts. However, datasets used to evaluate AI are typically created by developers and annotators outside a given community, which can yield misleading conclusions about AI performance. How might we empower communities to drive the intentional design and curation of evaluation datasets for AI that impacts them? We investigate this question on Wikipedia, an online community with multiple AI-based content moderation tools deployed. We introduce Wikibench, a system that enables communities to collaboratively curate AI evaluation datasets, while navigating ambiguities and differences in perspective through discussion. A field study on Wikipedia shows that datasets curated using Wikibench can effectively capture community consensus, disagreement, and uncertainty. Furthermore, study participants used Wikibench to shape the overall data curation process, including refining label definitions, determining data inclusion criteria, and authoring data statements. Based on our findings, we propose future directions for systems that support community-driven data curation.


Summary

  • The paper presents a novel approach where community members curate AI evaluation datasets, ensuring diverse and representative inputs.
  • It demonstrates how comparative analysis of AI models can reveal misalignments with community norms and values.
  • The study advocates for scalable community-driven curation methods to enhance the design and evaluation of AI tools.

Empowering Communities in AI Evaluation: The Wikibench Study on Wikipedia

The Advent of Wikibench

The paper addresses a prevalent issue in the deployment and evaluation of AI tools within community contexts, focusing on Wikipedia's AI-based content moderation tools. It introduces Wikibench, a novel system designed to enable community-driven curation of AI evaluation datasets. Wikibench supports collaborative curation by letting community members select data points, label them based on their own judgment, and discuss disagreements to arrive at consensus labels. This approach contrasts with traditional dataset creation, in which developers and annotators outside the community produce labels that may not represent community consensus or capture the diversity of perspectives within it.
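
To make that workflow concrete, here is a minimal Python sketch of the curation loop, assuming a simple majority rule with ties deferred to discussion. The class, field names, and tie-handling policy are illustrative assumptions, not Wikibench's actual implementation.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class DataPoint:
    """One candidate item (e.g., a Wikipedia edit) under community curation."""
    item_id: str
    labels: dict[str, str] = field(default_factory=dict)   # editor -> individual label
    discussion: list[str] = field(default_factory=list)    # free-text discussion thread

    def add_label(self, editor: str, label: str) -> None:
        self.labels[editor] = label

    def consensus(self) -> str | None:
        """Majority label, or None when empty or tied (i.e., needs discussion)."""
        if not self.labels:
            return None
        top = Counter(self.labels.values()).most_common(2)
        if len(top) > 1 and top[0][1] == top[1][1]:
            return None  # ties are resolved through discussion, not by overriding voters
        return top[0][0]

# Three editors label the same edit; the majority label becomes the consensus label.
point = DataPoint("edit:12345")
for editor, label in [("A", "damaging"), ("B", "damaging"), ("C", "not damaging")]:
    point.add_label(editor, label)
print(point.consensus())  # -> "damaging"
```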

Evaluating Wikibench's Effectiveness

A field study on Wikipedia used Wikibench to curate datasets for AI evaluation, revealing several critical insights:

  • Community Consensus and Diverse Perspectives: Datasets curated through Wikibench effectively captured community consensus, disagreement, and uncertainty, suggesting that community-driven curation can surface the nuanced perspectives that traditional dataset creation methods tend to flatten.
  • Practical Implications for AI Evaluation: By comparing two AI models deployed on Wikipedia, the study showed how Wikibench-curated datasets can reveal misalignments between community perspectives and model predictions, underscoring the value of community-driven evaluation datasets; a minimal sketch of this kind of evaluation follows.
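
To illustrate how a curated dataset can separate consensus from contested cases during model evaluation, here is a minimal sketch. It is not the paper's actual analysis pipeline: the agreement threshold, metric names, and data layout are assumptions chosen for illustration.

```python
from collections import Counter

def agreement(labels: list[str]) -> float:
    """Fraction of labelers who chose the modal label (1.0 = unanimous)."""
    return max(Counter(labels).values()) / len(labels)

def evaluate(model_preds: dict[str, str],
             curated: dict[str, list[str]],
             threshold: float = 0.75) -> dict[str, float]:
    """Score a model against consensus labels, while reporting the share of
    items too contested to score: disagreement is signal, not noise."""
    correct = scored = contested = 0
    for item_id, labels in curated.items():
        if agreement(labels) < threshold:
            contested += 1  # report contested items instead of averaging them away
            continue
        consensus = Counter(labels).most_common(1)[0][0]
        scored += 1
        correct += int(model_preds.get(item_id) == consensus)
    return {
        "accuracy_on_consensus": correct / max(scored, 1),
        "contested_fraction": contested / len(curated),
    }

# Hypothetical usage: one model evaluated against community-curated labels.
curated = {
    "edit:1": ["damaging", "damaging", "damaging"],      # clear consensus
    "edit:2": ["damaging", "not damaging", "damaging"],  # contested (2/3 < 0.75)
}
print(evaluate({"edit:1": "damaging", "edit:2": "damaging"}, curated))
```

Running the same evaluation for two deployed models would show not only which one better matches consensus labels, but also how much of the dataset the community itself considers contested.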

Future Directions in Community-driven Data Curation

The findings from this paper suggest several promising directions for advancing community-driven data curation and AI evaluation:

  • Adapting Wikibench Across Different Communities: Given Wikibench's success on Wikipedia, future research could explore adapting it to other community contexts, tailoring its design to the norms and workflows specific to each community.
  • Enhancing Efficiency and Representativeness: Additional studies could focus on balancing community agency in the curation process against the efficiency of data collection and the representativeness of the resulting dataset. This may involve methods that guide communities toward desired distributional properties for their datasets, as sketched after this list.
  • Community-facing Evaluation Interfaces: There is a need for interfaces that enable communities to use their curated datasets for informed decision-making about AI design and deployment, supporting more nuanced analyses of how AI models align with community perspectives.
  • Leveraging Content Curation Mechanisms: Wikibench and similar systems could draw further inspiration from existing content curation mechanisms on online platforms, for example by prioritizing data points for curation based on community-shared visions and values.
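
As one illustration of the point about distributional properties, the sketch below assumes a system could recommend which uncurated items to surface for labeling next, using stratified sampling toward community-chosen target shares. The function name, stratification scheme, and targets are hypothetical, not a feature of Wikibench.

```python
import random

def curation_queue(candidates: dict[str, str],
                   targets: dict[str, float],
                   n: int,
                   seed: int = 0) -> list[str]:
    """Suggest up to n uncurated items to surface next, so that the labeled
    set drifts toward a target distribution over strata (e.g., edit types)."""
    rng = random.Random(seed)
    by_stratum: dict[str, list[str]] = {}
    for item_id, stratum in candidates.items():
        by_stratum.setdefault(stratum, []).append(item_id)
    queue: list[str] = []
    for stratum, share in targets.items():
        pool = by_stratum.get(stratum, [])
        k = min(len(pool), round(share * n))  # cap by what is actually available
        queue.extend(rng.sample(pool, k))
    return queue

# Hypothetical usage: aim for 30% newcomer edits and 70% established-editor
# edits in the next batch surfaced for community labeling.
candidates = {"edit:1": "newcomer", "edit:2": "established",
              "edit:3": "established", "edit:4": "newcomer"}
print(curation_queue(candidates, {"newcomer": 0.3, "established": 0.7}, n=3))
```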

Conclusion

The Wikibench study marks a pivotal step toward empowering communities in the AI evaluation process. By fostering community-driven dataset curation, Wikibench addresses the critical need for AI tools that align with community norms and values. Its findings illuminate a path forward for HCI systems that support community-driven data curation, ultimately striving for AI tools that enhance rather than disrupt community ecosystems.
