Open Datasheets: Machine-readable Documentation for Open Datasets and Responsible AI Assessments (2312.06153v2)
Abstract: This paper introduces a no-code, machine-readable documentation framework for open datasets, with a focus on responsible AI (RAI) considerations. The framework aims to improve the comprehensibility and usability of open datasets, facilitating easier discovery and use, better understanding of content and context, and evaluation of dataset quality and accuracy. The proposed framework is designed to streamline the evaluation of datasets, helping researchers, data scientists, and other open data users quickly identify datasets that meet their needs and comply with organizational policies or regulations. The paper also discusses the implementation of the framework and provides recommendations to maximize its potential. The framework is expected to enhance the quality and reliability of data used in research and decision-making, fostering the development of more responsible and trustworthy AI systems.
Authors: Anthony Cintron Roman, Jennifer Wortman Vaughan, Valerie See, Steph Ballard, Jehu Torres, Caleb Robinson, Juan M. Lavista Ferres