
Data Authenticity, Consent, & Provenance for AI are all broken: what will it take to fix them? (2404.12691v2)

Published 19 Apr 2024 in cs.AI and cs.CY

Abstract: New capabilities in foundation models are owed in large part to massive, widely-sourced, and under-documented training data collections. Existing practices in data collection have led to challenges in tracing authenticity, verifying consent, preserving privacy, addressing representation and bias, respecting copyright, and overall developing ethical and trustworthy foundation models. In response, regulation is emphasizing the need for training data transparency to understand foundation models' limitations. Based on a large-scale analysis of the foundation model training data landscape and existing solutions, we identify the missing infrastructure to facilitate responsible foundation model development practices. We examine the current shortcomings of common tools for tracing data authenticity, consent, and documentation, and outline how policymakers, developers, and data creators can facilitate responsible foundation model development by adopting universal data provenance standards.
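As a rough illustration of what a machine-readable, universal data provenance standard might encode for a single training document, here is a minimal sketch in Python. The schema, field names (`source_url`, `consent_for_training`, etc.), and the `usable_for_training` filter are hypothetical, invented for this illustration; they are not the paper's proposal or any published standard.

```python
from dataclasses import dataclass, field

# Hypothetical provenance record for one training document.
# All field names are illustrative, not part of any real standard.
@dataclass
class ProvenanceRecord:
    source_url: str                  # where the document was collected from
    creator: str                     # original author or rights holder
    license: str                     # SPDX-style license identifier
    collected_at: str                # ISO 8601 collection timestamp
    consent_for_training: bool       # did the creator opt in to ML training?
    lineage: list = field(default_factory=list)  # upstream datasets, if any

def usable_for_training(record: ProvenanceRecord) -> bool:
    """Naive filter: keep only documents with explicit creator consent
    and a permissive license. A real policy would be far richer."""
    permissive = {"CC0-1.0", "CC-BY-4.0", "MIT"}
    return record.consent_for_training and record.license in permissive

# Example usage
doc = ProvenanceRecord(
    source_url="https://example.com/post/123",
    creator="Jane Doe",
    license="CC-BY-4.0",
    collected_at="2023-11-01T12:00:00Z",
    consent_for_training=True,
)
print(usable_for_training(doc))  # True
```

The point of the sketch is that once consent, licensing, and lineage travel with the data as structured metadata, filtering a training corpus for authenticity and consent becomes a mechanical check rather than a forensic reconstruction.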

Authors (8)
  1. Shayne Longpre (49 papers)
  2. Robert Mahari (16 papers)
  3. Naana Obeng-Marnu (4 papers)
  4. William Brannon (10 papers)
  5. Tobin South (18 papers)
  6. Katy Gero (2 papers)
  7. Sandy Pentland (9 papers)
  8. Jad Kabbara (13 papers)