
Bridging the Data Provenance Gap Across Text, Speech and Video (2412.17847v2)

Published 19 Dec 2024 in cs.AI, cs.CL, cs.CY, cs.LG, and cs.MM

Abstract: Progress in AI is driven largely by the scale and quality of training data. Despite this, there is a deficit of empirical analysis examining the attributes of well-established datasets beyond text. In this work we conduct the largest and first-of-its-kind longitudinal audit across modalities (popular text, speech, and video datasets), from their detailed sourcing trends and use restrictions to their geographical and linguistic representation. Our manual analysis covers nearly 4000 public datasets from 1990 to 2024, spanning 608 languages, 798 sources, 659 organizations, and 67 countries. We find that multimodal machine learning applications have overwhelmingly turned to web-crawled, synthetic, and social media platforms, such as YouTube, for their training sets, eclipsing all other sources since 2019. Second, tracing the chain of dataset derivations, we find that while less than 33% of datasets are restrictively licensed, over 80% of the source content in widely used text, speech, and video datasets carries non-commercial restrictions. Finally, counter to the rising number of languages and geographies represented in public AI training datasets, our audit demonstrates that measures of relative geographical and multilingual representation have failed to significantly improve their coverage since 2013. We believe the breadth of our audit enables us to empirically examine trends in data sourcing, restrictions, and Western-centricity at an ecosystem level, and that visibility into these questions is essential to progress in responsible AI. As a contribution to ongoing improvements in dataset transparency and responsible use, we release our entire multimodal audit, allowing practitioners to trace data provenance across text, speech, and video.

Summary

  • The paper provides an extensive audit of nearly 4000 datasets, examining provenance, licensing, and representation across text, speech, and video modalities.
  • The paper reveals a shift since 2019 toward uncurated sources like web-crawled and synthetic data, highlighting privacy risks and legal inconsistencies.
  • The paper calls for enhanced dataset documentation and standardized provenance frameworks to support ethical and inclusive AI development.

Bridging the Data Provenance Gap Across Text, Speech, and Video

The paper "Bridging the Data Provenance Gap Across Text, Speech, and Video" presents a significant empirical audit of AI training datasets across text, speech, and video modalities. By covering approximately 4000 public datasets, the paper aims to bridge the gap in the understanding of data provenance and representation in AI's multimodal applications. It provides an extensive investigation into the sourcing trends, use restrictions, and geographical and linguistic representation of datasets spanning a period from 1990 to 2024.

The authors find that since 2019, web-crawled, synthetic, and social media platforms like YouTube have become the primary sources of training datasets across modalities. This shift toward uncurated data sources reflects the demand for large, diverse, and timely data to train AI models effectively. However, these sources also bring significant challenges, including privacy risks, copyright issues, and potential biases, complicating legal compliance and ethical use for dataset developers.

The paper highlights a notable inconsistency between the licenses attached to datasets and the restrictions imposed by their underlying sources. While fewer than a third of datasets carry explicitly restrictive licenses, the source content behind more than 80% of widely used datasets mandates non-commercial use, calling into question the clarity and transparency of data usage rights. Strikingly, 99.8% of text dataset tokens come from content with source-imposed restrictions, yet this is often inadequately documented in dataset licenses.
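
To make this license-versus-source gap concrete, the sketch below shows one way such a reconciliation could be computed: take the most restrictive term along a dataset's derivation chain and compare token-weighted shares under dataset licenses alone versus traced source restrictions. The record structure, field names, and license categories are illustrative assumptions, not the paper's released schema.

```python
from dataclasses import dataclass

# Hypothetical records: each dataset declares a license, but its underlying
# sources may impose their own (often stricter) terms of use.
@dataclass
class DatasetRecord:
    name: str
    declared_license: str           # e.g. "apache-2.0", "cc-by-nc-4.0"
    source_restrictions: list[str]  # terms inherited from source content
    tokens: int                     # size, for token-weighted statistics

NON_COMMERCIAL = {"cc-by-nc-4.0", "cc-by-nc-sa-4.0", "platform-tos", "research-only"}

def effective_restriction(record: DatasetRecord) -> str:
    """Take the most restrictive condition along the derivation chain."""
    terms = [record.declared_license, *record.source_restrictions]
    return "non-commercial" if any(t in NON_COMMERCIAL for t in terms) else "permissive"

def token_weighted_shares(records: list[DatasetRecord]) -> tuple[float, float]:
    """Share of tokens flagged by dataset licenses alone vs. the share
    flagged once source-imposed restrictions are traced through."""
    total = sum(r.tokens for r in records)
    by_license = sum(r.tokens for r in records if r.declared_license in NON_COMMERCIAL)
    by_source = sum(r.tokens for r in records if effective_restriction(r) == "non-commercial")
    return by_license / total, by_source / total
```

The gap the paper reports corresponds to the second share being far larger than the first: a dataset may ship under a permissive license while most of its tokens trace back to restricted source content.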

Moreover, despite an increase over time in the number of languages and geographical locales covered, relative representation across datasets has improved little: data sourcing remains Western-centric rather than extending equitably to other regions. Geographic diversity in particular remains limited, with dataset creators predominantly based in North America, Europe, and East Asia, underscoring a persistent lack of representation from Africa and South America.
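
One way to quantify "relative representation" is a concentration measure such as a Gini coefficient over per-country (or per-language) dataset counts, where 0 indicates perfectly even coverage and values near 1 indicate that a few regions dominate. This is a minimal sketch under that assumption; the paper's exact metric may differ, and the counts below are made up for illustration.

```python
def gini(counts: list[int]) -> float:
    """Gini coefficient of a distribution of dataset counts, e.g. per country.
    0.0 = perfectly even coverage; values near 1.0 = concentrated coverage."""
    xs = sorted(counts)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard formula based on the cumulative ordered distribution.
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * cum) / (n * total) - (n + 1) / n

# Illustrative (made-up) per-region dataset counts: heavy concentration in a
# few creator regions yields a high Gini, i.e. low relative representation.
print(gini([350, 280, 120, 15, 8, 5]))   # ~0.57, skewed coverage
print(gini([100, 100, 100, 100, 100]))   # 0.0, perfectly even coverage
```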

These findings carry both theoretical and practical implications. Because dataset transparency is insufficient, practitioners face an arduous task in navigating data provenance and ensuring compliance with ethical and legal standards. Improving dataset documentation is therefore crucial to responsible AI development that balances innovation with social responsibility.

In terms of future developments, the paper suggests that empirical audits of this kind can lay the groundwork for standardized data documentation frameworks. By releasing their audit, the authors contribute a valuable resource for ongoing research and a step toward greater transparency in the AI data ecosystem. There is an implicit call for improved data practices encompassing ethical sourcing, broader representation, and clearer licensing consistent with source restrictions, all of which are essential for building equitable and capable AI systems.
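
As a usage sketch, a practitioner could filter the released audit for datasets in a given modality whose traced sources impose non-commercial terms. The file name and column names below are placeholders, since the actual release defines its own format and schema.

```python
import csv

# Placeholder path and columns; the released audit defines its own schema.
AUDIT_CSV = "multimodal_audit.csv"

def restricted_speech_datasets(path: str = AUDIT_CSV) -> list[str]:
    """Names of speech datasets whose traced source content carries
    non-commercial restrictions, per the (hypothetical) audit columns."""
    hits = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            if row["modality"] == "speech" and row["source_restriction"] == "non-commercial":
                hits.append(row["dataset_name"])
    return hits
```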

Overall, this audit serves as a comprehensive diagnostic of prevalent practices in multimodal AI training data, with an emphasis on the crucial role data sourcing, privacy, and representation play in shaping AI technology's capabilities and ethical bounds.