
Automating metadata collection for published AI datasets

Develop automated methods to collect and verify dataset-level metadata (source URLs, license information, and cryptographic checksums) for previously published AI training and fine-tuning datasets at web scale, enabling auditing and provenance tracking.


Background

Large-scale AI datasets often lack complete metadata, such as links to original sources or license details, which complicates auditing, provenance tracking, and license compliance. The paper highlights the need for infrastructure that enables reliable analysis of terabyte-scale datasets, including mechanisms to verify that datasets have not been altered since publication.

Automating metadata collection would facilitate responsible data practices, reduce legal ambiguity, and support robust dataset auditing by external parties.
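One building block of such automation can be sketched in Python: streaming a cryptographic checksum over a dataset file and bundling it with source and license fields into a metadata record. This is a minimal illustration, not a proposed standard; the record fields, function names, and example URL are assumptions for demonstration.

```python
import hashlib
from pathlib import Path


def sha256_checksum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute a SHA-256 checksum, streaming in 1 MiB chunks so that
    terabyte-scale files never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_metadata_record(path: Path, source_url: str, license_id: str) -> dict:
    """Assemble a minimal dataset-level metadata record.
    The field names here are illustrative, not a standardized schema."""
    return {
        "file": path.name,
        "size_bytes": path.stat().st_size,
        "sha256": sha256_checksum(path),
        "source_url": source_url,       # hypothetical provenance field
        "license": license_id,          # e.g. an SPDX identifier
    }


def verify_checksum(path: Path, recorded_sha256: str) -> bool:
    """Recompute the checksum to confirm the file is unaltered."""
    return sha256_checksum(path) == recorded_sha256
```

An auditor could publish such records alongside a dataset and later call `verify_checksum` to confirm that a downloaded copy matches the published one; collecting `source_url` and `license` automatically and reliably remains the open part of the problem.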

References

An open research question is how the collection of metadata for previously published datasets could be automated.

Open Problems in Technical AI Governance (2407.14981 - Reuel et al., 20 Jul 2024) in Section 3.1.2 “Infrastructure and Metadata to Analyze Large Datasets”