Automating metadata collection for published AI datasets
Develop automated methods that collect and verify dataset-level metadata (source URLs, license information, and cryptographic checksums) for previously published AI training and fine-tuning datasets. The methods should operate at web scale so that these datasets can be audited and their provenance tracked.
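As a minimal sketch of what one such metadata record and its verification step might look like, the Python below computes a SHA-256 checksum for a locally downloaded dataset file and stores it alongside a source URL and license identifier. The `DatasetMetadata` class, its field names, and the helper functions are hypothetical and chosen only for illustration. They do not come from the cited paper, and the sketch leaves out the harder parts of the problem, such as discovering datasets and extracting license information at web scale.

```python
import hashlib
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class DatasetMetadata:
    """Hypothetical dataset-level record; field names are illustrative only."""
    source_url: str
    license_id: str  # assumed to be a short identifier such as "CC-BY-4.0"
    sha256: str      # hex digest of the dataset file


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def collect_metadata(path: Path, source_url: str, license_id: str) -> DatasetMetadata:
    """Build a record for a file that has already been downloaded to `path`."""
    return DatasetMetadata(source_url, license_id, sha256_of_file(path))


def verify_checksum(record: DatasetMetadata, path: Path) -> bool:
    """Return True if the file at `path` still matches the recorded checksum."""
    return sha256_of_file(path) == record.sha256
```

In this sketch, an auditor who holds a `DatasetMetadata` record could call `verify_checksum` on a fresh download to check that the file is byte-for-byte the one that was originally recorded. A mismatch would indicate that the file was modified or replaced, although it would not say how.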
References
An open research question is how the collection of metadata for previously published datasets could be automated.
Open Problems in Technical AI Governance (2407.14981, Reuel et al., 20 Jul 2024), Section 3.1.2 “Infrastructure and Metadata to Analyze Large Datasets”