Using Large Language Models to Enrich the Documentation of Datasets for Machine Learning (2404.15320v2)
Abstract: Recent regulatory initiatives like the European AI Act and relevant voices in the Machine Learning (ML) community stress the need to describe datasets along several key dimensions for trustworthy AI, such as provenance processes and social concerns. However, this information is typically presented as unstructured text in the accompanying documentation, which hampers its automated analysis and processing. In this work, we explore using Large Language Models (LLMs) and a set of prompting strategies to automatically extract these dimensions from documents and enrich the dataset description with them. Our approach could help data publishers and practitioners create machine-readable documentation to improve the discoverability of their datasets, assess their compliance with current AI regulations, and improve the overall quality of ML models trained on them. We evaluate the approach on 12 scientific dataset papers published in two scientific journals (Nature's Scientific Data and Elsevier's Data in Brief) using two different LLMs (GPT3.5 and Flan-UL2). Results show good accuracy with our prompt extraction strategies. Concrete results vary depending on the dimension, but overall, GPT3.5 shows slightly better accuracy (81.21%) than Flan-UL2 (69.13%), although it is more prone to hallucinations. We have released an open-source tool implementing our approach and a replication package, including the experiments' code and results, in an open-source repository.
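To make the idea of prompt-based dimension extraction concrete, the following is a minimal sketch, not the authors' implementation. It assumes one question per dimension, a prompt that restricts the model to the supplied document text, and a machine-readable dictionary as output. The dimension names come from the abstract; the question wording, the `NOT_FOUND` marker, and the functions `build_prompt`, `extract_dimensions`, and `stub_llm` are hypothetical names introduced only for illustration.

```python
import json
from typing import Callable, Dict

# Hypothetical questions: the two dimension names appear in the abstract,
# but the question wording is an assumption made for this sketch.
DIMENSION_QUESTIONS: Dict[str, str] = {
    "provenance": "How was the data collected and annotated, and by whom?",
    "social_concerns": "Does the documentation mention biases, privacy, or sensitive data?",
}

# Assumed marker the model is asked to return when the document is silent.
NOT_FOUND = "Not provided"


def build_prompt(question: str, document_text: str) -> str:
    """Compose a prompt that asks the model to answer only from the document."""
    return (
        "Answer the question using only the context below. "
        f"If the context does not contain the answer, reply '{NOT_FOUND}'.\n\n"
        f"Context:\n{document_text}\n\n"
        f"Question: {question}\nAnswer:"
    )


def extract_dimensions(document_text: str, llm: Callable[[str], str]) -> Dict[str, str]:
    """Ask one question per dimension and collect the answers in a dict."""
    return {
        name: llm(build_prompt(question, document_text)).strip()
        for name, question in DIMENSION_QUESTIONS.items()
    }


def stub_llm(prompt: str) -> str:
    """Placeholder model so the sketch runs offline; replace with a real client."""
    return NOT_FOUND


if __name__ == "__main__":
    doc = "The images were collected with camera traps and labelled by two annotators."
    print(json.dumps(extract_dimensions(doc, stub_llm), indent=2))
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM provider. Asking for an explicit "not provided" answer is one simple way to discourage invented answers when the documentation is silent on a dimension.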