BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing (2206.15076v1)

Published 30 Jun 2022 in cs.CL

Abstract: Training and evaluating LLMs increasingly requires the construction of meta-datasets: diverse collections of curated data with clear provenance. Natural language prompting has recently led to improved zero-shot generalization by transforming existing, supervised datasets into a diversity of novel pretraining tasks, highlighting the benefits of meta-dataset curation. While successful in general-domain text, translating these data-centric approaches to biomedical language modeling remains challenging, as labeled biomedical datasets are significantly underrepresented in popular data hubs. To address this challenge, we introduce BigBIO, a community library of 126+ biomedical NLP datasets, currently covering 12 task categories and 10+ languages. BigBIO facilitates reproducible meta-dataset curation via programmatic access to datasets and their metadata, and is compatible with current platforms for prompt engineering and end-to-end few/zero-shot LLM evaluation. We discuss our process for task schema harmonization, data auditing, and contribution guidelines, and outline two illustrative use cases: zero-shot evaluation of biomedical prompts and large-scale, multi-task learning. BigBIO is an ongoing community effort and is available at https://github.com/bigscience-workshop/biomedical

Citations (42)

Summary

  • The paper introduces a framework that programmatically curates over 126 biomedical NLP datasets for reproducible meta-dataset creation.
  • It harmonizes task schemas to enable seamless integration with tools like Hugging Face and PromptSource for prompt engineering and multitask evaluations.
  • BigBIO enhances research by supporting zero-shot evaluations and multilingual datasets, reducing experimental setup costs in biomedical NLP.

An Overview of BigBIO: A Framework for Biomedical NLP

The paper "BigBIO: A Framework for Data-Centric Biomedical Natural Language Processing" introduces a comprehensive approach to facilitate the reproducible curation of meta-datasets for biomedical NLP tasks. The focus is on addressing the significant underrepresentation of labeled biomedical datasets in existing data hubs, a challenge that restricts the development of advanced LLMs for biomedical applications.

Key Contributions

BigBIO is a community-driven library comprising over 126 biomedical NLP datasets spanning 12 task categories and more than 10 languages. The library aims to streamline reproducible meta-dataset curation via programmatic access to datasets and their metadata, aligning with contemporary platforms for prompt engineering and evaluation in zero/few-shot settings.

The paper delineates several principal contributions:

  1. Programmatic Access: Providing access to 126+ biomedical datasets through a unified interface, facilitating integration into machine learning workflows (see the loading sketch after this list).
  2. Task Schema Harmonization: Establishing lightweight schemas that support common NLP tasks, allowing for both preservation of the original dataset format and harmonized access for tasks like prompt engineering.
  3. Community Infrastructure: Developing tools and guidelines to encourage contributions from the community, ensuring the library remains an extensible resource.
  4. Integrations: BigBIO integrates with tools like Hugging Face’s datasets library and PromptSource, supporting comprehensive task evaluation and prompt engineering workflows.
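To make the programmatic-access and harmonization points concrete, below is a minimal loading sketch using Hugging Face's datasets library. The dataset name (bc5cdr) and the source/bigbio_kb configuration names follow BigBIO's usual naming pattern but should be treated as illustrative; the repository lists the exact identifiers for each dataset.

```python
# Minimal loading sketch, assuming the `datasets` library is installed and the
# dataset is available on the Hugging Face Hub under the `bigbio` organization.
# The dataset and configuration names are illustrative of BigBIO's convention
# (<name>_source and <name>_bigbio_<schema>); check the repository for exact names.
from datasets import load_dataset

# Original ("source") view of the dataset, preserving its native annotation format.
source = load_dataset("bigbio/bc5cdr", name="bc5cdr_source", split="train")

# Harmonized ("bigbio") view of the same dataset, mapped to the shared
# knowledge-base (kb) schema, so it can be pooled with other kb-style datasets.
harmonized = load_dataset("bigbio/bc5cdr", name="bc5cdr_bigbio_kb", split="train")

print(source[0].keys())
print(harmonized[0].keys())  # e.g. passages, entities, relations, ...
```

Because every harmonized dataset within a task category shares one schema, the same downstream code for tokenization, prompting, or evaluation can be reused across all of them.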

Use Cases

Two illustrative use cases are described:

  1. Zero-Shot Evaluation: The framework is employed to evaluate LLMs using biomedical prompts, showcasing improved zero-shot capabilities (a prompting sketch follows this list).
  2. Large-Scale Multi-Task Learning (MTL): Demonstrating the efficiency gains from MTL on 100+ tasks, highlighting reductions in engineering costs and setup time for large-scale experiments.
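As a rough illustration of the first use case, the sketch below wraps a biomedical entailment-style example in a natural-language prompt and scores it with an instruction-tuned seq2seq model, with no task-specific fine-tuning. The checkpoint, prompt wording, and example text are stand-ins; the paper's experiments rely on PromptSource templates and models such as T0++.

```python
# Hedged zero-shot prompting sketch, assuming `transformers` is installed.
# The checkpoint and prompt template are illustrative, not the paper's setup.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "bigscience/T0_3B"  # smaller stand-in for T0++
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def zero_shot_entailment(premise: str, hypothesis: str) -> str:
    # Wrap the example in a natural-language prompt; no fine-tuning involved.
    prompt = (
        f"{premise}\nQuestion: Does the passage above support the following claim? "
        f"\"{hypothesis}\"\nAnswer yes or no."
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=5)
    return tokenizer.decode(output[0], skip_special_tokens=True)

print(zero_shot_entailment(
    "Aspirin irreversibly inhibits cyclooxygenase-1.",
    "Aspirin has no effect on cyclooxygenase enzymes.",
))
```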

Methodological Insights

BigBIO emphasizes the importance of a data-centric approach, stressing the value of curated, high-quality datasets. The paper discusses several foundational steps in dataset curation, including schema development, data integration, and quality assurance through unit testing. These steps help ensure the library's contents are reliable and directly usable for training and evaluating LLMs.
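The unit-testing step can be pictured with a small, hypothetical schema check: for a knowledge-base (kb) style example, verify that each entity's character offsets actually point at its surface text. The field names follow the harmonized kb schema, but this particular check is an illustration rather than the project's actual test suite.

```python
# Hedged sketch of a schema-level quality check, assuming a kb-schema example
# whose "passages" and "entities" carry document-level character offsets.
# This mirrors the kind of consistency test the paper describes, but it is an
# illustration, not BigBIO's actual test code.
def check_entity_offsets(example: dict) -> None:
    # Reconstruct the document as a sparse character map from passage offsets.
    chars = {}
    for passage in example["passages"]:
        for (start, _end), text in zip(passage["offsets"], passage["text"]):
            for i, ch in enumerate(text):
                chars[start + i] = ch

    # Every entity mention must match the text recovered at its offsets.
    for entity in example["entities"]:
        for (start, end), surface in zip(entity["offsets"], entity["text"]):
            recovered = "".join(chars.get(i, " ") for i in range(start, end))
            assert recovered == surface, (
                f"Offset mismatch for entity {entity['id']}: "
                f"expected {surface!r}, got {recovered!r}"
            )
```

A check like this would typically run over every example of every split before a new dataset is merged into the library.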

Results and Implications

The framework shows promising results in facilitating biomedical NLP research:

  • Enhanced Generalization: Improved generalization in biomedical contexts, as demonstrated in zero-shot evaluations with models like T0++.
  • Support for Multilingual Datasets: Expanding biomedical research potential by including datasets in multiple languages beyond English.

BigBIO’s toolset allows researchers to more effectively experiment with prompt-based methodologies and multitask learning paradigms, critical areas in NLP's future development. The implications extend beyond practical applications, providing a robust infrastructure for theoretical advancements in understanding model generalization across diverse biomedical tasks.

Conclusion and Future Directions

BigBIO represents a significant step forward in the curation and accessibility of biomedical NLP datasets. By enabling reproducible workflows and supporting extensive dataset integration, BigBIO effectively bridges the gap between data scarcity and the development of capable biomedical LLMs.

Future work could focus on expanding the library with additional datasets, particularly from underrepresented languages and domains, further enhancing its utility as a resource for the biomedical NLP community. Efforts could also be directed at advancing prompt engineering tooling and better understanding multilingual NLP performance within biomedical applications.