
BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature

Published 13 Jan 2025 in cs.CV and cs.CL | (2501.07171v3)

Abstract: The development of vision-language models (VLMs) is driven by large-scale and diverse multimodal datasets. However, progress toward generalist biomedical VLMs is limited by the lack of annotated, publicly accessible datasets across biology and medicine. Existing efforts are restricted to narrow domains, missing the full diversity of biomedical knowledge encoded in scientific literature. To address this gap, we introduce BIOMEDICA, a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset. Our framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles. Metadata and expert-guided annotations are also provided. We demonstrate the utility and accessibility of our resource by releasing BMCA-CLIP, a suite of CLIP-style models continuously pre-trained on the BIOMEDICA dataset via streaming, eliminating the need to download 27 TB of data locally. On average, our models achieve state-of-the-art performance across 40 tasks - spanning pathology, radiology, ophthalmology, dermatology, surgery, molecular biology, parasitology, and cell biology - excelling in zero-shot classification with a 6.56% average improvement (as high as 29.8% and 17.5% in dermatology and ophthalmology, respectively), and stronger image-text retrieval, all while using 10x less compute. To foster reproducibility and collaboration, we release our codebase and dataset for the broader research community.

Summary

  • The paper presents an open-source framework that curates over 24M biomedical image-caption pairs from PubMed Central, enabling efficient VLM training.
  • It employs a sophisticated curation pipeline that extracts metadata and clusters images into 12 global and 170 local concepts with 27 unique metadata fields.
  • The BMCA-CLIP models trained on BIOMEDICA achieve up to 29.8% improvement in dermatology tasks while reducing compute requirements by 10x.

BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature

Introduction

The development of generalist vision-language models (VLMs) for biomedical domains has been impeded by the lack of extensive and publicly accessible datasets that span the diverse fields of biology and medicine. "BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature" (2501.07171) presents a solution by offering an open-source framework that curates over 24 million unique image-text pairs from the PubMed Central Open Access subset, encompassing more than 6 million articles. By leveraging this extensive dataset, the framework enables the creation of generalized biomedical VLMs that promise expert-level performance across numerous tasks while requiring substantially fewer computational resources.

Figure 1: Overlap of BIOMEDICA dataset with the landscape of biomedical research, highlighting the diversity of topics covered.

Curation and Scope of BIOMEDICA

BIOMEDICA employs a sophisticated curation pipeline to extract and organize metadata, captions, and images from the massive corpus of the PMC-OA repository. The pipeline involves several key stages: extraction of metadata and initial data processing, feature generation using DINOv2, and annotation of clusters by domain experts, resulting in the identification of 12 global concepts and 170 finer-grained local concepts.

Figure 2: BIOMEDICA curation pipeline showcasing the extract, transform, and load stages.
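As a rough illustration of the feature-generation and clustering stage, the sketch below embeds figure images with a DINOv2 backbone and groups the embeddings with k-means so that domain experts can label the resulting clusters. The backbone variant, preprocessing, clustering algorithm, and file names are assumptions for illustration, not the authors' exact pipeline.

```python
import torch
from PIL import Image
from torchvision import transforms
from sklearn.cluster import KMeans

# Load a DINOv2 backbone from torch hub (ViT-S/14 chosen for illustration).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(paths):
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in paths])
    return model(batch)  # (N, 384) image embeddings for ViT-S/14

# In practice the full image corpus would be embedded before clustering;
# the paths here are hypothetical placeholders.
features = embed(["figure_0001.jpg", "figure_0002.jpg"])

# Cluster the embeddings; experts then assign concept labels to each cluster.
global_clusters = KMeans(n_clusters=12, random_state=0).fit_predict(features.numpy())
```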

Each data point in the BIOMEDICA dataset is enriched with over 27 unique metadata fields, facilitating detailed querying and filtering. For computational efficiency, the dataset is stored in a dual format, supporting both high-throughput streaming via WebDataset and querying through Parquet, achieving notable input/output rate improvements.
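The snippet below is a hedged sketch of those two access patterns: streaming image-caption pairs through the webdataset library and filtering metadata from Parquet with pandas. The shard URL, Parquet file name, and metadata field names are placeholders rather than the dataset's published layout.

```python
import webdataset as wds
import pandas as pd

# High-throughput streaming of image-caption pairs from WebDataset shards.
shards = "https://example.org/biomedica/shard-{000000..000009}.tar"  # placeholder
pairs = wds.WebDataset(shards).decode("pil").to_tuple("jpg", "txt")
image, caption = next(iter(pairs))  # pull one (image, caption) sample

# Metadata-level querying and filtering through Parquet, e.g. restricting
# the archive to entries tagged with a particular global concept.
meta = pd.read_parquet("biomedica_metadata.parquet")          # placeholder file
subset = meta[meta["global_concept"] == "clinical imaging"]   # hypothetical field
```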

Model Development and Performance

The release of BMCA-CLIP models underscores the applicability of BIOMEDICA in training VLMs. These models are continually pre-trained via streaming on the BIOMEDICA dataset, which avoids the need to download the voluminous data locally. The models achieve notable performance gains, with an average improvement of 6.56% over prior state-of-the-art models, reaching up to 29.8% on dermatology tasks.

Figure 3: Average model performance of best BMCA-CLIP models compared to prior work, illustrating significant advancements across different biomedical tasks.

This performance is coupled with a reduction in computational demand, using 10 times less compute than predecessor models, thanks to strategic data curation and the high data throughput of BIOMEDICA's storage architecture.
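To make the streaming training setup concrete, here is a minimal sketch of continual CLIP-style pretraining over streamed shards using open_clip and webdataset. The starting checkpoint, shard URL, batch size, and learning rate are illustrative assumptions, not the reported BMCA-CLIP recipe.

```python
import torch
import open_clip
import webdataset as wds

# Start from an existing CLIP checkpoint (example choice, not the paper's).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="laion2b_s32b_b82k"
)
tokenizer = open_clip.get_tokenizer("ViT-L-14")
loss_fn = open_clip.ClipLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def tokenize(caption):
    return tokenizer([caption])[0]

# Stream (image, caption) pairs directly from remote shards; no local copy.
dataset = (
    wds.WebDataset("https://example.org/biomedica/shard-{000000..000009}.tar")
    .decode("pil")
    .to_tuple("jpg", "txt")
    .map_tuple(preprocess, tokenize)
    .batched(64)
)

model.train()
for images, texts in wds.WebLoader(dataset, batch_size=None, num_workers=4):
    image_features, text_features, logit_scale = model(images, texts)
    loss = loss_fn(image_features, text_features, logit_scale)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```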

Evaluation and Deployment Implications

The development outcomes include extensive evaluation benchmarks covering 40 established biomedical datasets spanning pathology, radiology, molecular biology, and more. This enables a robust evaluation of the VLMs against a comprehensive suite of retrieval and classification tasks, affirming their effectiveness and generalizability.

Figure 4: A selection of diverse image types from the BIOMEDICA dataset and the structure of its taxonomy.

The models are optimized for zero-shot inference, which is crucial in scenarios where computational resources are limited and rapid deployment is critical. BIOMEDICA therefore offers a significant contribution to the democratization of AI in biomedicine, providing a comprehensive resource for developing highly efficient, adaptable models across a variety of biomedical domains.
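A zero-shot classification call with such a model might look like the following open_clip sketch; the checkpoint identifier, prompts, and input file are placeholders rather than the released BMCA-CLIP configuration.

```python
import torch
import open_clip
from PIL import Image

# Example checkpoint; swap in a BMCA-CLIP checkpoint once downloaded.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="laion2b_s32b_b82k"
)
tokenizer = open_clip.get_tokenizer("ViT-L-14")

labels = ["a dermoscopy image of melanoma",
          "a dermoscopy image of a benign nevus"]      # hypothetical prompts
image = preprocess(Image.open("lesion.jpg")).unsqueeze(0)  # hypothetical input
text = tokenizer(labels)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```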

Conclusion

BIOMEDICA presents a critical step forward in biomedical vision-language modeling, offering a unique dataset with extensive scale, rich metadata, and optimized access strategies, invaluable for both foundational model training and downstream application development. Future work may focus on expanding the dataset and exploring further integration with other biomedical resources to enhance model diversity and accuracy. As the dataset is available to the research community, BIOMEDICA is positioned to significantly advance biomedical AI across both academia and industry.
