BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature (2501.07171v3)

Published 13 Jan 2025 in cs.CV and cs.CL

Abstract: The development of vision-language models (VLMs) is driven by large-scale and diverse multimodal datasets. However, progress toward generalist biomedical VLMs is limited by the lack of annotated, publicly accessible datasets across biology and medicine. Existing efforts are restricted to narrow domains, missing the full diversity of biomedical knowledge encoded in scientific literature. To address this gap, we introduce BIOMEDICA, a scalable, open-source framework to extract, annotate, and serialize the entirety of the PubMed Central Open Access subset into an easy-to-use, publicly accessible dataset. Our framework produces a comprehensive archive with over 24 million unique image-text pairs from over 6 million articles. Metadata and expert-guided annotations are also provided. We demonstrate the utility and accessibility of our resource by releasing BMCA-CLIP, a suite of CLIP-style models continuously pre-trained on the BIOMEDICA dataset via streaming, eliminating the need to download 27 TB of data locally. On average, our models achieve state-of-the-art performance across 40 tasks - spanning pathology, radiology, ophthalmology, dermatology, surgery, molecular biology, parasitology, and cell biology - excelling in zero-shot classification with a 6.56% average improvement (as high as 29.8% and 17.5% in dermatology and ophthalmology, respectively), and stronger image-text retrieval, all while using 10x less compute. To foster reproducibility and collaboration, we release our codebase and dataset for the broader research community.

Summary

The paper introduces BIOMEDICA, detailing a dataset of over 24 million image-caption pairs to advance biomedical vision-language models.
It combines automated clustering of DINOv2 features with manual expert annotation to classify images into 12 global and 170 local biomedical concepts.
BMCA-CLIP models trained on this dataset achieve notable zero-shot classification gains, with improvements up to +29.8% in dermatology and +17.5% in ophthalmology.

Overview of BIOMEDICA: A Biomedical Image-Caption Dataset

The paper "BIOMEDICA: An Open Biomedical Image-Caption Archive, Dataset, and Vision-Language Models Derived from Scientific Literature" presents a framework for building and evaluating vision-language models (VLMs) in the biomedical domain. It addresses the lack of annotated, publicly accessible biomedical datasets by providing a comprehensive, annotated, and openly available dataset derived from the PubMed Central Open Access (PMC-OA) subset, which covers a broad spectrum of biomedical literature.

Core Contributions and Dataset Description

BIOMEDICA introduces an extensive framework for the extraction, annotation, and serialization of PMC-OA articles into a structured dataset comprising over 24 million unique image-caption pairs derived from more than six million articles. The dataset is enhanced with 27 metadata fields and is designed to be easily accessible to the research community through platforms like Hugging Face. This large-scale dataset spans a wide range of biomedical areas, including pathology, radiology, ophthalmology, dermatology, molecular biology, and more, encompassing various image modalities such as microscopy, clinical imaging, and illustrative diagrams.

A key contribution of BIOMEDICA is the breadth and depth of its expert annotations, which classify images into a taxonomy of 12 global concepts and 170 local concepts. The annotation process combines automated clustering of DINOv2 features with manual expert input to ensure coverage and relevance across biomedical subdomains.
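The combination of automated clustering and expert input can be illustrated with a small sketch. It is not the authors' pipeline: it assumes image features have already been extracted (tiny 2-D vectors stand in for DINOv2 embeddings), uses a plain k-means loop, and the `expert_labels` mapping and concept names are hypothetical. The point is only that an expert labels each cluster once and the label then propagates to every image in that cluster.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def kmeans(points, centroids, n_iter=10):
    """Plain k-means: returns (cluster index per point, final centroids)."""
    centroids = [list(c) for c in centroids]
    assignments = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids

# Toy stand-ins for image embeddings (assumption: real features would be
# high-dimensional vectors produced by a vision encoder).
features = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0]]
assignments, _ = kmeans(features, centroids=[features[0], features[2]])

# Hypothetical expert step: one label per cluster, not per image.
expert_labels = {0: "microscopy", 1: "clinical imaging"}
image_labels = [expert_labels[a] for a in assignments]
```

Labeling clusters rather than individual images is what makes expert annotation tractable at this scale: the expert effort grows with the number of clusters, not with the number of images.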

Vision-Language Models and Evaluation

To demonstrate the utility of the BIOMEDICA dataset, the authors introduce BMCA-CLIP, a suite of CLIP-style models continually pre-trained on the dataset. These models perform zero-shot classification and, on average, outperform existing state-of-the-art models across 40 benchmark tasks in various biomedical fields while using 10x less compute. BMCA-CLIP models achieve an average zero-shot classification improvement of 6.56%, with the largest gains in dermatology (up to +29.8%) and ophthalmology (up to +17.5%).
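For readers unfamiliar with how a CLIP-style model classifies without task-specific training, the following is a minimal sketch of the general technique, not the BMCA-CLIP code. It assumes the image and the candidate text prompts have already been embedded; the toy 3-D vectors, the prompt wording, and the `temperature` value are illustrative assumptions. The predicted class is the prompt whose embedding is most similar to the image embedding.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """Score an image against one text prompt per class.

    Returns the best-matching prompt and a softmax distribution over prompts.
    """
    logits = {label: cosine(image_emb, emb) / temperature
              for label, emb in text_embs.items()}
    # Subtract the maximum logit for numerical stability before exponentiating.
    max_logit = max(logits.values())
    exps = {label: math.exp(value - max_logit) for label, value in logits.items()}
    total = sum(exps.values())
    probs = {label: value / total for label, value in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs

# Toy embeddings (assumption: real ones come from the model's image and
# text encoders and have hundreds of dimensions).
image_emb = [0.9, 0.1, 0.0]
text_embs = {
    "a dermoscopy image of melanoma": [1.0, 0.0, 0.0],
    "a dermoscopy image of a benign nevus": [0.0, 1.0, 0.0],
}
label, probs = zero_shot_classify(image_emb, text_embs)
```

Because the classes are expressed as text prompts, the same model can be applied to a new task simply by changing the prompts.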

The paper's evaluation framework recasts classical classification tasks as closed VQA (Visual Question Answering) problems, in which the model selects its answer from a fixed set of candidates. Putting every task into this common format allows a single protocol to assess how well the models generalize across many biomedical imaging tasks.
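One way to picture the conversion from a classification label to a closed VQA item is sketched below. The function name `to_closed_vqa`, the question wording, the number of options, and the class names are all hypothetical; the paper's actual templates may differ. The sketch only shows the general idea of pairing the correct label with distractors drawn from the same label set.

```python
import random
import string

def to_closed_vqa(label, all_labels,
                  question="What is shown in this image?",
                  n_options=4, seed=0):
    """Turn one classification label into a multiple-choice item."""
    rng = random.Random(seed)  # fixed seed keeps the item reproducible
    distractors = [name for name in all_labels if name != label]
    chosen = rng.sample(distractors, min(n_options - 1, len(distractors)))
    options = chosen + [label]
    rng.shuffle(options)  # avoid always placing the answer last
    letters = string.ascii_uppercase[:len(options)]
    lines = [question] + [f"{k}. {text}" for k, text in zip(letters, options)]
    return {
        "prompt": "\n".join(lines),
        "options": dict(zip(letters, options)),
        "answer": letters[options.index(label)],
    }

# Hypothetical label set for a dermatology classification task.
classes = ["melanoma", "nevus", "basal cell carcinoma",
           "dermatofibroma", "keratosis"]
item = to_closed_vqa("melanoma", classes)
```

A model's prediction is then scored as correct when the option it selects matches `item["answer"]`.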

Implications and Future Directions

The BIOMEDICA dataset has promising implications for AI in the biomedical field. Its accessibility and comprehensiveness lower significant barriers to entry in developing robust VLMs for biomedical applications. By supporting streaming-based pre-training, which avoids downloading the full 27 TB of data locally, and efficient metadata filtering, BIOMEDICA enables scalable model training in the resource-constrained environments common in academic research.
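The interplay of streaming and metadata filtering can be sketched with plain Python generators. This is an illustration of the access pattern only, not the dataset's real loader: `toy_stream` stands in for a remote source, and the field names `caption` and `global_concept` are assumptions rather than the dataset's documented schema. Records are pulled one at a time and discarded unless they match, so nothing needs to be stored locally in full.

```python
from itertools import islice

def filter_by_metadata(records, **criteria):
    """Lazily yield records whose metadata matches every key/value given."""
    for record in records:
        if all(record.get(key) == value for key, value in criteria.items()):
            yield record

def toy_stream():
    """Stand-in for a remote, streamed dataset; yields one record at a time."""
    yield {"caption": "H&E stained section of liver tissue",
           "global_concept": "microscopy"}
    yield {"caption": "Chest radiograph, posteroanterior view",
           "global_concept": "clinical imaging"}
    yield {"caption": "Fluorescence image of HeLa cells",
           "global_concept": "microscopy"}
    yield {"caption": "Study design flowchart",
           "global_concept": "illustrative diagrams"}

# Only matching records are materialized; the rest are skipped on the fly.
microscopy = filter_by_metadata(toy_stream(), global_concept="microscopy")
first_two = list(islice(microscopy, 2))
captions = [record["caption"] for record in first_two]
```

The same pattern applies when the source is a streamed dataset iterator: filtering happens as records arrive, so a training run can target one subdomain without first downloading the whole archive.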

The dataset and associated models have the potential to bridge informational gaps in clinical settings by enhancing the retrieval of visual and textual knowledge from scientific literature. Such capabilities could streamline clinical decision-making, assist in forensic diagnosis, or expedite biomedical research by linking visual observations to textual insights.

Considering future developments, the authors suggest that further exploration into model architectures, like expanding context-length limits and improving handling of varied image qualities, could yield additional performance gains. Furthermore, the continual updating of the dataset in alignment with the growing body of biomedical literature will ensure ongoing relevance and applicability.

In conclusion, BIOMEDICA represents a significant step towards democratizing access to large-scale, high-quality biomedical datasets, accelerating the integration of AI technologies in healthcare, and fostering collaboration throughout the scientific community.
