BIOMEDICA: Open Biomedical Multimodal Dataset

Updated 11 August 2025
  • BIOMEDICA is a large-scale, open-access biomedical multimodal corpus featuring 24M image-caption pairs and 6M articles enriched with expert semantic annotations.
  • It employs a robust extraction pipeline with unsupervised clustering, expert-guided hierarchical labeling, and efficient data serialization for scalable AI training and retrieval.
  • The dataset supports state-of-the-art vision-language model benchmarking and biomedical retrieval applications with detailed metadata and optimized streaming APIs.

The BIOMEDICA dataset is a large-scale, open-access biomedical multimodal corpus constructed from the PubMed Central Open Access (PMC-OA) literature, comprising over 24 million image–caption pairs and more than 6 million full-text articles, each enriched with detailed metadata and expert semantic annotations. Developed to address the paucity of generalist vision-language resources in biomedicine, BIOMEDICA is designed for high-throughput streaming, scalable AI training, fine-grained concept retrieval, and advanced model benchmarking across the biomedical spectrum. Its construction combines unsupervised visual clustering, expert-guided hierarchical labeling, and efficient data serialization, establishing it as a foundational resource for biomedical vision-language modeling, retrieval, and generative AI research (Lozano et al., 13 Jan 2025, Lozano et al., 26 Mar 2025).

1. Corpus Composition and Metadata

BIOMEDICA is sourced from the full corpus of the PMC-OA subset, systematically extracting every figure (image) and its associated caption from eligible scientific articles. Each record is organized as an image–caption pair with an extensive set of metadata:

  • Bibliographic details: PMID, publication title, abstract, date, license, journal, and article type.
  • Content descriptors: figure captions, in-text figure references, keywords, and MeSH terms.
  • Expert annotation: Each pair is assigned at least one global class (from 13 high-level categories such as Clinical Imaging, Microscopy, Plots and Charts, Immuno Assays) and one or more local classes (170 fine-grained semantic types, e.g., x-ray radiography, chromatogram).
  • Technical statistics: Median caption length ~64 tokens; median image dimensions ~709×476 pixels; 22–27 metadata fields per record, depending on dataset version.

The dataset is serialized in two forms: (1) a query-friendly Parquet format and (2) WebDataset archives for stream-based loading. Each image–caption pair is a standalone record, enabling granular retrieval and fine-tuned dataset construction for diverse downstream tasks.
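A minimal sketch of stream-based loading from the WebDataset archives, assuming the `webdataset` library and a hypothetical shard URL pattern (the actual shard paths are documented with the dataset release):

```python
import webdataset as wds

# Hypothetical shard URL pattern; substitute the real BIOMEDICA shard paths.
shards = "https://example.org/biomedica/shard-{000000..000099}.tar"

# Each record bundles a .jpg image, a .txt caption, and a .json metadata
# file under a shared key, so they decode together as one sample.
dataset = (
    wds.WebDataset(shards)
    .decode("pil")                    # decode .jpg entries to PIL images
    .to_tuple("jpg", "txt", "json")   # (image, caption, metadata) tuples
)

for image, caption, metadata in dataset:
    print(image.size, caption[:80], metadata.get("license"))
    break  # samples stream lazily; no local copy of the full corpus needed
```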

2. Extraction Pipeline and Data Serialization

The extraction workflow begins with enumerating every compressed nXML package and associated media directory in the PMC-OA FTP repository, using a master CSV index for randomized sampling and metadata synchronization. Key steps include:

  • Downloading nXML and image resources via ftplib, with backoff-aware throttling (three requests/second/IP).
  • Parsing each article’s nXML with XML libraries to extract titles, abstracts, full text, MeSH terms (via the Entrez API, batch-queried at up to 200 PMIDs/request), all captions, and inline figure references.
  • For images, only allowed file formats (primarily JPEG) are retained; each is linked to its parent caption and all relevant article-level metadata.
  • Records are serialized at figure granularity. In the WebDataset format, each tar archive contains up to 10,000 image–caption pairs, with each pair split into separate files (e.g., .jpg for the image, .txt for the caption, plus JSON for metadata). This design supports high-throughput, lazy streaming, facilitating efficient model training and large-scale retrieval without requiring a local download of the entire 27–30 TB dataset (Lozano et al., 13 Jan 2025, Lozano et al., 26 Mar 2025); a serialization sketch follows this list.
  • The standardized format ensures compatibility with major deep learning frameworks, enabling seamless integration into multimodal training pipelines.
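A minimal sketch of figure-granularity serialization into capped tar shards, assuming the `webdataset` library's `ShardWriter`; the record fields shown are illustrative stand-ins, not the dataset's exact schema:

```python
import json
import webdataset as wds

# Illustrative records; in the real pipeline these come from nXML parsing.
records = [
    {
        "key": "PMC1234567_fig1",             # hypothetical article/figure ID
        "image_path": "PMC1234567/fig1.jpg",
        "caption": "Representative H&E-stained tissue section ...",
        "metadata": {"pmid": "1234567", "license": "CC BY"},
    },
]

# ShardWriter rolls over to a new tar file after maxcount samples,
# matching the up-to-10,000-pairs-per-archive layout described above.
with wds.ShardWriter("biomedica-%06d.tar", maxcount=10_000) as sink:
    for rec in records:
        with open(rec["image_path"], "rb") as f:
            image_bytes = f.read()
        sink.write({
            "__key__": rec["key"],
            "jpg": image_bytes,                   # raw JPEG bytes
            "txt": rec["caption"],                # caption as plain text
            "json": json.dumps(rec["metadata"]),  # per-record metadata
        })
```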

3. Hierarchical Concept Annotation and Clustering

To enrich each image–caption pair with fine-grained semantic labels, BIOMEDICA deploys a hybrid unsupervised and expert-guided annotation protocol:

  • Embedding: Visual content is embedded using DINOv2 (ViT-L/14 distilled), generating 1024-dimensional vectors.
  • Dimensionality reduction: Principal Component Analysis (PCA) is applied, with 25 principal components capturing 99% of the variance.
  • Clustering: K-means is run with K=2000, forming homogeneous visual clusters.
  • Expert labeling: Six scientists and two clinicians annotate multiple image samples from each cluster via online forms, assigning:
    • Single/multi-panel status
    • Global class (e.g., Microscopy, Plots)
    • Local class (specific taxonomies within the global class)
  • Label propagation: Cluster-level annotations are mapped to all images in the cluster by cluster UID. Where annotators disagree on global labels, majority voting is used to resolve discrepancies.

These semantic annotations—derived from biomedical ontologies and reflecting expert judgment—support concept-balanced pretraining, targeted data filtering, and precise downstream evaluation.
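A minimal sketch of the embed, reduce, cluster, and propagate stages, assuming precomputed DINOv2 image embeddings and scikit-learn; the cluster-to-label mapping here is a hypothetical stand-in for the expert annotations:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# Assume an (n_images, 1024) array of DINOv2 ViT-L/14 features on disk.
embeddings = np.load("dinov2_embeddings.npy")

# Reduce to 25 principal components (reported to capture ~99% of variance).
reduced = PCA(n_components=25).fit_transform(embeddings)

# Partition into 2000 homogeneous visual clusters.
kmeans = KMeans(n_clusters=2000, random_state=0, n_init="auto")
cluster_ids = kmeans.fit_predict(reduced)

# Hypothetical expert annotations keyed by cluster UID: (global, local) labels.
expert_labels = {0: ("Microscopy", "light microscopy"),
                 1: ("Plots and Charts", "bar plot")}

# Propagate each cluster-level annotation to every image in that cluster.
image_labels = [expert_labels.get(int(c)) for c in cluster_ids]
```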

4. Access, Streaming, and Search APIs

Given the dataset’s scale, BIOMEDICA equips researchers with multiple access modes:

  • Streaming: WebDataset-format archives allow sharded high-throughput streaming from cloud and local resources.
  • Multimodal Indexing: A vector database (e.g., ChromaDB) stores image and text embeddings (produced by foundation models) to support real-time nearest-neighbor retrieval.
  • Keyword Search: BM25s sparse indices are built over both figure captions and full text, optimized through vocabulary restriction and sparse-matrix acceleration (see the retrieval sketch after this list).
  • API Design: Both vector and keyword search are made available through a unified API, enabling retrieval by text or image queries, and batch-format integration for model training or interactive applications.
  • Tutorials and pretrained pipelines (integrated with Hugging Face and the provided codebase) lower the technical barrier for researchers and facilitate direct inclusion in training regimes (Lozano et al., 13 Jan 2025, Lozano et al., 26 Mar 2025).
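A minimal sketch of caption-level keyword retrieval with the `bm25s` library; the corpus here is a toy stand-in for the real caption and full-text indices:

```python
import bm25s

# Toy caption corpus; the real index covers captions and full text.
captions = [
    "Chest x-ray radiography showing bilateral infiltrates.",
    "Fluorescence microscopy of stained tissue sections.",
    "Bar plot comparing treatment response across cohorts.",
]

# Tokenize, build the sparse BM25 index, and retrieve top matches.
retriever = bm25s.BM25()
retriever.index(bm25s.tokenize(captions))

query_tokens = bm25s.tokenize("x-ray of the chest")
results, scores = retriever.retrieve(query_tokens, k=2)
print(results, scores)  # corpus indices of top hits and their BM25 scores
```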

5. Applications: Vision-Language Modeling and Benchmarking

The primary use cases of BIOMEDICA include the training and benchmarking of generalist vision-language models (VLMs) and support for advanced biomedical retrieval and generative modeling:

  • Contrastive Embedding Models: The BMC-CLIP framework utilizes the InfoNCE loss:

    $$\mathcal{L} = \frac{1}{2}\left(\mathcal{L}_I + \mathcal{L}_T\right)$$

    $$\mathcal{L}_I = -\frac{1}{N} \sum_k \log\left[\frac{\exp(S_{k,k})}{\sum_j \exp(S_{k,j})}\right], \qquad S_{k,j} = \frac{\mathrm{sim}\left(z^{\text{image}}_k,\, z^{\text{text}}_j\right)}{\tau}$$

    where $z$ denotes the modality-specific model embeddings and $\tau$ is a trainable temperature parameter; the loss is symmetrized across the image-to-text and text-to-image directions (a minimal PyTorch sketch of this objective appears after this list).

  • Model Performance: Models trained on BIOMEDICA demonstrate state-of-the-art results in zero-shot classification (6.56% improvement on average; up to 29.8% in dermatology and 17.5% in ophthalmology) and in image-text/text-image retrieval (Recall@1, @10, @100), as well as on microscopy composition classification. Confidence intervals are computed via bootstrapping (1,000 resamples, 95% coverage).
  • Instruction-Following and Retrieval-Augmented Generation: By fine-tuning chat-style models (e.g., SmolVLM with LoRA and 10K instruction pairs from captions), the resource enables visual question answering (VQA) and biomedical-specific dialog generation. A Retrieval-Augmented Generation (RAG) pipeline further leverages the hybrid index for precise document retrieval, followed by multi-step summarization and answer synthesis, producing measurable improvements on guideline-based medical QA tasks (Lozano et al., 26 Mar 2025).
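A minimal PyTorch sketch of the symmetric InfoNCE objective above, assuming batches of L2-normalized image and text embeddings; it illustrates the loss form, not BMC-CLIP's exact training code:

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(z_image, z_text, log_tau):
    """Symmetric InfoNCE over a batch of paired embeddings.

    z_image, z_text: (N, D) tensors, assumed L2-normalized.
    log_tau: learnable scalar; tau = exp(log_tau) is the temperature.
    """
    # Similarity matrix S[k, j] = sim(z_image_k, z_text_j) / tau
    sim = z_image @ z_text.t() / torch.exp(log_tau)

    # Matched pairs lie on the diagonal of the similarity matrix.
    targets = torch.arange(sim.size(0), device=sim.device)

    loss_i = F.cross_entropy(sim, targets)        # image-to-text direction
    loss_t = F.cross_entropy(sim.t(), targets)    # text-to-image direction
    return 0.5 * (loss_i + loss_t)

# Toy usage with random embeddings.
z_img = F.normalize(torch.randn(8, 512), dim=-1)
z_txt = F.normalize(torch.randn(8, 512), dim=-1)
log_tau = torch.tensor(0.0, requires_grad=True)
print(symmetric_infonce(z_img, z_txt, log_tau))
```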

6. Research Impact and Utilization

BIOMEDICA’s release, along with pretrained models, evaluation benchmarks, and API implementations, has immediate implications for the biomedical AI research community:

  • Democratizes access to diverse, high-quality, annotated multimodal data spanning all primary biomedical subfields.
  • Supports robust, domain-specific concept balancing during model pretraining and evaluation, a crucial requirement for reliable generalist and specialized clinical AI systems.
  • By providing efficient streaming and search facilities coupled with exhaustive metadata, it empowers large-scale multimodal training, scalable benchmarking, and rapid iteration of both retrieval and generative models.
  • Its clustering and expert labeling methodology enables efficient, scalable semantic enrichment of millions of images with negligible manual annotation costs.

7. Limitations and Future Directions

While BIOMEDICA advances the state of accessible biomedical VLM resources, certain technical limitations and open challenges remain:

  • All images originate from published literature, with inherent domain biases favoring research-centric, rather than in situ clinical, imaging distributions.
  • Annotations depend on clustering quality and sample representativeness; further improvements in visual taxonomy design and sampling granularity may yield more nuanced semantic control.
  • As future iterations add more article sources or expand into multilingual corpora, ongoing evaluation of downstream model generalizability and bias mitigation will be necessary.
  • A plausible implication is that the existing technical infrastructure for streaming, annotation, and embedding will allow researchers to integrate expanded datasets (e.g., experimental or clinical case corpora) with minimal reengineering.

In summary, BIOMEDICA constitutes a foundational, expertly annotated, and technically robust open-source archive for vision-language modeling in the biomedical sciences (Lozano et al., 13 Jan 2025, Lozano et al., 26 Mar 2025). Its design and infrastructure enable scalable, domain-precise training and retrieval, supporting a broad spectrum of AI research and clinical translation efforts.
