
MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine (2408.02900v1)

Published 6 Aug 2024 in cs.CV

Abstract: This paper introduces MedTrinity-25M, a comprehensive, large-scale multimodal dataset for medicine, covering over 25 million images across 10 modalities, with multigranular annotations for more than 65 diseases. These enriched annotations encompass both global textual information, such as disease/lesion type, modality, region-specific descriptions, and inter-regional relationships, and detailed local annotations for regions of interest (ROIs), including bounding boxes and segmentation masks. Unlike existing approaches, which are limited by the availability of image-text pairs, we have developed the first automated pipeline that scales up multimodal data by generating multigranular visual and textual annotations (in the form of image-ROI-description triplets) without the need for any paired text descriptions. Specifically, data from over 90 different sources have been collected, preprocessed, and grounded using domain-specific expert models to identify ROIs related to abnormal regions. We then build a comprehensive knowledge base and prompt multimodal LLMs to perform retrieval-augmented generation with the identified ROIs as guidance, resulting in multigranular textual descriptions. Compared to existing datasets, MedTrinity-25M provides the most enriched annotations, supporting a comprehensive range of multimodal tasks such as captioning and report generation, as well as vision-centric tasks like classification and segmentation. Pretraining on MedTrinity-25M, our model achieves state-of-the-art performance on VQA-RAD and PathVQA, surpassing both multimodal LLMs and other representative SoTA approaches. This dataset can also be utilized to support large-scale pre-training of multimodal medical AI models, contributing to the development of future foundation models in the medical domain.

MedTrinity-25M: A Comprehensive Multimodal Dataset for Medical AI

Overview

The introduction of MedTrinity-25M marks a significant advancement in the availability and richness of medical datasets for AI research. The dataset comprises over 25 million images spanning 10 modalities and covering more than 65 diseases. Each image is paired with detailed multigranular annotations, including disease types, regions of interest (ROIs), modality information, region-specific descriptions, and inter-regional relationships. Unlike traditional datasets, which often rely on paired image-text data, MedTrinity-25M employs an automated pipeline that generates annotations from unpaired images, significantly scaling up the data.
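
To make the structure of a multigranular sample concrete, here is a minimal sketch of one image-ROI-description record. All field names and example values are invented for illustration and do not reflect the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ROIAnnotation:
    """One region of interest: bounding box, optional mask, local description."""
    bbox: Tuple[int, int, int, int]   # (x, y, width, height) in pixels (hypothetical convention)
    mask_path: Optional[str]          # path to a segmentation mask, if available
    description: str                  # local textual description of the ROI

@dataclass
class MedTrinitySample:
    """One image-ROI-description triplet with global metadata."""
    image_path: str
    modality: str                     # e.g. "X-ray", "MRI", "CT"
    disease: str
    global_caption: str               # global text: disease/lesion type, modality, regions
    rois: List[ROIAnnotation] = field(default_factory=list)

# Illustrative record; values are invented, not drawn from the dataset.
sample = MedTrinitySample(
    image_path="images/chest_0001.png",
    modality="X-ray",
    disease="pneumonia",
    global_caption="Frontal chest X-ray with an opacity in the right lower lobe.",
    rois=[ROIAnnotation(bbox=(320, 410, 96, 88), mask_path=None,
                        description="Patchy consolidation in the right lower lobe.")],
)
```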

Dataset Construction

Data Collection

MedTrinity-25M aggregates data from over 90 sources, including well-known repositories such as TCIA, Kaggle, Zenodo, and Synapse. This extensive collection encompasses various imaging modalities, including X-ray, MRI, CT, Ultrasound, and Histopathology, ensuring comprehensive coverage of medical imaging techniques. The data sources include images annotated with different levels of detail, from broad disease types to precisely marked segmentation masks and bounding boxes.

Annotation Strategy

  1. Metadata Integration: Basic image attributes, such as modality and disease types, are derived from existing dataset metadata. This metadata is used to generate "coarse captions," which provide essential contextual information for each image.
  2. ROI Locating: Various expert models (e.g., SAT, CheXmask, HoverNet) are leveraged to identify ROIs within the images. These models use text prompts or segmentation techniques to localize regions indicative of abnormalities.
  3. Medical Knowledge Retrieval: To enhance the quality of textual descriptions, external medical knowledge is integrated. This knowledge is retrieved from databases such as PubMed and StatPearls, ensuring that the annotations are infused with domain-specific expertise (a minimal retrieval sketch follows this list).
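
The retrieval step can be pictured as nearest-neighbor search over an embedded knowledge base. In the sketch below, the `encode` callable and the toy encoder are placeholders for a real biomedical text encoder (the paper's references include MedCPT and FAISS-style similarity search); this illustrates the idea rather than reproducing the paper's implementation.

```python
import numpy as np

def retrieve_knowledge(query, passages, encode, k=3):
    """Return the k knowledge passages most similar to the query.

    `encode` is a stand-in for a biomedical text encoder; any callable
    mapping a list of strings to a 2-D array of embeddings works here.
    """
    q = np.asarray(encode([query]), dtype=np.float32)[0]
    p = np.asarray(encode(passages), dtype=np.float32)
    q /= np.linalg.norm(q)
    p /= np.linalg.norm(p, axis=1, keepdims=True)
    scores = p @ q                     # cosine similarity against each passage
    top = np.argsort(-scores)[:k]
    return [(passages[i], float(scores[i])) for i in top]

# Toy bag-of-characters "encoder" so the sketch runs standalone.
toy_encode = lambda texts: [[t.lower().count(c) for c in "abcdefghij"] for t in texts]
print(retrieve_knowledge("glioma on MRI", ["glioma grading", "bone fracture"], toy_encode, k=1))
```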

Automated Annotation Pipeline

The automated annotation pipeline bypasses the need for paired image-text data, instead using domain-specific expert models and multimodal large language models (MLLMs). The pipeline consists of two major stages:

  1. Data Processing: This stage involves preprocessing the data to extract coarse captions, locate ROIs, and retrieve relevant medical knowledge. These elements provide a foundation upon which detailed annotations can be built.
  2. Generation of Multigranular Text Descriptions: Using the processed data, MLLMs (such as GPT-4V and the LLaVA-Med Captioner) are prompted to generate structured, multigranular text descriptions. These descriptions offer a layered understanding of the image, integrating global and local information (see the prompt-assembly sketch after this list).
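
A minimal sketch of how the second stage's prompt might be assembled from the first stage's outputs appears below. The function name, template wording, and inputs are hypothetical; the paper's actual prompts are not reproduced here.

```python
def build_annotation_prompt(coarse_caption, roi_boxes, knowledge_snippets):
    """Assemble a retrieval-augmented prompt for an MLLM such as GPT-4V.
    Template text is illustrative only."""
    roi_text = "; ".join(f"ROI {i}: bbox={box}" for i, box in enumerate(roi_boxes))
    knowledge = "\n".join(f"- {snippet}" for snippet in knowledge_snippets)
    return (
        f"Image context: {coarse_caption}\n"
        f"Regions of interest: {roi_text}\n"
        f"Relevant medical knowledge:\n{knowledge}\n"
        "Write a structured description covering modality, detected structures, "
        "ROI-level findings, lesion texture, and how each local finding "
        "relates to the whole image."
    )

prompt = build_annotation_prompt(
    "Frontal chest X-ray, suspected pneumonia.",
    [(320, 410, 96, 88)],
    ["Lobar consolidation on X-ray commonly indicates bacterial pneumonia."],
)
print(prompt)
```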

Evaluation and Quality

To ensure the generated annotations are of high quality and align well with human-generated annotations, the dataset was evaluated using GPT-4V. This evaluation focused on five key attributes: modality, structure detection, ROI analysis, lesion texture, and local-global relationships. The alignment scores indicate a high degree of agreement with human annotations, validating the dataset's reliability.
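As a rough illustration of this evaluation, the snippet below aggregates per-attribute alignment scores from a judge model. The 0-1 scale, score values, and aggregation method are assumptions made for the sketch, not the paper's exact protocol.

```python
from statistics import mean

ATTRIBUTES = ["modality", "structure detection", "ROI analysis",
              "lesion texture", "local-global relationships"]

def alignment_report(scores):
    """Average per-attribute alignment scores, e.g. ratings from a GPT-4V
    judge comparing generated annotations against human references."""
    return {attr: mean(scores[attr]) for attr in ATTRIBUTES}

example = {attr: [0.9, 0.8, 1.0] for attr in ATTRIBUTES}  # invented scores
print(alignment_report(example))
```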

Benchmarking with MedTrinity-25M

The efficacy of MedTrinity-25M was demonstrated through the training of LLaVA-Med++, a state-of-the-art model for medical visual question answering (VQA). Pretraining on MedTrinity-25M led to significant improvements in performance across multiple VQA benchmarks (VQA-RAD, SLAKE, and PathVQA). These results underscore the dataset's potential to enhance the capabilities of multimodal medical AI models.
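For intuition about how such benchmarks are scored, here is a simple exact-match accuracy function of the kind commonly used for closed-form medical VQA answers; the paper's precise evaluation protocol may differ.

```python
def vqa_accuracy(predictions, answers):
    """Exact-match accuracy after light normalization; a simplification of
    the scoring used on benchmarks like VQA-RAD, SLAKE, and PathVQA."""
    assert len(predictions) == len(answers)
    correct = sum(p.strip().lower() == a.strip().lower()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)

print(vqa_accuracy(["yes", "left lung"], ["Yes", "right lung"]))  # 0.5
```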

Practical Implications and Future Directions

By providing a large-scale, richly annotated dataset, MedTrinity-25M significantly lowers the barrier to training advanced AI models in medicine. Its comprehensive coverage of modalities and diseases makes it an invaluable resource for developing AI models that can perform a multitude of tasks, from diagnostic imaging to automated report generation. Future developments could include expanding the dataset with additional modalities and diseases and further refining the annotation pipeline to incorporate evolving AI technologies and medical knowledge bases.

In summary, MedTrinity-25M addresses the critical need for large, detailed multimodal datasets in medical AI. Its automated pipeline for annotation, combined with the dataset's breadth and depth, positions it as a cornerstone resource for the next generation of medical AI research and applications.

Authors (11)
  1. Yunfei Xie
  2. Ce Zhou
  3. Lang Gao
  4. Juncheng Wu
  5. Xianhang Li
  6. Hong-Yu Zhou
  7. Sheng Liu
  8. Lei Xing
  9. James Zou
  10. Cihang Xie
  11. Yuyin Zhou