
MedTrinity-25M: A Large-scale Multimodal Dataset with Multigranular Annotations for Medicine (2408.02900v1)

Published 6 Aug 2024 in cs.CV

Abstract: This paper introduces MedTrinity-25M, a comprehensive, large-scale multimodal dataset for medicine, covering over 25 million images across 10 modalities, with multigranular annotations for more than 65 diseases. These enriched annotations encompass both global textual information, such as disease/lesion type, modality, region-specific descriptions, and inter-regional relationships, and detailed local annotations for regions of interest (ROIs), including bounding boxes and segmentation masks. Unlike existing approaches, which are limited by the availability of image-text pairs, we have developed the first automated pipeline that scales up multimodal data by generating multigranular visual and textual annotations (in the form of image-ROI-description triplets) without the need for any paired text descriptions. Specifically, data from over 90 different sources have been collected, preprocessed, and grounded using domain-specific expert models to identify ROIs related to abnormal regions. We then build a comprehensive knowledge base and prompt multimodal LLMs to perform retrieval-augmented generation with the identified ROIs as guidance, resulting in multigranular textual descriptions. Compared to existing datasets, MedTrinity-25M provides the most enriched annotations, supporting a comprehensive range of multimodal tasks such as captioning and report generation, as well as vision-centric tasks like classification and segmentation. Pretraining on MedTrinity-25M, our model achieves state-of-the-art performance on VQA-RAD and PathVQA, surpassing both multimodal LLMs and other representative SoTA approaches. This dataset can also be utilized to support large-scale pre-training of multimodal medical AI models, contributing to the development of future foundation models in the medical domain.

MedTrinity-25M: A Comprehensive Multimodal Dataset for Medical AI

Overview

The introduction of MedTrinity-25M marks a significant advancement in the availability and richness of medical datasets for AI research. The dataset comprises over 25 million images spanning 10 modalities and covering more than 65 diseases. Each image is paired with detailed multigranular annotations, including disease types, regions of interest (ROIs), modality information, region-specific descriptions, and inter-regional relationships. Unlike traditional datasets, which often rely on paired image-text data, MedTrinity-25M employs an automated pipeline that generates annotations from unpaired images, significantly scaling up the data.
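
To make the structure of a multigranular sample concrete, here is a minimal sketch of one image-ROI-description record. All field names and example values are invented for illustration and do not reflect the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ROIAnnotation:
    """One region of interest: bounding box, optional mask, local description."""
    bbox: Tuple[int, int, int, int]   # (x, y, width, height) in pixels (hypothetical convention)
    mask_path: Optional[str]          # path to a segmentation mask, if available
    description: str                  # local textual description of the ROI

@dataclass
class MedTrinitySample:
    """One image-ROI-description triplet with global metadata."""
    image_path: str
    modality: str                     # e.g. "X-ray", "MRI", "CT"
    disease: str
    global_caption: str               # global text: disease/lesion type, modality, regions
    rois: List[ROIAnnotation] = field(default_factory=list)

# Illustrative record; values are invented, not drawn from the dataset.
sample = MedTrinitySample(
    image_path="images/chest_0001.png",
    modality="X-ray",
    disease="pneumonia",
    global_caption="Frontal chest X-ray with an opacity in the right lower lobe.",
    rois=[ROIAnnotation(bbox=(320, 410, 96, 88), mask_path=None,
                        description="Patchy consolidation in the right lower lobe.")],
)
```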

Dataset Construction

Data Collection

MedTrinity-25M aggregates data from over 90 sources, including well-known repositories such as TCIA, Kaggle, Zenodo, and Synapse. This extensive collection encompasses various imaging modalities, including X-ray, MRI, CT, Ultrasound, and Histopathology, ensuring comprehensive coverage of medical imaging techniques. The data sources include images annotated with different levels of detail, from broad disease types to precisely marked segmentation masks and bounding boxes.

Annotation Strategy

  1. Metadata Integration: Basic image attributes, such as modality and disease types, are derived from existing dataset metadata. This metadata is used to generate "coarse captions," which provide essential contextual information for each image.
  2. ROI Locating: Various expert models (e.g., SAT, CheXmask, HoverNet) are leveraged to identify ROIs within the images. These models use text prompts or segmentation techniques to localize regions indicative of abnormalities.
  3. Medical Knowledge Retrieval: To enhance the quality of textual descriptions, external medical knowledge is integrated. This knowledge is retrieved from databases such as PubMed and StatPearls, ensuring that the annotations are infused with domain-specific expertise (a minimal retrieval sketch follows this list).
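
The retrieval step can be pictured as nearest-neighbor search over an embedded knowledge base. In the sketch below, the `encode` callable and the toy encoder are placeholders for a real biomedical text encoder (the paper's references include MedCPT and FAISS-style similarity search); this illustrates the idea rather than reproducing the paper's implementation.

```python
import numpy as np

def retrieve_knowledge(query, passages, encode, k=3):
    """Return the k knowledge passages most similar to the query.

    `encode` is a stand-in for a biomedical text encoder; any callable
    mapping a list of strings to a 2-D array of embeddings works here.
    """
    q = np.asarray(encode([query]), dtype=np.float32)[0]
    p = np.asarray(encode(passages), dtype=np.float32)
    q /= np.linalg.norm(q)
    p /= np.linalg.norm(p, axis=1, keepdims=True)
    scores = p @ q                     # cosine similarity against each passage
    top = np.argsort(-scores)[:k]
    return [(passages[i], float(scores[i])) for i in top]

# Toy bag-of-characters "encoder" so the sketch runs standalone.
toy_encode = lambda texts: [[t.lower().count(c) for c in "abcdefghij"] for t in texts]
print(retrieve_knowledge("glioma on MRI", ["glioma grading", "bone fracture"], toy_encode, k=1))
```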

Automated Annotation Pipeline

The automated annotation pipeline bypasses the need for paired image-text data, instead using domain-specific expert models and multimodal large language models (MLLMs). The pipeline consists of two major stages:

  1. Data Processing: This stage involves preprocessing the data to extract coarse captions, locate ROIs, and retrieve relevant medical knowledge. These elements provide a foundation upon which detailed annotations can be built.
  2. Generation of Multigranular Text Descriptions: Using the processed data, MLLMs (such as GPT-4V and the LLaVA-Med Captioner) are prompted to generate structured, multigranular text descriptions. These descriptions offer a layered understanding of the image, integrating global and local information (see the prompt-assembly sketch after this list).
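
A minimal sketch of how the second stage's prompt might be assembled from the first stage's outputs appears below. The function name, template wording, and inputs are hypothetical; the paper's actual prompts are not reproduced here.

```python
def build_annotation_prompt(coarse_caption, roi_boxes, knowledge_snippets):
    """Assemble a retrieval-augmented prompt for an MLLM such as GPT-4V.
    Template text is illustrative only."""
    roi_text = "; ".join(f"ROI {i}: bbox={box}" for i, box in enumerate(roi_boxes))
    knowledge = "\n".join(f"- {snippet}" for snippet in knowledge_snippets)
    return (
        f"Image context: {coarse_caption}\n"
        f"Regions of interest: {roi_text}\n"
        f"Relevant medical knowledge:\n{knowledge}\n"
        "Write a structured description covering modality, detected structures, "
        "ROI-level findings, lesion texture, and how each local finding "
        "relates to the whole image."
    )

prompt = build_annotation_prompt(
    "Frontal chest X-ray, suspected pneumonia.",
    [(320, 410, 96, 88)],
    ["Lobar consolidation on X-ray commonly indicates bacterial pneumonia."],
)
print(prompt)
```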

Evaluation and Quality

To ensure the generated annotations are of high quality and align well with human-generated annotations, the dataset was evaluated using GPT-4V. This evaluation focused on five key attributes: modality, structure detection, ROI analysis, lesion texture, and local-global relationships. The alignment scores indicate a high degree of agreement with human annotations, validating the dataset's reliability.
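As a rough illustration of this evaluation, the snippet below aggregates per-attribute alignment scores from a judge model. The 0-1 scale, score values, and aggregation method are assumptions made for the sketch, not the paper's exact protocol.

```python
from statistics import mean

ATTRIBUTES = ["modality", "structure detection", "ROI analysis",
              "lesion texture", "local-global relationships"]

def alignment_report(scores):
    """Average per-attribute alignment scores, e.g. ratings from a GPT-4V
    judge comparing generated annotations against human references."""
    return {attr: mean(scores[attr]) for attr in ATTRIBUTES}

example = {attr: [0.9, 0.8, 1.0] for attr in ATTRIBUTES}  # invented scores
print(alignment_report(example))
```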

Benchmarking with MedTrinity-25M

The efficacy of MedTrinity-25M was demonstrated through the training of LLaVA-Med++, a state-of-the-art model for medical visual question answering (VQA). Pretraining on MedTrinity-25M led to significant improvements in performance across multiple VQA benchmarks (VQA-RAD, SLAKE, and PathVQA). These results underscore the dataset's potential to enhance the capabilities of multimodal medical AI models.
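For intuition about how such benchmarks are scored, here is a simple exact-match accuracy function of the kind commonly used for closed-form medical VQA answers; the paper's precise evaluation protocol may differ.

```python
def vqa_accuracy(predictions, answers):
    """Exact-match accuracy after light normalization; a simplification of
    the scoring used on benchmarks like VQA-RAD, SLAKE, and PathVQA."""
    assert len(predictions) == len(answers)
    correct = sum(p.strip().lower() == a.strip().lower()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)

print(vqa_accuracy(["yes", "left lung"], ["Yes", "right lung"]))  # 0.5
```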

Practical Implications and Future Directions

By providing a large-scale, richly annotated dataset, MedTrinity-25M significantly lowers the barrier to training advanced AI models in medicine. Its comprehensive coverage of modalities and diseases makes it an invaluable resource for developing AI models that can perform a multitude of tasks, from diagnostic imaging to automated report generation. Future developments could include expanding the dataset with additional modalities and diseases and further refining the annotation pipeline to incorporate evolving AI technologies and medical knowledge bases.

In summary, MedTrinity-25M addresses the critical need for large, detailed multimodal datasets in medical AI. Its automated pipeline for annotation, combined with the dataset's breadth and depth, positions it as a cornerstone resource for the next generation of medical AI research and applications.

Authors (11)
  1. Yunfei Xie
  2. Ce Zhou
  3. Lang Gao
  4. Juncheng Wu
  5. Xianhang Li
  6. Hong-Yu Zhou
  7. Sheng Liu
  8. Lei Xing
  9. James Zou
  10. Cihang Xie
  11. Yuyin Zhou