
Advancing Multimodal Medical Capabilities of Gemini (2405.03162v1)

Published 6 May 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: Many clinical tasks require an understanding of specialized data, such as medical images and genomics, which is not typically found in general-purpose large multimodal models. Building upon Gemini's multimodal models, we develop several models within the new Med-Gemini family that inherit core capabilities of Gemini and are optimized for medical use via fine-tuning with 2D and 3D radiology, histopathology, ophthalmology, dermatology and genomic data. Med-Gemini-2D sets a new standard for AI-based chest X-ray (CXR) report generation based on expert evaluation, exceeding previous best results across two separate datasets by an absolute margin of 1% and 12%, where 57% and 96% of AI reports on normal cases, and 43% and 65% on abnormal cases, are evaluated as "equivalent or better" than the original radiologists' reports. We demonstrate the first ever large multimodal model-based report generation for 3D computed tomography (CT) volumes using Med-Gemini-3D, with 53% of AI reports considered clinically acceptable, although additional research is needed to meet expert radiologist reporting quality. Beyond report generation, Med-Gemini-2D surpasses the previous best performance in CXR visual question answering (VQA) and performs well in CXR classification and radiology VQA, exceeding SoTA or baselines on 17 of 20 tasks. In histopathology, ophthalmology, and dermatology image classification, Med-Gemini-2D surpasses baselines across 18 out of 20 tasks and approaches task-specific model performance. Beyond imaging, Med-Gemini-Polygenic outperforms the standard linear polygenic risk score-based approach for disease risk prediction and generalizes to genetically correlated diseases for which it has never been trained. Although further development and evaluation are necessary in the safety-critical medical domain, our results highlight the potential of Med-Gemini across a wide range of medical tasks.

Exploring the Potential of Multimodal AI in Medicine with Med-Gemini Models

Introduction

The integration of AI into medicine has steadily moved from theory to application, changing how medical data is interpreted and used. Multimodal AI systems, which can process diverse data types including medical images and genomic information, begin to reflect the multifaceted nature of human health.

Unlocking Multimodal Capabilities in Medical AI

Advanced Multimodal Models: Recent large multimodal models (LMMs) such as Gemini have demonstrated strong capabilities in handling complex data spanning text, images, and more. This technological leap has significant implications for personalized medicine, where multifaceted data is paramount.

Med-Gemini Family Introduction: Building on the foundation provided by Gemini models, the Med-Gemini family was specifically tailored for medical applications. By integrating varied medical data types—radiology, pathology, genomics, and beyond—these models aim to approach the complexity of clinical diagnostics and patient treatment planning.

Deep Dive into Med-Gemini's Performance

Versatile Medical Task Handling: Med-Gemini models have shown promise across several key areas in healthcare AI, from generating medical reports based on imaging to answering complex clinical questions regarding patient data visuals.

  • Radiology Reports: Med-Gemini generates interpretative reports from both 2D and 3D medical imaging, such as chest X-rays and head/neck CT volumes. These capabilities go beyond producing fluent text to identifying and summarizing critical medical findings.
  • Disease Prediction Using Genetic Data: In genomics, Med-Gemini applies a novel approach that translates polygenic risk information into a visual format the model can interpret, predicting disease risk with notable accuracy.
  • Diagnostic Assistance Through QA: In visual question answering (VQA) tasks, Med-Gemini handles queries about medical imagery, supporting healthcare professionals with immediate insights into patient data.
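The genomics bullet above describes encoding polygenic risk information visually so a multimodal model can consume it. The paper summary does not specify the exact encoding, so the sketch below is a purely illustrative assumption: per-trait polygenic risk score (PRS) percentiles rendered as grayscale bar strips with NumPy. The function name `prs_to_image` and its bar layout are hypothetical, not the paper's actual scheme.

```python
import numpy as np

def prs_to_image(percentiles, height=8, width=100):
    """Render per-trait PRS percentiles (0-100) as horizontal grayscale bars.

    Each trait occupies `height` rows; columns up to its percentile are
    filled with 1.0, the rest stay 0.0. Returns a float32 array in [0, 1],
    a single-channel image a vision encoder could ingest.
    """
    rows = []
    for p in percentiles:
        bar = np.zeros((height, width), dtype=np.float32)
        # Clamp to [0, 100] before scaling to a column count.
        fill = int(round(max(0.0, min(100.0, p)) / 100.0 * width))
        bar[:, :fill] = 1.0
        rows.append(bar)
    # Stack one bar per trait into a single image.
    return np.concatenate(rows, axis=0)

img = prs_to_image([12.5, 80.0, 50.0])
print(img.shape)  # (24, 100): three traits, 8 rows each
```

The appeal of such an encoding is that it reuses the model's existing image pathway rather than requiring a new input modality; the actual Med-Gemini-Polygenic representation may differ.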

Implications and Future Directions

Broadening the AI Application in Medicine: The results indicate that Med-Gemini can serve as a robust auxiliary tool for various medical specialists, from radiologists needing quick report generation to geneticists assessing disease susceptibility.

Future Enhancements: Despite its current capabilities, there are still several areas requiring improvement and careful consideration before full clinical deployment. These include validating AI performance in real-world settings and ensuring the models generalize well across different patient demographics and conditions.

Clinical Integration and Safety Evaluations: Before these models can be fully integrated into clinical workflows, extensive testing and validation are needed to address any potential safety issues, ensuring that the AI's recommendations are reliable and enhance patient care.

Conclusion

The introduction of Med-Gemini signifies a crucial step forward in applying AI within the medical field. By efficiently processing and interpreting complex multimodal medical data, these models hint at a future where AI not only supports but enhances clinical decision-making processes. As development continues, the focus will remain on refining these models to ensure they meet the stringent requirements of medical application, aiming for a future where AI and healthcare professionals work hand in hand to improve patient outcomes.

Authors (47)
  1. Lin Yang
  2. Shawn Xu
  3. Andrew Sellergren
  4. Timo Kohlberger
  5. Yuchen Zhou
  6. Ira Ktena
  7. Atilla Kiraly
  8. Faruk Ahmed
  9. Farhad Hormozdiari
  10. Tiam Jaroensri
  11. Eric Wang
  12. Ellery Wulczyn
  13. Fayaz Jamil
  14. Theo Guidroz
  15. Chuck Lau
  16. Siyuan Qiao
  17. Yun Liu
  18. Akshay Goel
  19. Kendall Park
  20. Arnav Agharwal
Citations (34)