M3D: Advancing 3D Medical Image Analysis with Multi-Modal Large Language Models (2404.00578v1)
Abstract: Medical image analysis is essential to clinical diagnosis and treatment, and it is increasingly supported by multi-modal large language models (MLLMs). However, previous research has primarily focused on 2D medical images, leaving 3D images under-explored despite their richer spatial information. This paper aims to advance 3D medical image analysis with MLLMs. To this end, we present a large-scale 3D multi-modal medical dataset, M3D-Data, comprising 120K image-text pairs and 662K instruction-response pairs specifically tailored for various 3D medical tasks, such as image-text retrieval, report generation, visual question answering, positioning, and segmentation. Additionally, we propose M3D-LaMed, a versatile multi-modal LLM for 3D medical image analysis. Furthermore, we introduce a new 3D multi-modal medical benchmark, M3D-Bench, which facilitates automatic evaluation across eight tasks. Through comprehensive evaluation, our method proves to be a robust model for 3D medical image analysis, outperforming existing solutions. All code, data, and models are publicly available at: https://github.com/BAAI-DCAI/M3D.
- Fan Bai
- Yuxin Du
- Tiejun Huang
- Max Q.-H. Meng
- Bo Zhao