Interpretable Bilingual Multimodal Large Language Model for Diverse Biomedical Tasks (2410.18387v4)

Published 24 Oct 2024 in cs.CV

Abstract: Several medical Multimodal Large Language Models (MLLMs) have been developed to address tasks involving visual images with textual instructions across various medical modalities, achieving impressive results. Most current medical generalist models are region-agnostic, treating the entire image as a holistic representation. However, they struggle to identify which specific regions they are focusing on when generating a sentence. To mimic the behavior of doctors, who typically begin by reviewing the entire image before concentrating on specific regions for a thorough evaluation, we aim to enhance the capability of medical MLLMs to understand anatomical regions within entire medical scans. To achieve this, we first formulate Region-Centric tasks and construct a large-scale dataset, MedRegInstruct, to incorporate regional information into training. Combining our collected dataset with other medical multimodal corpora for training, we propose a Region-Aware medical MLLM, MedRegA, which is the first bilingual generalist medical AI system to simultaneously handle image-level and region-level medical vision-language tasks across a broad range of modalities. Our MedRegA not only enables three region-centric tasks, but also achieves the best performance for visual question answering, report generation, and medical image classification over 8 modalities, showcasing significant versatility. Experiments demonstrate that our model can not only deliver strong performance across various medical vision-language tasks in bilingual settings, but also recognize and detect structures in multimodal medical scans, boosting the interpretability and user interactivity of medical MLLMs. Our project page is https://medrega.github.io.
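The abstract describes region-level tasks in which an instruction refers to a specific anatomical region of a scan. A common way grounded MLLMs achieve this is by serializing bounding-box coordinates into the text prompt. The sketch below is a hypothetical illustration of that general convention; the exact token format, coordinate scale, and helper names used by MedRegA are assumptions, not taken from the paper.

```python
# Hypothetical sketch of a region-centric instruction for a grounded MLLM.
# The <box>...</box> serialization and 0-1000 normalization are common
# conventions in grounded vision-language models, not MedRegA's documented
# format.

def normalize_box(box, width, height, scale=1000):
    """Map pixel coordinates (x1, y1, x2, y2) into a 0..scale integer
    range so prompts are independent of the original image resolution."""
    x1, y1, x2, y2 = box
    return (
        round(x1 / width * scale),
        round(y1 / height * scale),
        round(x2 / width * scale),
        round(y2 / height * scale),
    )

def region_instruction(question, box, width, height):
    """Embed a normalized region reference into a textual instruction."""
    nx1, ny1, nx2, ny2 = normalize_box(box, width, height)
    return f"{question} <box>({nx1},{ny1}),({nx2},{ny2})</box>"

prompt = region_instruction(
    "Describe the finding in this region of the chest X-ray.",
    box=(120, 200, 360, 480),
    width=512,
    height=512,
)
print(prompt)
```

Because the region is expressed in plain text, the same mechanism works in both directions: the model can consume a box in the instruction (region-to-text) or emit one in its answer (text-to-region detection), which is what makes region-level outputs inspectable.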
