
Merlin: A Vision Language Foundation Model for 3D Computed Tomography (2406.06512v1)

Published 10 Jun 2024 in cs.CV and cs.AI

Abstract: Over 85 million computed tomography (CT) scans are performed annually in the US, of which approximately one quarter focus on the abdomen. Given the current radiologist shortage, there is a large impetus to use artificial intelligence to alleviate the burden of interpreting these complex imaging studies. Prior state-of-the-art approaches for automated medical image interpretation leverage vision-language models (VLMs). However, current medical VLMs are generally limited to 2D images and short reports, and do not leverage electronic health record (EHR) data for supervision. We introduce Merlin - a 3D VLM that we train using paired CT scans (6+ million images from 15,331 CTs), EHR diagnosis codes (1.8+ million codes), and radiology reports (6+ million tokens). We evaluate Merlin on 6 task types and 752 individual tasks. The non-adapted (off-the-shelf) tasks include zero-shot findings classification (31 findings), phenotype classification (692 phenotypes), and zero-shot cross-modal retrieval (image to findings and image to impressions), while model adapted tasks include 5-year disease prediction (6 diseases), radiology report generation, and 3D semantic segmentation (20 organs). We perform internal validation on a test set of 5,137 CTs, and external validation on 7,000 clinical CTs and on two public CT datasets (VerSe, TotalSegmentator). Beyond these clinically-relevant evaluations, we assess the efficacy of various network architectures and training strategies to depict that Merlin has favorable performance to existing task-specific baselines. We derive data scaling laws to empirically assess training data needs for requisite downstream task performance. Furthermore, unlike conventional VLMs that require hundreds of GPUs for training, we perform all training on a single GPU.


Summary

  • The paper presents Merlin, a 3D vision-language model that integrates CT scans with structured EHR and radiology reports to overcome traditional 2D constraints.
  • It demonstrates computational efficiency by training on a single GPU using over 15,000 CT scans and millions of clinical data points.
  • Strong performance is achieved across tasks with a zero-shot findings F1 score of 0.741 and multi-disease prediction AUROC up to 0.757, outperforming existing baselines.

Overview of "Merlin: A Vision Language Foundation Model for 3D Computed Tomography"

The paper "Merlin: A Vision Language Foundation Model for 3D Computed Tomography" by Blankemeier et al., introduces Merlin, a vision-LLM specifically designed for interpreting 3D computed tomography (CT) scans, with a primary focus on abdominal CTs. This manuscript addresses significant limitations in current vision-LLMs (VLMs) applied to medical imaging, such as their restriction to 2D images and short text reports, and the lack of integration with electronic health record (EHR) data for supervision.

Key Contributions

  1. Training Strategy and Computational Efficiency:
    • The authors developed Merlin, a 3D vision-language foundation model that leverages both structured EHR data and unstructured radiology reports for supervision. Merlin integrates high-quality clinical data covering 15,331 CTs, 1.8+ million EHR codes, and 6+ million tokens from radiology reports.
    • Notably, training is performed on a single GPU, demonstrating that foundation-model training can be accessible even to institutions with limited computational resources.
  2. Comprehensive Evaluation:
    • Merlin was evaluated on six task types spanning 752 individual tasks. The non-adapted (off-the-shelf) tasks include zero-shot findings classification, phenotype classification, and zero-shot cross-modal retrieval (a CLIP-style zero-shot sketch follows this list), while the model-adapted tasks include 5-year chronic disease prediction, radiology report generation, and 3D semantic segmentation.
  3. Strong Performance Metrics:
    • Zero-shot findings classification: Achieved an F1 score of 0.741 internally and 0.647 externally, outperforming established baselines such as OpenCLIP and BiomedCLIP.
    • Phenotype classification: Demonstrated a macro-AUROC of 0.812 across 692 phenotypes.
    • Multi-disease prediction: Demonstrated an average AUROC of 0.757 with 100% of training data and 0.708 with 10% of training data, significantly outperforming ImageNet pretrained models.
    • Radiology report generation: Outperformed RadFM across multiple metrics, including RadGraph-F1, BERT Score, ROUGE-2, and BLEU score.
  4. Innovative Techniques:
    • Used I3D initialization to inflate 2D ImageNet-pretrained weights into the 3D model (see the inflation sketch after this list).
    • Implemented report splitting to segment radiology findings into anatomical sections for more precise supervision.
    • Applied multi-task learning over EHR codes and radiology reports, which outperformed staged training.
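
As a rough illustration of the zero-shot findings classification evaluated above, the sketch below scores a CT embedding against "present"/"absent" text prompts for each finding, in the CLIP style. The encoder interfaces, prompt templates, and decision threshold here are hypothetical stand-ins, not Merlin's actual API.

```python
import torch
import torch.nn.functional as F

# Hypothetical encoders standing in for Merlin's image and text towers.
# image_encoder: maps a 3D CT volume of shape (1, 1, D, H, W) -> (1, embed_dim)
# text_encoder:  maps a list of prompt strings -> (num_prompts, embed_dim)

@torch.no_grad()
def zero_shot_findings(ct_volume, findings, image_encoder, text_encoder, threshold=0.0):
    """Score each finding by contrasting 'present' vs. 'absent' prompts."""
    img = F.normalize(image_encoder(ct_volume), dim=-1)          # (1, d)

    results = {}
    for finding in findings:
        prompts = [f"{finding} is present.", f"No {finding}."]   # illustrative templates
        txt = F.normalize(text_encoder(prompts), dim=-1)         # (2, d)
        logits = img @ txt.T                                      # cosine similarities, (1, 2)
        probs = logits.softmax(dim=-1)                            # softmax over present/absent
        results[finding] = bool(probs[0, 0] - probs[0, 1] > threshold)
    return results

# Example usage (shapes only; real inputs would be preprocessed CT volumes):
# preds = zero_shot_findings(ct, ["ascites", "splenomegaly"], merlin_image_enc, merlin_text_enc)
```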
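
The I3D initialization listed under the innovative techniques is commonly implemented by tiling each 2D convolution kernel along a new depth axis and rescaling so that a depth-constant input produces the same response as the original 2D filter. A minimal sketch in PyTorch, with illustrative layer choices rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

def inflate_conv2d_to_3d(conv2d: nn.Conv2d, depth: int) -> nn.Conv3d:
    """Inflate a 2D conv into a 3D conv by tiling its kernel along depth (I3D-style)."""
    conv3d = nn.Conv3d(
        conv2d.in_channels,
        conv2d.out_channels,
        kernel_size=(depth, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(depth // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # Repeat the 2D weights `depth` times and divide by `depth` so the response
        # to an axially constant input matches the original 2D filter.
        w2d = conv2d.weight                                   # (out, in, kH, kW)
        conv3d.weight.copy_(w2d.unsqueeze(2).repeat(1, 1, depth, 1, 1) / depth)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# Example: inflate the stem of an ImageNet-pretrained ResNet (torchvision assumed).
# from torchvision.models import resnet50
# stem2d = resnet50(weights="IMAGENET1K_V2").conv1
# stem3d = inflate_conv2d_to_3d(stem2d, depth=7)
```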

Implications and Future Directions

The implications of this research are multifaceted:

  • Clinical Utility: Merlin can significantly aid radiologists by flagging missed findings, speeding up interpretation workflows, and assisting in generating structured radiology reports.
  • Future Research: The data scaling laws presented can guide future studies on the scale of pretraining data required to reach a desired level of downstream performance (an illustrative curve fit follows this list). This facilitates extending Merlin to other anatomies and modalities, potentially broadening its application scope.
  • Training Paradigm: The paper underscores the benefits of integrating structured and unstructured clinical data, advocating for multi-task learning techniques to maximize model performance. The computational efficiency highlights a path to democratize AI training in healthcare, making high-performance models accessible to institutions with limited computational resources.
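
To make the scaling-law point concrete, such curves are typically obtained by measuring downstream performance at several pretraining-set sizes and fitting a saturating power law. The snippet below fits made-up numbers purely for illustration; the functional form, values, and extrapolation target are assumptions, not the paper's results.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical measurements: downstream AUROC at several pretraining-set sizes.
n_scans = np.array([1_000, 2_000, 4_000, 8_000, 15_000], dtype=float)
auroc   = np.array([0.66, 0.69, 0.72, 0.74, 0.75])   # made-up values for illustration

def power_law(n, a, b, c):
    """Saturating power law: performance approaches asymptote `a` as data grows."""
    return a - b * n ** (-c)

params, _ = curve_fit(power_law, n_scans, auroc, p0=(0.8, 1.0, 0.3), maxfev=10_000)
a, b, c = params
print(f"fitted asymptote a={a:.3f}, decay exponent c={c:.3f}")

# Extrapolate: how much data would this (toy) curve need to reach AUROC 0.77?
target = 0.77
if target < a:
    n_needed = (b / (a - target)) ** (1 / c)
    print(f"~{n_needed:,.0f} scans to reach AUROC {target} under this fit")
```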

Conclusion

Blankemeier et al.'s work on Merlin presents a comprehensive approach to enhancing the interpretation of abdominal CT scans by integrating vision-language models with clinical data. By demonstrating strong results across a broad range of tasks and evaluating multiple model architectures and training strategies, the paper lays the groundwork for future advancements in medical imaging AI. The release of the trained models, code, and dataset following PHI removal will further catalyze research and development in this crucial area of healthcare technology.
