Adapting Pretrained Vision-Language Foundational Models to Medical Imaging Domains
The adaptation of pretrained vision-language foundational models to specific challenges in medical imaging is an active frontier of research. The paper explores Stable Diffusion, a latent diffusion model trained primarily on natural images and their captions, and evaluates how it can be adapted to generate synthetic medical images, particularly chest X-rays (CXRs).
The central challenge addressed in the paper is the domain shift between natural and medical images, which limits the fidelity and clinical applicability of image generation models such as Stable Diffusion. The authors propose and assess several strategies for adapting individual components of the model to improve its capacity to generate clinically accurate medical images.
Methodology and Experiments
Central to this research are the components of the Stable Diffusion pipeline: the variational autoencoder (VAE), the U-Net denoiser, and the CLIP text encoder. The authors evaluate the representational capacity of each component in the medical domain:
- Variational Autoencoder (VAE): The evaluation tests the VAE's ability to reconstruct CXRs without fine-tuning and concludes that it preserves clinically significant features, as evidenced by RMSE, PSNR, and SSIM scores (a minimal reconstruction check is sketched after this list).
- Text Encoder: The paper compares several in-domain text encoders (e.g., PubMedBERT, ClinicalBERT, and SapBERT) against CLIP for encoding medical text prompts. Notably, CLIP, despite being trained on general-domain captions, produces competitive embeddings that represent medical concepts sufficiently well.
- Textual Projection and Fine-Tuning: Experiments on textual projection, which attempt to map the outputs of in-domain encoders into the CLIP embedding space, showed limited success and often degraded image quality. Fine-tuning approaches, namely textual inversion and U-Net fine-tuning, proved more promising; U-Net fine-tuning in particular substantially improved the fidelity of synthetic CXRs (see the training-loop sketch after this list).
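To make the VAE evaluation concrete, the following is a minimal reconstruction check: an image is encoded and decoded by the pretrained Stable Diffusion VAE and scored with RMSE, PSNR, and SSIM. The checkpoint name and the local file path are illustrative assumptions, not details taken from the paper.

```python
# Sketch: reconstruct a chest X-ray through the pretrained Stable Diffusion VAE
# and score the reconstruction with RMSE, PSNR, and SSIM.
# Assumptions: the "CompVis/stable-diffusion-v1-4" checkpoint and the local file
# "cxr.png" are placeholders, not the paper's exact setup.
import numpy as np
import torch
from PIL import Image
from diffusers import AutoencoderKL
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

device = "cuda" if torch.cuda.is_available() else "cpu"
vae = AutoencoderKL.from_pretrained(
    "CompVis/stable-diffusion-v1-4", subfolder="vae"
).to(device).eval()

# Load a CXR, resize to the model's native resolution, and scale to [-1, 1].
img = Image.open("cxr.png").convert("RGB").resize((512, 512))
x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0
x = x.permute(2, 0, 1).unsqueeze(0).to(device)

with torch.no_grad():
    latents = vae.encode(x).latent_dist.sample()   # compressed latent representation
    recon = vae.decode(latents).sample             # back to pixel space

# Convert both images to [0, 1] numpy arrays for metric computation.
orig = ((x + 1) / 2).clamp(0, 1).squeeze().permute(1, 2, 0).cpu().numpy()
rec = ((recon + 1) / 2).clamp(0, 1).squeeze().permute(1, 2, 0).cpu().numpy()

rmse = float(np.sqrt(np.mean((orig - rec) ** 2)))
psnr = peak_signal_noise_ratio(orig, rec, data_range=1.0)
ssim = structural_similarity(orig, rec, channel_axis=-1, data_range=1.0)
print(f"RMSE={rmse:.4f}  PSNR={psnr:.2f} dB  SSIM={ssim:.4f}")
```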
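The U-Net fine-tuning result can likewise be illustrated with a sketch of a single training step in the standard text-to-image recipe: latents from the frozen VAE are noised at a random timestep, and the U-Net is trained to predict that noise, conditioned on frozen CLIP text embeddings of a report impression. The checkpoint name, learning rate, and batch interface are assumptions; this is the common diffusers fine-tuning pattern, not the paper's exact training code.

```python
# Sketch: one U-Net fine-tuning step for text-conditioned CXR generation,
# keeping the VAE and CLIP text encoder frozen. Checkpoint name, learning rate,
# and the (images, reports) batch format are illustrative assumptions.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "CompVis/stable-diffusion-v1-4"
device = "cuda"

tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder").to(device)
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae").to(device)
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet").to(device)
scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

# Freeze everything except the U-Net.
vae.requires_grad_(False)
text_encoder.requires_grad_(False)
unet.train()
optimizer = torch.optim.AdamW(unet.parameters(), lr=1e-5)

def training_step(images, reports):
    """images: (B, 3, 512, 512) tensor in [-1, 1]; reports: list of impression strings."""
    # Encode images into latent space (frozen VAE) and apply the SD scaling factor.
    with torch.no_grad():
        latents = vae.encode(images.to(device)).latent_dist.sample() * 0.18215
        tokens = tokenizer(reports, padding="max_length", truncation=True,
                           max_length=tokenizer.model_max_length, return_tensors="pt")
        text_emb = text_encoder(tokens.input_ids.to(device))[0]

    # Add noise at a random timestep and train the U-Net to predict that noise.
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, scheduler.config.num_train_timesteps,
                              (latents.shape[0],), device=device)
    noisy_latents = scheduler.add_noise(latents, noise, timesteps)
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states=text_emb).sample

    loss = F.mse_loss(noise_pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because only the U-Net parameters receive gradients, the frozen VAE and text encoder retain the general-domain representations that the paper found already transfer reasonably well.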
Evaluation through Synthetic Image Classification
The practical value of the synthetic images was assessed by a thoracic radiologist and with a DenseNet-121 classifier (a minimal evaluation sketch follows). U-Net fine-tuning, especially when initialized from the pretrained prior, notably improved AUC scores for detecting pleural effusion in synthetic images, highlighting its efficacy in generating clinically useful outputs.
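A minimal sketch of the classifier-based evaluation is shown below: a DenseNet-121 with a single-logit head scores synthetic images for pleural effusion, and AUC is computed from the resulting probabilities. The commented-out weight path and the random placeholder data are hypothetical; they only stand in for a classifier trained on real CXRs and a dataset of labeled synthetic images.

```python
# Sketch: scoring synthetic CXRs for pleural effusion with DenseNet-121 and
# reporting AUC. The weight path and placeholder data are hypothetical.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models
from sklearn.metrics import roc_auc_score

device = "cuda" if torch.cuda.is_available() else "cpu"

# DenseNet-121 with a single-logit head for the binary "pleural effusion" label.
model = models.densenet121(weights=None)
model.classifier = nn.Linear(model.classifier.in_features, 1)
# In practice, weights trained on real CXRs would be loaded here, e.g.:
# model.load_state_dict(torch.load("densenet121_effusion.pt"))  # hypothetical path
model.to(device).eval()

# Placeholder loader standing in for a dataset of labeled synthetic CXRs.
images = torch.randn(16, 3, 224, 224)
labels = torch.randint(0, 2, (16,))
loader = DataLoader(TensorDataset(images, labels), batch_size=8)

scores, targets = [], []
with torch.no_grad():
    for x, y in loader:
        logits = model(x.to(device)).squeeze(1)
        scores.extend(torch.sigmoid(logits).cpu().tolist())
        targets.extend(y.tolist())

print("Pleural effusion AUC on synthetic images:", roc_auc_score(targets, scores))
```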
Implications and Future Research Directions
This research has several implications. Generating diverse, high-fidelity synthetic datasets from existing models can mitigate the scarcity of labeled medical data, a major barrier to training robust AI models for medical applications. This could enable cost-effective augmentation of datasets for supervised learning tasks, potentially accelerating the development of AI applications in diagnostics.
Future research should focus on enhancing the robustness and diversity of synthetic image generation, expanding beyond CXRs to other imaging modalities. Efforts to refine domain-specific metrics for evaluating synthetic images are essential. Additionally, integrating more nuanced clinical reports as conditioning inputs could elevate the representational accuracy and applicability of generated images.
In conclusion, the paper makes a substantial contribution to the integration of vision-language models into medical imaging, paving the way for future research to build on these foundations toward broader applications in healthcare AI.