
RoentGen: Vision-Language Foundation Model for Chest X-ray Generation (2211.12737v1)

Published 23 Nov 2022 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: Multimodal models trained on large natural image-text pair datasets have exhibited astounding abilities in generating high-quality images. Medical imaging data is fundamentally different to natural images, and the language used to succinctly capture relevant details in medical data uses a different, narrow but semantically rich, domain-specific vocabulary. Not surprisingly, multi-modal models trained on natural image-text pairs do not tend to generalize well to the medical domain. Developing generative imaging models faithfully representing medical concepts while providing compositional diversity could mitigate the existing paucity of high-quality, annotated medical imaging datasets. In this work, we develop a strategy to overcome the large natural-medical distributional shift by adapting a pre-trained latent diffusion model on a corpus of publicly available chest x-rays (CXR) and their corresponding radiology (text) reports. We investigate the model's ability to generate high-fidelity, diverse synthetic CXR conditioned on text prompts. We assess the model outputs quantitatively using image quality metrics, and evaluate image quality and text-image alignment by human domain experts. We present evidence that the resulting model (RoentGen) is able to create visually convincing, diverse synthetic CXR images, and that the output can be controlled to a new extent by using free-form text prompts including radiology-specific language. Fine-tuning this model on a fixed training set and using it as a data augmentation method, we measure a 5% improvement of a classifier trained jointly on synthetic and real images, and a 3% improvement when trained on a larger but purely synthetic training set. Finally, we observe that this fine-tuning distills in-domain knowledge in the text-encoder and can improve its representation capabilities of certain diseases like pneumothorax by 25%.

This paper introduces RoentGen, a vision-language foundation model adapted for Chest X-ray (CXR) generation using a domain-adaptation strategy on a pre-trained latent diffusion model. The model is trained on a corpus of publicly available CXR images and their corresponding radiology reports. The paper evaluates the model's ability to generate high-fidelity, diverse synthetic CXR conditioned on text prompts, and assesses the outputs quantitatively using image quality metrics and qualitatively via human domain experts.

The central hypothesis is that adapting a pre-trained latent diffusion model on a corpus of CXR images and their corresponding radiology reports can overcome the distributional shift between natural and medical images, and that the resulting model can generate high-fidelity, diverse synthetic CXR images controllable via free-form text prompts including radiology-specific language.

Key results and contributions include:

  • A framework to evaluate medical domain-adapted text-to-image models using domain-specific tasks such as classification, radiology report generation, and image-image and text-image retrieval.
  • A comparison of approaches to adapt Stable Diffusion (SD) to a new CXR data distribution. Fine-tuning both the U-Net and CLIP (Contrastive Language-Image Pre-Training) text encoder yields the highest image fidelity and conceptual correctness.
  • Evidence that the original CLIP text encoder can be replaced with a domain-specific text encoder, which improves the performance of the resulting stable diffusion model after fine-tuning, particularly when the text encoder is kept frozen and only the U-Net is trained.
  • Demonstration that the SD fine-tuning task can distill in-domain knowledge to the text encoder when trained along with the U-Net, improving its representational capabilities of medical concepts such as rare abnormalities. For example, in the specific case of pneumothorax, the text encoder's knowledge is fully recovered or even improved when a stronger learning rate of 1e-4 is used.
  • Evidence that RoentGen can be fine-tuned on a small subset (1.1-5.5k) of images and prompts for use as a data augmentation tool for downstream image classification tasks. Training with synthetic data only performed comparably to training with real data, while training jointly on real and synthetic data improved classification performance by 5% in the experimental setup. Furthermore, training on a larger, purely synthetic training set yielded a 3% improvement.

The authors systematically explore the domain-adaptation of an out-of-domain pretrained LDM for language-conditioned generation of medical images beyond the few- or zero-shot setting. The representational capacity of the SD pipeline was evaluated, quantified, and expanded, exploring different strategies for improving this general-domain pretrained foundational model for representing medical concepts specific to CXRs.

Generative Models for Chest X-Ray Generation

Previous approaches have been based on generative adversarial networks (GANs), developed for specific pathologies, and limited to single-modality (imaging-only) models. Two LDMs have been described for synthetic CXR generation. One approach demonstrated the feasibility of fine-tuning SD in a few-shot setting to generate synthetic images of single classes by text prompting, and that a CXR classifier trained on real radiographs was able to distinguish the inserted pathology with 95% accuracy. The other work showed the benefits of latent diffusion models in generating images across multiple individual pathologies, compared with GAN-based approaches. That paper focused on class-conditional image generation, and compared the performance of a classifier pretrained on real CXR data for a multi-label classification task on real and synthetic data (in the latter case showing a reduced classification performance with a mean AUROC of 72% (-9.7%)). No quantitative or qualitative metrics were reported to evaluate the CXR generation.

No prior work was found that evaluates the benefit of LDM-based synthetic CXRs for improving downstream tasks like image classification or segmentation, and no other work attempted conditioning on text prompts.

Methods

The paper leverages the publicly available MIMIC-CXR dataset, which contains 377,110 images and associated radiology reports. The dataset was filtered to include impression sections shorter than 77 tokens. The dataset was split into "PA train" (consisting exclusively of PA views) and "PA/AP/LAT train" (all views), and two test sets, "P19 test" (PA views) and "MIMIC test", using the official MIMIC split. The number of "No finding" reports was capped in each split to limit the imbalance of the dataset.
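
A minimal sketch of this filtering step, assuming the 77-token limit refers to the CLIP text encoder's context window; the data layout and field names are illustrative, not taken from the paper's code:

```python
from transformers import CLIPTokenizer

# Hypothetical sketch: keep only studies whose impression section fits within
# the 77-token context window of the CLIP text encoder used by SD v1.4.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

# Illustrative placeholder for the MIMIC-CXR studies.
studies = [
    {"dicom_id": "example-1", "view": "PA", "impression": "No acute cardiopulmonary process."},
    {"dicom_id": "example-2", "view": "AP", "impression": "Large right pneumothorax with mediastinal shift."},
]

def fits_context(impression: str, max_tokens: int = 77) -> bool:
    # Tokenize without truncation so over-length impressions are dropped, not clipped.
    return len(tokenizer(impression, truncation=False)["input_ids"]) <= max_tokens

pa_train = [s for s in studies if s["view"] == "PA" and fits_context(s["impression"])]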

Stable Diffusion Fine-Tuning

Stable Diffusion is a pipeline of models with three main components: the variational autoencoder (VAE), a conditional denoising U-Net, and a conditioning mechanism, such as a CLIP text encoder. The architecture was not modified except for disabling the built-in "safety checker".

Previous work investigated several approaches to fine-tune the SD pipeline for CXR generation in a few-shot setting. These include "Textual Inversion" (introducing new tokens), "Textual Projection" (replacing the CLIP text encoder with a domain-specific text encoder), and "DreamBooth" (unfreezing the U-Net).

In this work, the potential of SD to be fine-tuned or retrained on medical domain-specific images and prompts was explored, leveraging a large, radiology image-text dataset. For a set of images and prompts, the VAE, the text encoder, and the U-Net were leveraged, and an MSE loss was computed to train the different components of the SD pipeline.

For each text-image pair $(x_{text}, y_{pixel})$, random Gaussian noise $N$ is sampled in the latent space of dimensions $(h, w)$:

$$N \sim \mathcal{N}(\mathbf{0}_{h\times w}, \mathbf{I}_{(h\times w)^2})$$

where:

  • $N$ is the random Gaussian noise
  • $h$ is the height of the latent space
  • $w$ is the width of the latent space
  • $\mathcal{N}$ is the normal distribution
  • $\mathbf{0}_{h\times w}$ is a matrix of zeros with dimensions $h \times w$
  • $\mathbf{I}_{(h\times w)^2}$ is the identity matrix with dimensions $(h \times w)^2$

Using the text encoder and the VAE, both the prompt $x_{text}$ and the corresponding image $y_{pixel}$ are encoded, and the sampled noise $N$ is added to the latent representation of the latter for a random number of timesteps $t$. The U-Net processes this noisy latent representation $\mathit{VAE}(y_{pixel}) \oplus_t N$ along with the encoded conditioning prompt $\mathit{Enc}_{text}(x_{text})$ to predict the original sampled noise $\hat{N}$:

$$\hat{N} = \mathit{Unet}(\mathit{Enc}_{text}(x_{text}), \mathit{VAE}(y_{pixel}) \oplus_t N, t)$$

where:

  • $\hat{N}$ is the predicted noise
  • $\mathit{Unet}$ is the U-Net model
  • $\mathit{Enc}_{text}$ is the text encoder model
  • $x_{text}$ is the input text prompt
  • $\mathit{VAE}$ is the variational autoencoder
  • $y_{pixel}$ is the input image in pixel space
  • $N$ is the sampled noise
  • $t$ is the timestep

An MSE loss computed between the true and predicted noise $N$ and $\hat{N}$ is used to compute gradients and improve the generation capabilities of the combined VAE, text encoder, and U-Net:

$$\mathcal{L} = \frac{1}{h\times w}\sum_{i=0}^{h}\sum_{j=0}^{w}(\hat{N}_{i,j} - N_{i,j})^2$$

where:

  • $\mathcal{L}$ is the MSE loss
  • $h$ is the height of the latent space
  • $w$ is the width of the latent space
  • $\hat{N}_{i,j}$ is the predicted noise at pixel $(i, j)$
  • $N_{i,j}$ is the true noise at pixel $(i, j)$

The VAE component was kept frozen, and the experimental effort focused on exploring the U-Net component (fine-tuning or retraining from scratch) and the text encoder (frozen, unfrozen and trained jointly with the U-Net, or replaced with a domain-specific text encoder).
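
A rough sketch of one training step, following the standard diffusers text-to-image fine-tuning recipe rather than the authors' released code. The frozen VAE and the jointly trained U-Net + text encoder correspond to the best-performing variant described above; the learning rate and loading details are assumptions:

```python
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "CompVis/stable-diffusion-v1-4"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
noise_scheduler = DDPMScheduler.from_pretrained(model_id, subfolder="scheduler")

vae.requires_grad_(False)  # the VAE stays frozen throughout
# Joint U-Net + text-encoder fine-tuning variant; lr is an assumption based on the paper.
params = list(unet.parameters()) + list(text_encoder.parameters())
optimizer = torch.optim.AdamW(params, lr=5e-5)

def training_step(pixel_values: torch.Tensor, impressions: list[str]) -> float:
    # Encode images into the latent space and sample Gaussian noise N.
    latents = vae.encode(pixel_values).latent_dist.sample() * vae.config.scaling_factor
    noise = torch.randn_like(latents)
    timesteps = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                              (latents.shape[0],), device=latents.device)
    noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

    # Encode the impression-section prompts with the CLIP text encoder.
    tokens = tokenizer(impressions, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt").to(latents.device)
    encoder_hidden_states = text_encoder(tokens.input_ids)[0]

    # The U-Net predicts the added noise; MSE between true and predicted noise is the loss.
    noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
    loss = F.mse_loss(noise_pred, noise)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```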

Experiments were conducted on 64 A100 GPUs. Models were mostly trained in bf16 precision. At an image resolution of 512x512 px, models were trained with a batch size of 256. Model weights for the SD pipeline (version 1.4) were obtained from the repository "CompVis/stable-diffusion-v1-4". The code implementation was built on both the diffusers library and the ViLMedic library. Two domain-specific text encoders were used: RadBERT and SapBERT. In the experiments, guidance scale 4 and 75 inference steps with a PNDM noise scheduler enabled the generation of synthetic images properly conditioned on the associated prompts.
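
A minimal sketch of the reported inference setup (guidance scale 4, 75 PNDM steps, 512x512 px, safety checker disabled) using the diffusers API; the prompt is illustrative, and in practice the fine-tuned RoentGen weights would replace the base SD v1.4 checkpoint shown here:

```python
import torch
from diffusers import StableDiffusionPipeline, PNDMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",   # the fine-tuned RoentGen weights would be loaded here instead
    torch_dtype=torch.bfloat16,
    safety_checker=None,               # the built-in safety checker was disabled
)
pipe.scheduler = PNDMScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

prompt = "Small right-sided pleural effusion. No pneumothorax."  # illustrative impression-style prompt
image = pipe(prompt, guidance_scale=4.0, num_inference_steps=75,
             height=512, width=512).images[0]
image.save("synthetic_cxr.png")
```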

Fidelity and diversity of generated images

Fidelity was assessed using the Fréchet Inception Distance (FID) calculated from intermediate layers of three models: InceptionV3, CLIP-ViT-B-32, and an in-domain classification model trained to detect common pathologies in CXR (DenseNet-121, XRV). Generation diversity was assessed by calculating the pairwise multi-scale structural similarity index metric (MS-SSIM) of four generated samples per prompt.
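
A hedged sketch of how these two metrics could be computed with torchmetrics; the paper additionally computes FID on CLIP-ViT-B-32 and in-domain XRV DenseNet-121 features, which would require custom feature extractors not shown here:

```python
import itertools
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image import MultiScaleStructuralSimilarityIndexMeasure

fid = FrechetInceptionDistance(feature=2048, normalize=True)  # InceptionV3 features
ms_ssim = MultiScaleStructuralSimilarityIndexMeasure(data_range=1.0)

def update_fid(real_batch: torch.Tensor, fake_batch: torch.Tensor) -> None:
    # Both batches are float tensors in [0, 1] with shape (B, 3, H, W).
    fid.update(real_batch, real=True)
    fid.update(fake_batch, real=False)

def pairwise_diversity(samples: torch.Tensor) -> float:
    # Mean pairwise MS-SSIM over the four samples generated for one prompt;
    # lower values indicate more diverse generations.
    scores = [ms_ssim(samples[i:i + 1], samples[j:j + 1])
              for i, j in itertools.combinations(range(samples.shape[0]), 2)]
    return torch.stack(scores).mean().item()
```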

After 1k training steps, FID scores improved over the two baseline approaches, original SD and DreamBooth SD. As the number of training steps grew to 12.5k, FID scores slightly deteriorated. 60k steps provided the best results when using a learning rate of 5e-5, with an XRV FID of 3.6 and an InceptionV3 FID of 54.9. Over 1k training steps, randomly initializing the U-Net and training it along with the text encoder led to a 50% deterioration compared to the continued fine-tuning equivalent. After 60k steps, the randomly-initialized U-Net variant achieves an XRV FID of 4.9. Training the U-Net alone from a random initialization showed limitations and only achieved an XRV FID of 16.5, whereas training the U-Net alone from the original SD weights yielded an XRV FID of 9.2. After 60k training steps, the model using SapBERT achieved an XRV FID of 6.0 and the model using RadBERT an XRV FID of 6.7, compared to the random U-Net-only model, which only scored an XRV FID of 16.5.

Factual correctness of generated images

To test the generative models, pre-trained multimodal models were leveraged to benefit from an evaluation at the intersection of vision and language. Models that either generate text from images or encode medical text were used to report semantic, fine-grained evaluations.

Multi-label classification

Using the impression sections from the p19 test set, different fine-tuned SD models were queried to produce synthetic images that reflect the abnormalities of the corresponding impression sections, as labeled by CheXpert. A pre-trained classification model (DenseNet-121, XRV) was used to classify both the real images and the synthetic images. The original SD pipeline yields an AUROC of approximately 0.5, and a few-shot trained model ("DreamBooth SD") only scores a (filtered average) AUROC of 0.61. Fine-tuning with lr = 5e-5 for 1k steps achieved a filtered average AUROC of 0.82 versus an AUROC of 0.81 for 60k steps, the difference being even larger with lr = 1e-4.
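
A sketch of this evaluation under the assumption that the torchxrayvision DenseNet-121 ("densenet121-res224-all" weights) is used as the pre-trained classifier; preprocessing and label handling are simplified:

```python
import numpy as np
import torch
import torchxrayvision as xrv
from sklearn.metrics import roc_auc_score

# Pre-trained XRV DenseNet-121 scores the synthetic images; per-finding AUROC is then
# computed against the CheXpert labels of the prompts used to generate them.
model = xrv.models.DenseNet(weights="densenet121-res224-all")
model.eval()

@torch.no_grad()
def score_images(images: torch.Tensor) -> np.ndarray:
    # images: (N, 1, 224, 224), already normalized following the XRV convention (assumption).
    return model(images).cpu().numpy()

def filtered_average_auroc(labels: np.ndarray, probs: np.ndarray) -> float:
    # Average AUROC over findings that actually occur in the label set.
    aurocs = [roc_auc_score(labels[:, k], probs[:, k])
              for k in range(labels.shape[1])
              if labels[:, k].min() != labels[:, k].max()]
    return float(np.mean(aurocs))
```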

Radiology Report Generation

The task of Radiology Report Generation (RRG) consists of building assistive systems that take X-ray images of a patient and generate a textual report describing clinical observations in the images. A model pre-trained on MIMIC-CXR was leveraged. Images were generated with each model for every ground-truth impression of the MIMIC-CXR test set; these generated images were fed to the pre-trained RRG model, which outputs new impressions, and the new impressions were compared with the ground-truth impressions used to generate the images.

Zero-shot Image-Image Retrieval

This evaluation is similar to the conventional content-based image retrieval setting in which images of a particular category are searched for using a representative query image. A group of query images and a larger collection of candidate images, each with a class label, are given to a pretrained CNN encoder. Each query and candidate image is encoded with this encoder, and then for each query, all candidates are ranked by their cosine similarities to the query in descending order.
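
A minimal sketch of the ranking step, with the pretrained CNN encoder left abstract; the function name is illustrative:

```python
import torch
import torch.nn.functional as F

def rank_candidates(query_emb: torch.Tensor, candidate_embs: torch.Tensor) -> torch.Tensor:
    # query_emb: (D,), candidate_embs: (M, D); returns candidate indices sorted by
    # descending cosine similarity to the query.
    sims = F.cosine_similarity(query_emb.unsqueeze(0), candidate_embs, dim=-1)
    return torch.argsort(sims, descending=True)
```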

Zero-shot Image-Text Retrieval

This task is similar to the Image-Image scenario, with the difference that a query image embedding is mapped into a textual embedding space to retrieve the most likely impression given the image.

Qualitative evaluation

Two radiologists were asked to review and rate blinded pairs of true and synthetic images, and pairs of synthetic images and original prompts. The ratings given by the two radiologists averaged 1.67 ± 0.63 and 1.81 ± 0.46. The second experiment yielded average ratings of 0.41 ± 1.41 and 0.29 ± 1.36.

Data augmentation

To investigate the added value of creating synthetic CXR, a DenseNet-121 classifier was trained from scratch on varying splits of real training data (R) and synthetic data (S). The task was a multi-label classification of six findings (cardiomegaly, edema, pleural effusion, pneumonia, pneumothorax and 'no finding').
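
A hedged sketch of this setup with torchvision's DenseNet-121; the placeholder tensors stand in for the real (R) and synthetic (S) datasets, and all hyperparameters are assumptions:

```python
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset
from torchvision.models import densenet121

NUM_FINDINGS = 6  # cardiomegaly, edema, pleural effusion, pneumonia, pneumothorax, no finding

# Placeholder tensors standing in for the real (R) and synthetic (S) image/label data.
real_dataset = TensorDataset(torch.randn(1100, 3, 224, 224),
                             torch.randint(0, 2, (1100, NUM_FINDINGS)).float())
synthetic_dataset = TensorDataset(torch.randn(1100, 3, 224, 224),
                                  torch.randint(0, 2, (1100, NUM_FINDINGS)).float())

model = densenet121(weights=None)  # trained from scratch, no ImageNet pretraining
model.classifier = torch.nn.Linear(model.classifier.in_features, NUM_FINDINGS)

train_loader = DataLoader(ConcatDataset([real_dataset, synthetic_dataset]),
                          batch_size=64, shuffle=True)
criterion = torch.nn.BCEWithLogitsLoss()  # multi-label objective over the six findings
```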

Training exclusively on 1.1k synthetic images derived from a model fine-tuned on the small dataset resulted in a drop of 0.04 in AUROC compared to the baseline. Training exclusively on 5× the initial amount of synthetic data yielded a small improvement over the baseline (AUROC +0.02). Augmenting the real data of the small dataset with the same amount of synthetic data (1.1k) led to a moderate improvement (AUROC +0.04), but further augmenting the small dataset with 5.5k synthetic samples led to a smaller increase. Adding more training data (30k) improved classification performance (AUROC +0.09), as did training exclusively on 30k synthetic samples generated by the model fine-tuned on the larger dataset (AUROC +0.07). Finally, the highest improvement in classification performance was reached by training on a combination of real and synthetic data (AUROC +0.11 vs. an AUROC of 0.73 in the baseline setup).

Distilling in-domain knowledge and potential catastrophic forgetting

By fine-tuning the U-Net alone or both the U-Net and the text encoder on the chest X-ray domain, in-domain knowledge can be distilled into the components of the SD model. CheXpert@10 scores measure the in-domain knowledge of the text encoder. Fine-tuning both the text encoder and the U-Net accelerates the learning of in-domain concepts but also the forgetting of the previous domain knowledge. The SD task can improve the performance of the text encoder on an in-domain task, in this case as measured by the macro-averaged CheXpert@10 score.

Limitations

Limitations of the proposed approach include:

  1. The CXR images generated by RoentGen are images and not actual radiographs, and come with a limited range of gray-scale values, preventing the use of operations like realistic windowing.
  2. Only one dataset (MIMIC-CXR), from a single institution, was used to fine-tune and evaluate RoentGen.
  3. Only the impression sections from the radiology reports associated with each image were used to train the model.
  4. The model was prone to overfitting when trained on small datasets of a few hundreds of images.
  5. Fine-tuning both the U-Net and the text encoder can lead to catastrophic forgetting.

Conclusion and future work

The latent diffusion model Stable Diffusion can be domain-adapted to generate high-fidelity yet diverse medical CXR images. The best-performing model allows fine-grained control over the generated output by using free-form, natural language text as input, including relevant medical vocabulary.

The best performance was observed after jointly fine-tuning both the pretrained U-Net and the text encoder. Replacing the frozen CLIP text encoder with a domain-specific text encoder improves performance when training the U-Net from scratch. The authors developed an evaluation framework that can assess the medical correctness of synthetic images through various downstream applications, such as radiology report generation or image-image and image-text retrieval. Stable Diffusion fine-tuning can distill in-domain knowledge into its components, in particular the text encoder, improving its representation capabilities on in-domain data.

Future research will focus on expanding the work to other image types and modalities, extending the medical information a fine-tuned Stable Diffusion model can retain. In particular, fine-tuning strategies that limit catastrophic forgetting will be further investigated.
