MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations
(2503.01019v3)
Published 2 Mar 2025 in cs.CV and cs.AI
Abstract: Despite significant progress in Vision-Language Pre-training (VLP), current approaches predominantly emphasize feature extraction and cross-modal comprehension, with limited attention to generating or transforming visual content. This gap hinders the model's ability to synthesize coherent and novel visual representations from textual prompts, thereby reducing the effectiveness of multi-modal learning. In this work, we propose MedUnifier, a unified VLP framework tailored for medical data. MedUnifier seamlessly integrates text-grounded image generation capabilities with multi-modal learning strategies, including image-text contrastive alignment, image-text matching and image-grounded text generation. Unlike traditional methods that rely on continuous visual representations, our approach employs visual vector quantization, which not only facilitates a more cohesive learning strategy for cross-modal understanding but also enhances multi-modal generation quality by effectively leveraging discrete representations. Our framework's effectiveness is evidenced by the experiments on established benchmarks, including uni-modal tasks (supervised fine-tuning), cross-modal tasks (image-text retrieval and zero-shot image classification), and multi-modal tasks (medical report generation, image synthesis), where it achieves state-of-the-art performance across various tasks. MedUnifier also offers a highly adaptable tool for a wide range of language and vision tasks in healthcare, marking advancement toward the development of a generalizable AI model for medical applications.
Summary
The paper introduces MedUnifier, a unified vision-language pre-training framework for medical data that integrates text-grounded image generation using discrete visual representations.
MedUnifier employs discrete visual representations and a text-grounded image generation task to enable high-quality medical image synthesis and enhance multi-modal understanding.
The model achieves state-of-the-art performance across uni-modal, cross-modal, and multi-modal medical tasks, demonstrating adaptability for report generation and dataset augmentation.
The paper introduces MedUnifier, a unified vision-language pre-training (VLP) framework tailored for medical data, integrating text-grounded image generation capabilities with multi-modal learning strategies. The framework employs visual vector quantization to leverage discrete representations, enhancing both cross-modal understanding and multi-modal generation quality. The model achieves SOTA performance across uni-modal, cross-modal, and multi-modal tasks.
The MedUnifier framework incorporates learnable embeddings within a Transformer model, drawing inspiration from BLIP-2, and introduces a text-grounded image generation (TIG) loss, leveraging vector quantization for discrete visual representation learning. A novel latent adapter connects the base model with the image generation module, enabling co-training with image-text contrastive (ITC), image-text matching (ITM), and image-grounded text generation (ITG) losses.
The main contributions include:
The MedUnifier framework which unifies the VLP paradigm with language-guided visual generation
Discrete visual representation learning with a bridging design to guide visual outputs and enhance data interpretation
Performance enhancements on Chest X-rays across uni-modality, cross-modality, and multi-modality tasks
Adaptability in generating realistic medical images and reports, augmenting out-of-distribution datasets
A TIG module to capture fine-grained details by recovering pixel-level information from hierarchical multi-modal representations
Related Work
The paper discusses related work in two primary areas: VLP and text-to-image (T2I) generation. Existing VLP models are categorized into those using uni-modal encoders and those using fusion encoder-based structures. The paper argues that current VLP approaches give little consideration to generating visual information or exploring fine-grained visual content. Recent T2I approaches based on GANs, auto-regressive transformers, variational auto-encoders (VAEs), vector-quantized VAEs (VQ-VAEs), and diffusion models are also discussed. The paper adopts VQ-VAEs to learn robust representations, enhancing the quality and efficiency of medical image generation.
Method
The MedUnifier framework aggregates four key learning objectives for Med-VLP. The overall pre-training objective function is defined as
$\mathcal{L}_{total} = \sum_{m=1}^{M} \lambda_m \mathcal{L}_m\big(H_m(F(X_I, X_T))\big)$
where
$F$ represents the backbone that takes the paired input $[X_I, X_T]$.
$H_m$ stands for the task-specific module that further encodes visual and textual features.
$\mathcal{L}_m$ and $\lambda_m$ are the individual loss functions and their weights in the overall loss, with $M$ the total number of loss functions. A minimal sketch of this weighted objective follows.
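In this sketch, `backbone`, `task_heads`, `loss_fns`, and `weights` are illustrative placeholder names rather than the authors' actual module names, and loss targets are omitted for brevity.

```python
import torch

def total_pretraining_loss(backbone, task_heads, loss_fns, weights, x_img, x_txt):
    """Weighted sum over M task losses: sum_m lambda_m * L_m(H_m(F(X_I, X_T)))."""
    feats = backbone(x_img, x_txt)               # F(X_I, X_T): shared multi-modal features
    total = torch.zeros((), device=x_img.device)
    for H_m, L_m, lam_m in zip(task_heads, loss_fns, weights):
        total = total + lam_m * L_m(H_m(feats))  # lambda_m * L_m(H_m(.)); targets omitted
    return total
```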
The model consists of an image-text encoder, a text generator, and an image generator with cross-attention layers, masking strategies, and vector discretization.
Image-text encoder
A BERT-styled Transformer is used as the image-text encoder network. The input contains learnable embeddings and clinical reports tokenized by words. Input images are processed into a set of patch embeddings using a pre-trained, frozen Vision Transformer (ViT). The initial visual embeddings engage with the image-text encoder network through cross-attention layers.
Text generator
The text encoder of the image-text encoder is duplicated as a language-generative network with shared weights. A decoding head is added to map each word token embedding to the vocabulary dictionary.
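As a hedged sketch, the decoding head can be pictured as a single linear projection from token embeddings to vocabulary logits; the class name and dimensions below are illustrative, not taken from the paper.

```python
import torch.nn as nn

class DecodingHead(nn.Module):
    """Maps each word-token embedding to logits over the vocabulary dictionary."""
    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(d_model, vocab_size)

    def forward(self, token_embeddings):        # (batch, L_t, d_model)
        return self.proj(token_embeddings)      # (batch, L_t, vocab_size) logits
```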
Image generator
A vector-quantized variational auto-encoder (VQ-VAE) is integrated within a cross-modal interactive fusion framework to generate high-quality synthetic visual content.
Given an image $x_i \in \mathbb{R}^{C \times H \times W}$, the entire image is divided into $L_v$ patches of spatial size $(h, w)$, and learnable positional encodings are added:
$X_i = [\boldsymbol{p}_{[CLS]}, \boldsymbol{p}_1, \boldsymbol{p}_2, \dots, \boldsymbol{p}_{L_v}] + E^{v}_{pos}$
where
$\boldsymbol{p} \in \mathbb{R}^{d_v}$ stands for an input patch embedding
$E^{v}_{pos} \in \mathbb{R}^{(1+L_v)\times d_v}$ is the learnable positional encoding
$L_v = \frac{H}{h} \cdot \frac{W}{w}$
These patch embeddings are passed through a standard pre-trained ViT-g, denoted as $E_I$, to obtain preliminary visual embeddings $\boldsymbol{f}^{v}\in \mathbb{R}^{(L_{v}+1)\times d_{v}}$:
$\boldsymbol{f}^{v}_{[CLS]}$ is the global visual feature
the outputs of the patch embeddings $\boldsymbol{f}^{v}_{local} \in \mathbb{R}^{L_v\times d_v}$ represent local visual features
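The visual pipeline above (patchify, prepend a [CLS] token, add positional encodings, encode with a frozen ViT) can be sketched as follows. This is an illustrative approximation: `FrozenViT`-style usage is assumed, i.e. the `vit` module stands in for the pre-trained ViT-g ($E_I$) and is assumed to accept token embeddings directly, and the dimensions are placeholders.

```python
import torch
import torch.nn as nn

class VisualBranch(nn.Module):
    def __init__(self, vit: nn.Module, in_ch=3, patch=16, img=224, d_v=1024):
        super().__init__()
        n_patches = (img // patch) ** 2                                    # L_v = (H/h) * (W/w)
        self.patch_embed = nn.Conv2d(in_ch, d_v, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_v))              # p_[CLS]
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, d_v))  # E^v_pos
        self.vit = vit                                                     # frozen ViT-g (E_I)
        for p in self.vit.parameters():
            p.requires_grad_(False)

    def forward(self, x):                                            # x: (B, C, H, W)
        patches = self.patch_embed(x).flatten(2).transpose(1, 2)     # (B, L_v, d_v)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, patches], dim=1) + self.pos_embed   # X_i
        f_v = self.vit(tokens)                                       # (B, L_v + 1, d_v)
        return f_v[:, 0], f_v[:, 1:]                                 # f^v_[CLS], f^v_local
```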
For the corresponding textual input, the text is tokenized into word token embeddings, and learnable positional encodings are added:
$X_t = [\boldsymbol{w}_{[SPE]}, \boldsymbol{w}_1, \boldsymbol{w}_2, \dots, \boldsymbol{w}_{L_t}] + E^{t}_{pos}$
where
$\boldsymbol{w} \in \mathbb{R}^{d_t}$ represents a word token embedding
$E^{t}_{pos} \in \mathbb{R}^{(1+L_t)\times d_t}$ is the learnable positional encoding
To enable interaction between word token embeddings and preliminary visual embeddings, a set of learnable embeddings, denoted as $\boldsymbol{Q} = [\boldsymbol{q}_1, \boldsymbol{q}_2, \dots, \boldsymbol{q}_{L_q}]$ with $\boldsymbol{Q}\in \mathbb{R}^{L_q\times d_{q}}$, is constructed. The word token embeddings and learnable embeddings share the same feature dimension, i.e. $d_t = d_q$. Then, $\boldsymbol{Q}$ and $X_t$ are concatenated to form the input of the image-text encoder, denoted as $E_Q$, which encodes it to produce the output embeddings:
$E_Q([\boldsymbol{Q}, X_t]) = [\boldsymbol{f}^{q}, \boldsymbol{f}^{t}] = [\boldsymbol{f}^{q}, \boldsymbol{f}^{t}_{[SPE]}, \boldsymbol{f}^{t}_{local}]$
where
$\boldsymbol{f}^{q} \in \mathbb{R}^{L_q\times d_q}$ denotes the learned query embeddings
$\boldsymbol{f}^{t}_{[SPE]} \in \mathbb{R}^{d_t}$ and $\boldsymbol{f}^{t}_{local} \in \mathbb{R}^{L_t\times d_t}$ represent the special text representation and all word token embeddings, respectively
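A hedged sketch of this query-text fusion step: the learnable embeddings $\boldsymbol{Q}$ are concatenated with the word token embeddings, passed through the image-text encoder $E_Q$ (treated here as a black box that cross-attends to the visual features), and the output is split back into $\boldsymbol{f}^{q}$, $\boldsymbol{f}^{t}_{[SPE]}$, and $\boldsymbol{f}^{t}_{local}$. The encoder interface and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class QueryTextFusion(nn.Module):
    def __init__(self, encoder: nn.Module, num_queries=32, d=768):
        super().__init__()
        self.encoder = encoder                                        # E_Q: BERT-style, cross-attends to f^v
        self.queries = nn.Parameter(torch.zeros(1, num_queries, d))   # learnable embeddings Q

    def forward(self, word_embeds, visual_feats):
        # word_embeds: (B, 1 + L_t, d) = [w_[SPE], w_1..w_Lt]; visual_feats: (B, 1 + L_v, d_v)
        q = self.queries.expand(word_embeds.size(0), -1, -1)
        out = self.encoder(torch.cat([q, word_embeds], dim=1), visual_feats)
        f_q = out[:, : q.size(1)]                               # f^q: learned query embeddings
        f_spe = out[:, q.size(1)]                               # f^t_[SPE]: special text token
        f_local = out[:, q.size(1) + 1 :]                       # f^t_local: word token embeddings
        return f_q, f_spe, f_local
```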
Image-text contrastive learning (ITC)
This task aligns visual and textual representations by maximizing their mutual information through a contrastive approach. The pairwise similarity between each projected visual representation $\boldsymbol{g}^{q}$ and the textual representation $\boldsymbol{g}^{t}$ is computed, and the highest similarity is taken as the image-text similarity for the bi-directional contrastive loss, in which
$\tau \in \mathbb{R}$ is a scaling temperature parameter initialized to 0.07
$N$ is the mini-batch size and $\langle \cdot, \cdot \rangle$ denotes cosine similarity
The overall ITC loss is defined as:
$\mathcal{L}_{itc} = \frac{1}{2}\left(\mathcal{L}_{itc}^{(q\mid t)} + \mathcal{L}_{itc}^{(t\mid q)}\right)$
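A minimal sketch of the bi-directional ITC loss, assuming `g_q` and `g_t` have already been projected into a shared space; following the description above, the maximum query-text similarity is used as the image-text similarity, and an InfoNCE-style cross-entropy is applied in both directions.

```python
import torch
import torch.nn.functional as F

def itc_loss(g_q, g_t, temperature=0.07):
    # g_q: (N, L_q, d) projected query embeddings; g_t: (N, d) projected text embeddings
    g_q = F.normalize(g_q, dim=-1)
    g_t = F.normalize(g_t, dim=-1)
    # cosine similarity of every query with every text in the batch, max over queries
    sim = torch.einsum("iqd,jd->ijq", g_q, g_t).max(dim=-1).values / temperature  # (N, N)
    targets = torch.arange(sim.size(0), device=sim.device)
    loss_q2t = F.cross_entropy(sim, targets)        # image (queries) -> text direction
    loss_t2q = F.cross_entropy(sim.t(), targets)    # text -> image (queries) direction
    return 0.5 * (loss_q2t + loss_t2q)
```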
Image-text matching (ITM)
This task learns a precise alignment between visual and textual representations by training a model to classify image-text pairs as either positive or negative in a binary classification framework. The Image-Text Matching (ITM) loss is computed as:
$\mathcal{L}_{itm} = \frac{1}{N}\sum_{k=1}^{N} -\log p\big(Y_k \mid \hat{Y}_k\big)$
where $\hat{Y}$ is defined as:
$\hat{Y} = \frac{1}{L_q}\sum_{i=1}^{L_q} H_{itm}\big(\boldsymbol{f}^{q}_{i}\big)$
and $Y$ represents the ground-truth labels within the mini-batch, constructed via hard negative sample mining.
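A sketch of the ITM head under this formulation: each query embedding is scored by a two-way classifier $H_{itm}$, the logits are averaged over the $L_q$ queries, and cross-entropy is taken against labels obtained from hard negative mining (the mining step itself is omitted here).

```python
import torch.nn as nn
import torch.nn.functional as F

class ITMHead(nn.Module):
    def __init__(self, d=768):
        super().__init__()
        self.classifier = nn.Linear(d, 2)            # H_itm: matched vs. not matched

    def forward(self, f_q):                          # f_q: (N, L_q, d)
        return self.classifier(f_q).mean(dim=1)      # average logits over queries -> (N, 2)

def itm_loss(logits, labels):
    # labels: (N,) with 1 for positive image-text pairs, 0 for mined hard negatives
    return F.cross_entropy(logits, labels)
```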
Image-grounded text generation (ITG)
This task trains the model to generate text conditioned on paired images using causal language modeling (CLM). The learning objective is formalized as:
$\mathcal{L}_{itg} = \frac{1}{N L_t}\sum_{k=1}^{N}\sum_{i=1}^{L_t} -\log(p_i)$
where $p_i$ is defined as:
$p_i = \mathrm{Softmax}\big(H_{itg}(\boldsymbol{f}^{t}_{local})\big) = p(w_i \mid \boldsymbol{Q}, w_1, \dots, w_{i-1})$
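The ITG objective is a standard causal language-modeling cross-entropy over the report tokens, conditioned on the learned queries. A hedged sketch, assuming `lm_logits` come from the text generator's decoding head and `pad_id` marks padding:

```python
import torch.nn.functional as F

def itg_loss(lm_logits, target_ids, pad_id=0):
    # lm_logits: (N, L_t, vocab); target_ids: (N, L_t)
    # shift so that position i-1 predicts token w_i given Q, w_1, ..., w_{i-1}
    logits = lm_logits[:, :-1].reshape(-1, lm_logits.size(-1))
    targets = target_ids[:, 1:].reshape(-1)
    return F.cross_entropy(logits, targets, ignore_index=pad_id)
```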
Text-grounded image generation (TIG)
The TIG module is designed for the text-grounded image generation task and is integrated with the image-text encoder and text generator. At the top level, a latent adapter, denoted as $\mathcal{Z}^{top}$, transforms $z^{top}$ into a spatial feature map $z^{top}_{e}$:
$z^{top}_{e} = \mathcal{Z}^{top}(z^{top})$
where $\mathcal{Z}^{top}$ consists of a nonlinear transformation, a spatial positional encoding summer, and a residual block. A vector quantization layer with a latent embedding space $e^{top}$ then produces a discrete feature map $z^{top}_{q}$:
$z^{top}_{q} = \mathrm{quantizer}^{top}(z^{top}_{e})$
At the bottom level, a latent adapter and a vector quantizer with a latent embedding space $e^{bottom}$ are deployed to obtain a discrete feature map:
$z^{bottom}_{e} = \mathcal{Z}^{bottom}(z^{bottom})$
$z^{bottom}_{q} = \mathrm{quantizer}^{bottom}(z^{bottom}_{e}, z^{top}_{q})$
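Each quantizer follows the standard VQ-VAE recipe: nearest-codebook lookup with a straight-through estimator so that gradients flow back to the latent adapter. The sketch below shows a single-level quantizer only; the conditioning of the bottom quantizer on the top-level code is omitted, and the codebook size and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes=512, code_dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)     # latent embedding space e
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z_e):                                   # z_e: (B, D, H, W), D == code_dim
        flat = z_e.permute(0, 2, 3, 1).reshape(-1, z_e.size(1))        # (B*H*W, D)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)    # nearest code per position
        z_q = self.codebook(idx).view(z_e.size(0), z_e.size(2), z_e.size(3), -1)
        z_q = z_q.permute(0, 3, 1, 2)                                  # back to (B, D, H, W)
        z_q_st = z_e + (z_q - z_e).detach()    # straight-through estimator for gradients
        return z_q_st, z_q
```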
A hierarchical decoder $D$ with deconvolutional layers is built to recover the raw visual input from the discrete multi-modal representations:
$\hat{x}_{i} = D(z^{top}_{q}, z^{bottom}_{q})$
The text-grounded image generation (TIG) loss is formulated such that the negative log-likelihood term can be written as the mean square error (MSE) $\lVert x^{i}_{k} - \hat{x}^{i}_{k} \rVert^{2}$, with $sg[\cdot]$ denoting the stop-gradient operation.
The hyper-parameters $\beta_1$ and $\beta_2$ are both set to 0.5.
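A hedged sketch of the TIG loss consistent with this description: a pixel-level MSE reconstruction term plus codebook and commitment terms using the stop-gradient $sg[\cdot]$, with $\beta_1 = \beta_2 = 0.5$. Exactly how $\beta_1$ and $\beta_2$ are attached to these terms in the paper is an assumption here.

```python
import torch.nn.functional as F

def tig_loss(x, x_hat, z_e_top, z_q_top, z_e_bot, z_q_bot, beta1=0.5, beta2=0.5):
    recon = F.mse_loss(x_hat, x)                          # ||x - x_hat||^2 (pixel-level MSE)
    codebook = (F.mse_loss(z_q_top, z_e_top.detach())     # ||sg[z_e] - z_q||^2, both levels
                + F.mse_loss(z_q_bot, z_e_bot.detach()))
    commit = (F.mse_loss(z_e_top, z_q_top.detach())       # ||z_e - sg[z_q]||^2, both levels
              + F.mse_loss(z_e_bot, z_q_bot.detach()))
    return recon + beta1 * codebook + beta2 * commit
```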
Total learning objectives
The ultimate objective function is:
$\mathcal{L}_{total} = \lambda_1\mathcal{L}_{itc} + \lambda_2\mathcal{L}_{itm} + \lambda_3\mathcal{L}_{itg} + \lambda_4\mathcal{L}_{tig}$
All four loss weights $\lambda_{1},\dots,\lambda_{4}$ were set to 1 in the experiments.
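Combining the four sketched objectives with the reported weights gives the final training loss:

```python
def medunifier_total_loss(l_itc, l_itm, l_itg, l_tig, lam=(1.0, 1.0, 1.0, 1.0)):
    # all four lambdas were set to 1 in the experiments
    return lam[0] * l_itc + lam[1] * l_itm + lam[2] * l_itg + lam[3] * l_tig
```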
Experiments
The pre-training is performed on the MIMIC-CXR v2.0.0 dataset, and the model is evaluated on various downstream tasks.
Implementation details
A BERT model is used as the primary network for the image-text encoder, and ViT-g is used as the pre-trained ViT. The input image resolution is set to 224×224, with a maximum text length of 95 tokens and 32 learnable embeddings. For optimization, the AdamW optimizer is applied with specific parameters.
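The reported pre-training settings can be collected into a small configuration sketch; the specific AdamW hyper-parameters are not reproduced here since the summary does not list them.

```python
from dataclasses import dataclass

@dataclass
class PretrainConfig:
    image_size: int = 224            # input resolution 224 x 224
    max_text_len: int = 95           # maximum report length in tokens
    num_query_tokens: int = 32       # number of learnable embeddings
    vision_backbone: str = "ViT-g"   # pre-trained, kept frozen
    text_backbone: str = "BERT"
    optimizer: str = "AdamW"         # specific optimizer hyper-parameters not reported here
```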
Medical Vision-and-Language Benchmark
The effectiveness of the proposed method is assessed across uni-modal, cross-modal, and multi-modal tasks. The experiments are conducted on MedUnifier with/without TIG loss and compared with previous studies.
Results and Analyses
The model outperforms prior studies on uni-modal tasks across various downstream datasets. For cross-modal retrieval, MedUnifier achieves the highest performance, and it also performs better on zero-shot classification for both the MIMIC 5x200 and RSNA datasets. Both variants, with and without the TIG module, surpass previous methods on image-grounded medical report generation, and the Med-VLP framework gains substantial advantages from incorporating causal language modeling. The reconstructed visual samples are nearly indistinguishable from authentic radiographs, and synthetic samples generated from multi-modal priors demonstrate high diversity.
Ablation study
An ablation study across the various learning objectives indicates that using only the ITC loss yields the lowest performance, and the model trained with TIG surpasses the one trained with ITG. Integrating all objective types enables the model to achieve optimal performance.
Conclusion
The paper introduces MedUnifier, a unified Med-VLP model, which optimizes four distinct learning objectives simultaneously. The framework circumvents the need to learn visual embeddings from scratch and reconstructs pixel-level visual details from both image and report. The proposed method effectively complements existing Med-VLP frameworks and achieves SOTA performance on various downstream tasks.