The paper introduces the Dual Diffusion Transformer (D-DiT), a dual-branch diffusion model that unifies image and text diffusion for both text-to-image (T2I) generation and image-to-text (I2T) tasks under a joint denoising diffusion training loss. This end-to-end cross-modal diffusion model is built on the multimodal diffusion transformer (MM-DiT) architecture and achieves competitive performance on image generation, captioning, and visual question answering, improving on the capabilities of prior multimodal diffusion models.
Here's a more detailed breakdown:
- The paper addresses the limitation that existing diffusion models lag behind autoregressive vision-language models on visual understanding tasks.
- The core idea is a cross-modal maximum likelihood estimation framework that jointly learns the conditional likelihoods of both images and text under a single loss function.
- The D-DiT model is based on the MM-DiT architecture, modified to output diffusion targets for both modalities: continuous latent-space diffusion on the image branch and discrete masked-token diffusion on the text branch (a minimal architecture sketch appears after this list).
- A joint training objective is proposed that combines continuous and discrete diffusion: flow matching is used to learn the conditional distribution of images, and masked diffusion is used to learn the conditional distribution of text. The overall dual-modality training loss is a weighted combination of the single-modality diffusion losses (a training-loss sketch appears after this list):

  $$\mathcal{L}_{\text{dual}} = \mathcal{L}_{\text{img}} + \lambda\,\mathcal{L}_{\text{text}}$$

  where:
  - $\mathcal{L}_{\text{dual}}$ is the dual-modality training loss
  - $\mathcal{L}_{\text{img}}$ is the image diffusion loss
  - $\mathcal{L}_{\text{text}}$ is the text diffusion loss
  - $\lambda$ is a weighting hyperparameter
- Three types of sampling-based inference are introduced: text-to-image generation, image-to-text generation, and image-to-text in-filling (an image-to-text sampling sketch appears after this list).
- Experiments were conducted to evaluate the performance of the proposed model on multi-modal understanding and text-to-image generation tasks. The model was trained in three stages on publicly available datasets.
- For text-to-image generation, classifier-free guidance (CFG) is used (a guidance sketch appears after this list).
- The visual understanding capabilities of D-DiT are evaluated on question answering benchmarks such as VQAv2, VizWiz, OKVQA, GQA, POPE, and MME. As a diffusion-only multi-modal model, D-DiT already achieves performance competitive with recent I2T + T2I models.
- The fine-tuned D-DiT is shown to preserve the text-to-image performance of the original SD3 model and to improve on some metrics, such as color accuracy, after joint training.
- Ablation studies were conducted to analyze the impact of different components and configurations on the model's performance.
- The paper compares D-DiT against other multi-modal models, including I2T-only and I2T + T2I models. The results indicate that D-DiT compares favorably with the latter category.
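The dual-branch architecture described above can be made concrete with a minimal sketch. Everything here is an illustrative assumption rather than the authors' implementation: the MM-DiT-style backbone is treated as a given module that jointly attends over image and text tokens, and two lightweight heads stand in for the continuous (image) and discrete (text) diffusion outputs.

```python
import torch.nn as nn

class DualDiffusionTransformer(nn.Module):
    """Sketch of a dual-branch denoiser; names and shapes are assumptions."""

    def __init__(self, backbone: nn.Module, width: int, latent_dim: int, vocab_size: int):
        super().__init__()
        self.backbone = backbone                        # MM-DiT-like joint image/text transformer
        self.image_head = nn.Linear(width, latent_dim)  # continuous (flow/velocity) target for image latents
        self.text_head = nn.Linear(width, vocab_size)   # categorical logits for masked text tokens

    def forward(self, noisy_image_latents, image_t, masked_text_tokens, text_t):
        # The backbone is assumed to attend jointly over both modalities,
        # conditioned on per-modality diffusion timesteps, and to return
        # per-token features for each branch.
        img_feats, txt_feats = self.backbone(
            noisy_image_latents, masked_text_tokens, image_t, text_t
        )
        velocity_pred = self.image_head(img_feats)   # image-branch diffusion target
        token_logits = self.text_head(txt_feats)     # text-branch diffusion target
        return velocity_pred, token_logits
```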
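Under the same assumptions, the dual-modality loss could be computed as a flow-matching (velocity) regression term on noised image latents plus a masked-token cross-entropy term on text, weighted by $\lambda$. The interpolation path, masking schedule, and tensor shapes below are illustrative, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def dual_diffusion_loss(model, image_latents, text_tokens, mask_id, lam=0.5):
    """Sketch of the joint objective: L_dual = L_img + lam * L_text."""
    B = image_latents.shape[0]

    # Image branch: rectified-flow style interpolation between data and noise.
    t_img = torch.rand(B, device=image_latents.device)
    noise = torch.randn_like(image_latents)
    t_b = t_img.view(B, *([1] * (image_latents.dim() - 1)))
    x_t = (1.0 - t_b) * image_latents + t_b * noise      # noisy latents at time t
    velocity_target = noise - image_latents              # flow-matching regression target

    # Text branch: absorbing-state (masked) diffusion with a random mask ratio.
    t_txt = torch.rand(B, 1, device=text_tokens.device)
    mask = torch.rand(text_tokens.shape, device=text_tokens.device) < t_txt
    masked_tokens = torch.where(mask, torch.full_like(text_tokens, mask_id), text_tokens)

    velocity_pred, token_logits = model(x_t, t_img, masked_tokens, t_txt.squeeze(1))

    loss_img = F.mse_loss(velocity_pred, velocity_target)
    if mask.any():
        # Supervise only the masked positions with cross-entropy.
        loss_txt = F.cross_entropy(token_logits[mask], text_tokens[mask])
    else:
        loss_txt = token_logits.sum() * 0.0              # degenerate case: nothing masked
    return loss_img + lam * loss_txt
```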
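For the image-to-text mode, one plausible realization of discrete masked-diffusion sampling is iterative unmasking: start from an all-mask caption and progressively commit the most confident token predictions, conditioned on the clean image latents. The confidence-based schedule below is an assumption for illustration only.

```python
import torch

@torch.no_grad()
def sample_caption(model, image_latents, seq_len, mask_id, steps=16):
    """Sketch of image-to-text sampling by iterative unmasking (illustrative)."""
    B, device = image_latents.shape[0], image_latents.device
    tokens = torch.full((B, seq_len), mask_id, dtype=torch.long, device=device)
    t_img = torch.zeros(B, device=device)        # clean image acts purely as conditioning

    for step in range(steps):
        t_txt = torch.full((B,), 1.0 - step / steps, device=device)
        _, logits = model(image_latents, t_img, tokens, t_txt)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)   # per-position confidence and argmax

        still_masked = tokens == mask_id
        if step == steps - 1:
            unmask = still_masked                # last step: fill every remaining position
        else:
            # Reveal a few of the most confident masked positions per step.
            conf = conf.masked_fill(~still_masked, -1.0)
            k = max(1, seq_len // steps)
            idx = conf.topk(k, dim=1).indices
            unmask = torch.zeros_like(still_masked)
            unmask.scatter_(1, idx, True)
            unmask &= still_masked
        tokens = torch.where(unmask, pred, tokens)
    return tokens
```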
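Finally, classifier-free guidance on the image branch follows the standard recipe: query the model once with the text prompt and once with a null (empty) prompt, then extrapolate toward the conditional prediction. The guidance scale and the null-token sequence below are assumptions for illustration.

```python
import torch

@torch.no_grad()
def guided_velocity(model, x_t, t_img, text_tokens, null_tokens, t_txt, scale=4.0):
    """Standard classifier-free guidance applied to the image (velocity) branch."""
    v_cond, _ = model(x_t, t_img, text_tokens, t_txt)    # text-conditioned prediction
    v_uncond, _ = model(x_t, t_img, null_tokens, t_txt)  # unconditional (null-prompt) prediction
    # Extrapolate away from the unconditional prediction toward the conditional one.
    return v_uncond + scale * (v_cond - v_uncond)
```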
The paper highlights the potential of diffusion models as efficient multi-modal models, with the proposed D-DiT model achieving promising results on a range of vision-language tasks.