The paper introduces the Dual Diffusion Transformer (D-DiT), a dual-branch diffusion model that unifies image and text diffusion for both text-to-image (T2I) generation and image-to-text (I2T) tasks under a joint denoising diffusion training loss. This end-to-end cross-modal diffusion model is built on the multimodal diffusion transformer (MM-DiT) architecture and achieves competitive performance on image generation, captioning, and visual question answering, improving on the capabilities of prior multimodal diffusion models.
Here's a more detailed breakdown:
- The paper addresses the limitation that existing diffusion models lag behind autoregressive vision-language models on visual understanding tasks.
- The core idea is a cross-modal maximum likelihood estimation framework that jointly learns the conditional likelihoods of both images and text under a single loss function.
- The D-DiT model is based on the MM-DiT architecture, modified to output diffusion targets for both modalities: continuous latent-space diffusion on the image branch and discrete masked-token diffusion on the text branch (a minimal architecture sketch appears after this list).
- A joint training objective is proposed that combines continuous and discrete diffusion: flow matching is used to learn the conditional distribution of images, and masked diffusion is used to learn the conditional distribution of text. The overall dual-modality training loss is a weighted combination of the single-modality diffusion losses (a training-loss sketch appears after this list):

  $$\mathcal{L}_{\text{dual}} = \mathcal{L}_{\text{img}} + \lambda\,\mathcal{L}_{\text{text}}$$

  where:
  - $\mathcal{L}_{\text{dual}}$ is the dual-modality training loss
  - $\mathcal{L}_{\text{img}}$ is the image diffusion loss
  - $\mathcal{L}_{\text{text}}$ is the text diffusion loss
  - $\lambda$ is a weighting hyperparameter
- Three types of sampling-based inference are introduced: text-to-image generation, image-to-text generation, and image-to-text in-filling (an image-to-text sampling sketch appears after this list).
- Experiments were conducted to evaluate the performance of the proposed model on multi-modal understanding and text-to-image generation tasks. The model was trained in three stages on publicly available datasets.
- For text-to-image generation, classifier-free guidance (CFG) is used (a guidance sketch appears after this list).
- The visual understanding capabilities of D-DiT are evaluated on question answering benchmarks such as VQAv2, VizWiz, OKVQA, GQA, POPE, and MME. As a diffusion-only multi-modal model, D-DiT already achieves performance competitive with recent I2T + T2I models.
- The fine-tuned D-DiT is shown to preserve the text-to-image performance of the original SD3 model and to improve on some metrics, such as color accuracy, after joint training.
- Ablation studies were conducted to analyze the impact of different components and configurations on the model's performance.
- The paper compares D-DiT against other multi-modal models, including I2T-only and I2T + T2I models. The results indicate that D-DiT compares favorably with the latter category.
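The dual-branch architecture described above can be made concrete with a minimal sketch. Everything here is an illustrative assumption rather than the authors' implementation: the MM-DiT-style backbone is treated as a given module that jointly attends over image and text tokens, and two lightweight heads stand in for the continuous (image) and discrete (text) diffusion outputs.

```python
import torch.nn as nn

class DualDiffusionTransformer(nn.Module):
    """Sketch of a dual-branch denoiser; names and shapes are assumptions."""

    def __init__(self, backbone: nn.Module, width: int, latent_dim: int, vocab_size: int):
        super().__init__()
        self.backbone = backbone                        # MM-DiT-like joint image/text transformer
        self.image_head = nn.Linear(width, latent_dim)  # continuous (flow/velocity) target for image latents
        self.text_head = nn.Linear(width, vocab_size)   # categorical logits for masked text tokens

    def forward(self, noisy_image_latents, image_t, masked_text_tokens, text_t):
        # The backbone is assumed to attend jointly over both modalities,
        # conditioned on per-modality diffusion timesteps, and to return
        # per-token features for each branch.
        img_feats, txt_feats = self.backbone(
            noisy_image_latents, masked_text_tokens, image_t, text_t
        )
        velocity_pred = self.image_head(img_feats)   # image-branch diffusion target
        token_logits = self.text_head(txt_feats)     # text-branch diffusion target
        return velocity_pred, token_logits
```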
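Under the same assumptions, the dual-modality loss could be computed as a flow-matching (velocity) regression term on noised image latents plus a masked-token cross-entropy term on text, weighted by $\lambda$. The interpolation path, masking schedule, and tensor shapes below are illustrative, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def dual_diffusion_loss(model, image_latents, text_tokens, mask_id, lam=0.5):
    """Sketch of the joint objective: L_dual = L_img + lam * L_text."""
    B = image_latents.shape[0]

    # Image branch: rectified-flow style interpolation between data and noise.
    t_img = torch.rand(B, device=image_latents.device)
    noise = torch.randn_like(image_latents)
    t_b = t_img.view(B, *([1] * (image_latents.dim() - 1)))
    x_t = (1.0 - t_b) * image_latents + t_b * noise      # noisy latents at time t
    velocity_target = noise - image_latents              # flow-matching regression target

    # Text branch: absorbing-state (masked) diffusion with a random mask ratio.
    t_txt = torch.rand(B, 1, device=text_tokens.device)
    mask = torch.rand(text_tokens.shape, device=text_tokens.device) < t_txt
    masked_tokens = torch.where(mask, torch.full_like(text_tokens, mask_id), text_tokens)

    velocity_pred, token_logits = model(x_t, t_img, masked_tokens, t_txt.squeeze(1))

    loss_img = F.mse_loss(velocity_pred, velocity_target)
    if mask.any():
        # Supervise only the masked positions with cross-entropy.
        loss_txt = F.cross_entropy(token_logits[mask], text_tokens[mask])
    else:
        loss_txt = token_logits.sum() * 0.0              # degenerate case: nothing masked
    return loss_img + lam * loss_txt
```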
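For the image-to-text mode, one plausible realization of discrete masked-diffusion sampling is iterative unmasking: start from an all-mask caption and progressively commit the most confident token predictions, conditioned on the clean image latents. The confidence-based schedule below is an assumption for illustration only.

```python
import torch

@torch.no_grad()
def sample_caption(model, image_latents, seq_len, mask_id, steps=16):
    """Sketch of image-to-text sampling by iterative unmasking (illustrative)."""
    B, device = image_latents.shape[0], image_latents.device
    tokens = torch.full((B, seq_len), mask_id, dtype=torch.long, device=device)
    t_img = torch.zeros(B, device=device)        # clean image acts purely as conditioning

    for step in range(steps):
        t_txt = torch.full((B,), 1.0 - step / steps, device=device)
        _, logits = model(image_latents, t_img, tokens, t_txt)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)   # per-position confidence and argmax

        still_masked = tokens == mask_id
        if step == steps - 1:
            unmask = still_masked                # last step: fill every remaining position
        else:
            # Reveal a few of the most confident masked positions per step.
            conf = conf.masked_fill(~still_masked, -1.0)
            k = max(1, seq_len // steps)
            idx = conf.topk(k, dim=1).indices
            unmask = torch.zeros_like(still_masked)
            unmask.scatter_(1, idx, True)
            unmask &= still_masked
        tokens = torch.where(unmask, pred, tokens)
    return tokens
```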
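Finally, classifier-free guidance on the image branch follows the standard recipe: query the model once with the text prompt and once with a null (empty) prompt, then extrapolate toward the conditional prediction. The guidance scale and the null-token sequence below are assumptions for illustration.

```python
import torch

@torch.no_grad()
def guided_velocity(model, x_t, t_img, text_tokens, null_tokens, t_txt, scale=4.0):
    """Standard classifier-free guidance applied to the image (velocity) branch."""
    v_cond, _ = model(x_t, t_img, text_tokens, t_txt)    # text-conditioned prediction
    v_uncond, _ = model(x_t, t_img, null_tokens, t_txt)  # unconditional (null-prompt) prediction
    # Extrapolate away from the unconditional prediction toward the conditional one.
    return v_uncond + scale * (v_cond - v_uncond)
```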
The paper highlights the potential of diffusion models as efficient multi-modal models, with the proposed D-DiT model achieving promising results on a range of vision-language tasks.