
I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models (2502.10458v1)

Published 12 Feb 2025 in cs.LG and cs.AI

Abstract: This paper presents ThinkDiff, a novel alignment paradigm that empowers text-to-image diffusion models with multimodal in-context understanding and reasoning capabilities by integrating the strengths of vision-language models (VLMs). Existing multimodal diffusion finetuning methods largely focus on pixel-level reconstruction rather than in-context reasoning, and are constrained by the complexity and limited availability of reasoning-based datasets. ThinkDiff addresses these challenges by leveraging vision-language training as a proxy task, aligning VLMs with the decoder of an encoder-decoder LLM instead of a diffusion decoder. This proxy task builds on the observation that the $\textbf{LLM decoder}$ shares the same input feature space with $\textbf{diffusion decoders}$ that use the corresponding $\textbf{LLM encoder}$ for prompt embedding. As a result, aligning VLMs with diffusion decoders can be simplified through alignment with the LLM decoder. Without complex training and datasets, ThinkDiff effectively unleashes understanding, reasoning, and composing capabilities in diffusion models. Experiments demonstrate that ThinkDiff significantly improves accuracy from 19.2% to 46.3% on the challenging CoBSAT benchmark for multimodal in-context reasoning generation, with only 5 hours of training on 4 A100 GPUs. Additionally, ThinkDiff demonstrates exceptional performance in composing multiple images and texts into logically coherent images. Project page: https://mizhenxing.github.io/ThinkDiff.

Summary

  • The paper introduces ThinkDiff, which transfers reasoning abilities from vision-language models to diffusion decoders via a vision-language alignment proxy task.
  • It introduces two variants, ThinkDiff-LVLM and ThinkDiff-CLIP; the LVLM variant boosts reasoning accuracy on the CoBSAT benchmark from 19.2% to 46.3%.
  • The approach simplifies training processes while expanding diffusion model applications for complex text-to-image generation with multimodal inputs.

Introduction

The paper "I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models" (2502.10458) introduces ThinkDiff, a paradigm designed to enhance the capabilities of text-to-image diffusion models by integrating multimodal in-context reasoning using Vision-LLMs (VLMs). Traditional diffusion models excel at generating high-quality images from explicit prompts but lack reasoning capabilities. ThinkDiff addresses this gap by aligning VLMs with the decoders of encoder-decoder LLMs, simplifying the training process and enhancing the reasoning ability of diffusion models. Figure 1

Figure 1: Reconstruction-based diffusion finetuning focuses on pixel-level reconstruction while ThinkDiff transfers reasoning capabilities from VLMs to a diffusion decoder.

Methodology

ThinkDiff employs an alignment framework that exploits a shared input feature space: the LLM decoder and the diffusion decoder both consume prompt embeddings produced by the same LLM encoder. By aligning the VLM with the LLM decoder through vision-language training, ThinkDiff lets diffusion models inherit the reasoning capabilities of the VLM.
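
To make the shared-feature-space idea concrete, here is a minimal PyTorch sketch of the two conditioning paths. It is an illustration only: the aligner architecture, the feature dimensions, and the `llm_decoder` / `diffusion_decoder` calls named in the comments are hypothetical placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Aligner(nn.Module):
    """Hypothetical aligner: maps VLM token features into the prompt-embedding
    space shared by the LLM decoder and the diffusion decoder."""
    def __init__(self, vlm_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vlm_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vlm_tokens: torch.Tensor) -> torch.Tensor:
        # vlm_tokens: (batch, seq_len, vlm_dim) -> (batch, seq_len, llm_dim)
        return self.proj(vlm_tokens)

aligner = Aligner()
vlm_tokens = torch.randn(2, 16, 1024)      # stand-in for VLM output features
prompt_embeds = aligner(vlm_tokens)        # now in the LLM-encoder feature space

# Training (proxy task): condition the LLM *decoder* on the aligned features and
# supervise with an ordinary text-reconstruction loss, e.g.
#   loss = llm_decoder(inputs_embeds=prompt_embeds, labels=caption_ids).loss
# Inference: feed the same aligned features to the diffusion *decoder*, which
# already consumes prompt embeddings from the corresponding LLM encoder:
#   image = diffusion_decoder(prompt_embeds=prompt_embeds)
```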

Network Architecture

The core components of ThinkDiff include:

  • Source VLM: Encodes multimodal inputs and captures reasoning information.
  • Aligner Network: Transforms VLM features to be interpretable by the diffusion decoder.
  • Decoder: Utilizes an LLM decoder in training and a diffusion decoder in inference.

ThinkDiff supports two variants:

  • ThinkDiff-LVLM: Uses a large vision-language model (LVLM) for advanced multimodal reasoning. It adopts a random masking strategy during training to prevent "shortcut mapping" (see the masking sketch after this list).
  • ThinkDiff-CLIP: Utilizes the vision encoder of a CLIP model, mapping semantically rich image features to diffusion decoders (Figure 2).

    Figure 2: Several diffusion models share a language encoder with LLMs, allowing alignment with diffusion decoders through LLM decoders.
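
The paper states only that random masking is applied to the LVLM's output tokens during training to prevent shortcut mapping; the snippet below is one plausible sketch of such a step, with the masking ratio and the zero-out strategy assumed rather than taken from the paper.

```python
import torch

def random_mask_tokens(tokens: torch.Tensor, mask_ratio: float = 0.5) -> torch.Tensor:
    """Randomly zero out a fraction of the LVLM output tokens before they reach
    the aligner, so it cannot learn a trivial copy ("shortcut") mapping."""
    batch, seq_len, _ = tokens.shape
    keep = (torch.rand(batch, seq_len, device=tokens.device) > mask_ratio).float()
    return tokens * keep.unsqueeze(-1)

lvlm_tokens = torch.randn(2, 16, 1024)     # stand-in for LVLM-generated token features
masked_tokens = random_mask_tokens(lvlm_tokens, mask_ratio=0.5)
```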

Experimental Results

Experiments show that ThinkDiff substantially improves multimodal in-context reasoning on the CoBSAT benchmark, raising accuracy from 19.2% to 46.3% with only about 5 hours of training on 4 A100 GPUs.

ThinkDiff-LVLM

  • 2-shot and 4-shot Evaluations: ThinkDiff-LVLM demonstrates superior performance across most tasks on the CoBSAT benchmark, outperforming existing methods by a wide margin in reasoning accuracy (Figure 3).

    Figure 3: Training framework of ThinkDiff-LVLM and ThinkDiff-CLIP, illustrating how multimodal inputs are processed to generate coherent outputs.

ThinkDiff-CLIP

  • Multimodal Input Handling: ThinkDiff-CLIP integrates features from multiple images and text prompts, generating semantically coherent images even from complex multimodal inputs (Figure 4). A sketch of this composition step follows the figure caption below.

    Figure 4: Generation results for single image and text inputs. ThinkDiff effectively integrates semantic details from both modalities.
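
As a rough illustration of the composition step, the sketch below concatenates aligned image tokens and text prompt embeddings into a single conditioning sequence for the diffusion decoder. Shapes, dimensions, and the concatenation order are assumptions for illustration, not the paper's exact recipe.

```python
import torch

# Aligned CLIP token features for two input images (shapes assumed for illustration).
aligned_img_a = torch.randn(1, 16, 4096)
aligned_img_b = torch.randn(1, 16, 4096)
# Prompt embeddings for the text input, e.g. from the diffusion model's LLM encoder.
text_embeds = torch.randn(1, 32, 4096)

# Compose the multimodal context by concatenating all conditioning tokens
# along the sequence dimension before passing them to the diffusion decoder.
prompt_embeds = torch.cat([aligned_img_a, aligned_img_b, text_embeds], dim=1)
# image = diffusion_decoder(prompt_embeds=prompt_embeds)   # decoder call not shown
```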

Discussion

ThinkDiff demonstrates a novel approach to aligning VLMs with diffusion models by using a proxy task with LLM decoders, efficiently transferring reasoning capabilities without extensive datasets or complex training. The model's streamlined training requirements and robust reasoning performance suggest potential for broader applications in AI, such as more sophisticated visual tasks requiring logical inference.

Although successful, limitations remain, particularly when handling certain complex reasoning scenarios. Future work could aim to refine the alignment process and extend the framework to include more modalities, such as audio or video, potentially paving the way for universal multimodal generative models.

Conclusion

ThinkDiff effectively empowers diffusion models with the ability to perform in-context reasoning, bridging a significant gap in the capabilities of text-to-image diffusion models. By leveraging VLMs, ThinkDiff simplifies the training process and achieves state-of-the-art performance on logic-intensive tasks, thereby expanding the potential applications of diffusion models in AI.
