Multimodal Representation Alignment for Image Generation
The paper advances text-to-image generation by proposing Dream Engine, a framework that tackles the difficulty of merging text and image inputs in generative models. This work builds on existing diffusion models, notably those leveraging text encoders such as CLIP and T5, by integrating Large Multimodal Models (LMMs) to achieve seamless text-image interleaved control. The research contends that LMMs provide a unified representation space for aligning text and image modalities, enabling more nuanced and controllable image generation.
Key Contributions and Methodology
- Framework Overview: Dream Engine departs from traditional text-to-image models, which are driven primarily by text prompts. Instead, it bridges multimodal encoders with diffusion models to support diverse text-image inputs and finer control over generated images. This is achieved by replacing the classical text-only encoders with multimodal encoders such as QwenVL and connecting them to the diffusion backbone of models like Stable Diffusion 3.5.
- Model Architecture: The system employs a diffusion transformer (DiT), conditioned by a multimodal model to better capture and represent the complexities of interpreting combined text-image inputs. A pivotal component is an adapter layer, a straightforward multi-layer perceptron (MLP), that aligns the multimodal encoder's outputs with the diffusion model's latent space; a minimal sketch of this adapter appears after this list.
- Training Protocol: The authors propose a two-stage training paradigm to tune the integrated model efficiently. The first stage aligns text and image representations by freezing the weights of the LMM and DiT modules and optimizing only the adapter. The second stage refines the DiT with parameter-efficient tuning (LoRA), further accommodating the intricate control scenarios made possible by interleaved inputs; this two-stage schedule is also sketched after the list.
- Innovative Capabilities: A concept-to-detail progression emerges during training: the model first grasps the general concepts conveyed by the combined inputs and then refines finer details as training progresses. This characteristic underscores the effective integration of distinct multimodal elements, supporting complex tasks such as object-driven generation and environment-specific object placement.
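To make the adapter idea concrete, here is a minimal sketch, assuming a PyTorch setting, of how hidden states from a multimodal encoder such as QwenVL could be projected into the conditioning space of a diffusion transformer. The module name, hidden dimensions, and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultimodalAdapter(nn.Module):
    """MLP that maps LMM hidden states into the DiT's conditioning space.

    Dimensions are placeholders; the real widths depend on the specific
    QwenVL variant and the SD3.5 diffusion transformer.
    """
    def __init__(self, lmm_dim: int = 3584, cond_dim: int = 4096, hidden_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(lmm_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, cond_dim),
        )

    def forward(self, lmm_hidden_states: torch.Tensor) -> torch.Tensor:
        # lmm_hidden_states: (batch, seq_len, lmm_dim) from the multimodal encoder
        return self.proj(lmm_hidden_states)  # (batch, seq_len, cond_dim) for the DiT

# Usage sketch: encode an interleaved text-image prompt with the LMM,
# project it, and feed the result to the DiT in place of CLIP/T5 embeddings.
adapter = MultimodalAdapter()
dummy_lmm_output = torch.randn(1, 128, 3584)  # stand-in for QwenVL hidden states
conditioning = adapter(dummy_lmm_output)
```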
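The two-stage protocol can likewise be sketched under stated assumptions: stage one freezes the LMM and DiT and trains only the adapter, while stage two adds low-rank (LoRA-style) updates to the DiT's linear layers. The helper names, the rank and scaling values, and keeping the adapter trainable in stage two are assumptions of this sketch rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear with a trainable low-rank residual (Wx + scale * B A x)."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 32.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

def freeze(module: nn.Module) -> None:
    for p in module.parameters():
        p.requires_grad = False

def add_lora(module: nn.Module, rank: int = 16) -> None:
    # Recursively wrap nn.Linear layers; a real setup would likely target only
    # the attention projections, wrapping everything keeps the sketch short.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, rank=rank))
        else:
            add_lora(child, rank)

def stage_one_params(lmm: nn.Module, dit: nn.Module, adapter: nn.Module) -> list:
    """Stage 1: align representations by training only the adapter."""
    freeze(lmm)
    freeze(dit)
    return list(adapter.parameters())

def stage_two_params(dit: nn.Module, adapter: nn.Module) -> list:
    """Stage 2: add LoRA updates to the frozen DiT (adapter kept trainable here by assumption)."""
    add_lora(dit)
    lora_params = [p for p in dit.parameters() if p.requires_grad]
    return lora_params + list(adapter.parameters())
```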
Experimental Results
The model demonstrates competitive performance on the GenEval benchmark, achieving an overall score of 0.69, close to leading models such as SD3.5 and highlighting Dream Engine's ability to handle complex text-image interactions without sacrificing quality. Moreover, the model reconstructs images more consistently than peers such as SeedTokenizer and Emu-2, suggesting successful alignment of text and image representations with minimal training data.
Implications and Future Directions
The implications of this research are twofold: it paves the way for more sophisticated interaction between textual and visual modalities in generative models, and it gives users finer control in creative and content-generation applications. By demonstrating that LMMs can effectively replace conventional text encoders, it opens avenues for future multimodal models to incorporate more dynamic elements, potentially extending to video or 3D content generation.
Looking forward, refining the interaction between LMMs and visual modalities without losing fidelity or creative flexibility remains an exciting challenge. Extending the capabilities demonstrated by Dream Engine to broader contexts could significantly influence the development of more intelligent, adaptable AI systems in creative fields and beyond.