- The paper presents MUMU, a model that pairs a vision-language model with an SDXL-based diffusion decoder so that prompts can interleave text and reference images, improving control and precision in image generation.
- The authors bootstrap a multimodal dataset by aligning image crops with the words of existing text-to-image captions using open-vocabulary object detection.
- Experimental results show improved detail preservation, promising style transfer, and effective harmonization of conditioning images drawn from diverse sources.
MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
The paper "MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data" presents an innovative approach to multimodal image generation, leveraging the strengths of both vision-LLMs (VLMs) and advanced diffusion models. This paper introduces MUMU, a new model capable of interpreting and synthesizing multimodal prompts that combine text and image inputs. The principal contributions of this paper include the generation of a novel multimodal dataset and the development of an architecture that seamlessly integrates VLMs with state-of-the-art diffusion techniques.
Overview
The motivation behind MUMU stems from the limitations of text-only prompts, which often fail to capture user intent precisely in text-to-image generation. By incorporating images directly into the prompt, MUMU allows significantly greater control and specificity over the generation process. The model is particularly adept at transferring styles and maintaining character consistency across varied prompts, offering a substantial improvement in user-directed image synthesis.
Methodology
MUMU's architecture combines SDXL, a state-of-the-art diffusion model, with Idefics2, a capable vision-language model. The authors replaced SDXL's CLIP text encoders with Idefics2 so that the model can condition on interleaved text-and-image prompts. The multimodal training data was bootstrapped from existing text-to-image data using open-vocabulary object detection to extract image crops aligned with words in the captions, resulting in a structured, semantically rich training set.
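As a rough sketch of this bootstrapping step, the snippet below pairs caption phrases with image crops using an off-the-shelf open-vocabulary detector. OWL-ViT, the detection threshold, and the caller-supplied phrase list are illustrative assumptions; the paper's actual detector and filtering rules may differ.

```python
# Sketch of the dataset-bootstrapping idea: detect caption phrases in the
# image and keep the crops so they can be interleaved with the text prompt.
# OWL-ViT is used here purely as an example open-vocabulary detector.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
detector = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

def bootstrap_multimodal_prompt(image: Image.Image, phrases: list[str],
                                threshold: float = 0.3):
    """Return (phrase, crop) pairs for caption phrases found in the image."""
    inputs = processor(text=[phrases], images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = detector(**inputs)

    # Convert raw logits/boxes to absolute-coordinate detections.
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs, threshold=threshold, target_sizes=target_sizes)[0]

    crops = []
    for label, box in zip(results["labels"], results["boxes"]):
        x0, y0, x1, y1 = [int(v) for v in box.tolist()]
        crops.append((phrases[label], image.crop((x0, y0, x1, y1))))
    # A multimodal training example is then the caption with each matched
    # phrase followed by its crop, e.g. "a <man crop> riding a <horse crop>".
    return crops
```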
Key innovations include:
- Multimodal Dataset Construction: Using both synthetic and real image data, the authors built a multimodal dataset by augmenting text prompts with semantically matching image crops, yielding a varied and semantically grounded training set.
- Model Architecture: MUMU combines Idefics2's vision transformer with SDXL's diffusion capabilities. By removing Idefics2's perceiver pooling transformer, which compresses each image into a small fixed number of tokens, the authors pass many more tokens per image to the diffusion decoder, improving detail preservation and image quality (a conceptual sketch of this conditioning path appears after this list).
- Training Regimen: The model was fine-tuned in two stages on an 8xH100 GPU node, combining LoRA and full fine-tuning to optimize performance while managing computational resources.
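To make the integration more concrete, here is a minimal sketch of how a VLM's hidden states could be projected into the conditioning space that SDXL's UNet expects in place of its CLIP text-encoder outputs. The class name, dimensions, and pooling choice are assumptions made for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class VLMToSDXLAdapter(nn.Module):
    """Illustrative adapter: map VLM hidden states to SDXL conditioning inputs.

    Dimensions are assumptions for the sketch: Idefics2-class models expose
    roughly 4096-d hidden states, while SDXL's UNet cross-attends over 2048-d
    context vectors and also takes a pooled embedding.
    """
    def __init__(self, vlm_dim: int = 4096, unet_ctx_dim: int = 2048,
                 pooled_dim: int = 1280):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vlm_dim, unet_ctx_dim),
            nn.LayerNorm(unet_ctx_dim),
        )
        self.pooled_proj = nn.Linear(vlm_dim, pooled_dim)

    def forward(self, vlm_hidden: torch.Tensor):
        # vlm_hidden: (batch, seq_len, vlm_dim) hidden states of the
        # interleaved text-and-image prompt produced by the VLM.
        context = self.proj(vlm_hidden)                     # per-token conditioning
        pooled = self.pooled_proj(vlm_hidden.mean(dim=1))   # pooled conditioning
        return context, pooled
```

At generation time, `context` would play the role of the `encoder_hidden_states` the UNet normally receives from the CLIP text encoders; keeping the full per-token sequence rather than pooling it through a perceiver resampler is what lets more image tokens reach the diffusion decoder.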
Key Findings
The paper's experimental results demonstrate several significant findings:
- Detail Preservation: Increasing the number of tokens per image notably improves the model's ability to preserve details in the generated images.
- Harmonization Ability: MUMU effectively integrates conditioning images from diverse sources into cohesive outputs, demonstrating robust harmonization capabilities.
- Style Transfer: The model exhibits a promising capacity for style transfer, though rendering human faces in abstract styles remains a challenge.
- Compatibility with Community Fine-Tunes: MUMU can be combined with community-created SDXL fine-tunes without further specialized training, indicating the model's versatility and adaptability (a hypothetical loading sketch follows this list).
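As a hypothetical illustration of that compatibility, a community SDXL fine-tune's UNet could be swapped into a diffusers pipeline while the multimodal prompt encoder keeps supplying the conditioning. The checkpoint id below is a placeholder and the wiring is an assumption, not the paper's released code.

```python
# Hypothetical sketch: reuse a community SDXL fine-tune's UNet weights while
# keeping the multimodal (VLM + adapter) conditioning stack.
from diffusers import UNet2DConditionModel, StableDiffusionXLPipeline

# "some-org/community-sdxl-finetune" is a placeholder model id.
community_unet = UNet2DConditionModel.from_pretrained(
    "some-org/community-sdxl-finetune", subfolder="unet"
)

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", unet=community_unet
)
# The multimodal prompt encoder would then supply the conditioning in place
# of the CLIP text-encoder outputs, e.g.:
# images = pipe(prompt_embeds=context, pooled_prompt_embeds=pooled).images
```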
Implications and Future Directions
The practical implications of MUMU are substantial, enabling more nuanced and directed image generation for a wide range of applications, from creative industries to user-generated content platforms. Theoretically, this research pushes the boundaries of multimodal learning by showcasing the potential of VLMs as general-purpose controllers for diffusion-based image generation.
Future research could explore several avenues:
- Scaling: Full fine-tuning without reliance on LoRA, together with larger datasets, could further enhance model performance.
- Enhanced Tokenization: Developing more sophisticated image tokenization methods tailored for generation tasks may improve fine detail preservation.
- Broader Multimodal Inputs: Expanding the breadth of multimodal inputs to include other data types, such as spatial information or more complex object interactions, could further refine model outputs and usability.
In conclusion, the MUMU framework represents a pivotal step towards more controlled and expressive generative AI, demonstrating the feasibility and benefits of multimodal prompting in image generation tasks. The findings and methodologies proposed in this paper lay a strong foundation for future advancements in the field.