Instruct-Imagen: Image Generation with Multi-modal Instruction (2401.01952v1)

Published 3 Jan 2024 in cs.CV, cs.AI, and cs.CL

Abstract: This paper presents instruct-imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce multi-modal instruction for image generation, a task representation articulating a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, subject, etc.), such that abundant generation intents can be standardized in a uniform format. We then build instruct-imagen by fine-tuning a pre-trained text-to-image diffusion model with a two-stage framework. First, we adapt the model using retrieval-augmented training to enhance the model's ability to ground its generation in external multimodal context. Subsequently, we fine-tune the adapted model on diverse image generation tasks that require vision-language understanding (e.g., subject-driven generation, etc.), each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks.

Authors (12)
  1. Hexiang Hu (48 papers)
  2. Kelvin C. K. Chan (34 papers)
  3. Yu-Chuan Su (22 papers)
  4. Wenhu Chen (134 papers)
  5. Yandong Li (38 papers)
  6. Kihyuk Sohn (54 papers)
  7. Yang Zhao (382 papers)
  8. Xue Ben (3 papers)
  9. Boqing Gong (100 papers)
  10. William Cohen (11 papers)
  11. Ming-Wei Chang (44 papers)
  12. Xuhui Jia (22 papers)
Citations (26)

Summary

  • The paper introduces a multi-modal instruction framework that integrates text, edge, style, and subject cues for versatile image generation.
  • It employs a cascaded diffusion model with cross-attention over the instruction, trained in two stages: retrieval-augmented adaptation followed by multi-modal instruction fine-tuning.
  • Human evaluation shows that Instruct-Imagen matches or surpasses prior task-specific models in-domain and generalizes to unseen, more complex tasks.

Introduction

The proposed model, Instruct-Imagen, offers a new approach to heterogeneous image generation, generalizing across a wide range of generation intents and extending to tasks unseen during training. It stands apart by leveraging a multi-modal instruction framework that uses natural language to tie together different modalities (such as text, edge, style, and subject), bundling them into a coherent format the model can interpret and execute with precision.
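The paper represents each task as a natural-language instruction that refers to accompanying condition images by name. The concrete schema is not given in this summary; the following Python sketch illustrates one plausible representation, with all field and variable names being illustrative assumptions rather than the paper's actual data format.

```python
from dataclasses import dataclass, field

@dataclass
class ContextEntry:
    """One multi-modal condition referenced by name from the instruction text."""
    modality: str          # e.g. "subject", "style", "edge", "mask"
    payload: bytes         # raw image bytes (or a precomputed embedding)
    caption: str = ""      # optional text describing the condition

@dataclass
class MultiModalInstruction:
    """A natural-language instruction plus the conditions it refers to."""
    text: str
    context: dict[str, ContextEntry] = field(default_factory=dict)

# Example: subject-driven generation rendered in a reference style.
instruction = MultiModalInstruction(
    text="Generate an image of [subject] riding a bicycle, in the style of [style image].",
    context={
        "subject": ContextEntry("subject", payload=b"<subject image bytes>"),
        "style image": ContextEntry("style", payload=b"<style image bytes>"),
    },
)
```

Because the instruction text names each condition explicitly, new task combinations (for example a mask plus a style reference) can be expressed without changing the model interface.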

Multi-modal Instructions for Generation

Instruct-Imagen is built on the idea of folding various modalities into a unified instruction format. The first training stage uses a retrieval-augmented approach that retrieves relevant image-text pairs from a large corpus, improving the model's grounding in external multi-modal context. This step primes the model for the subsequent fine-tuning stage, where it learns to process instructions that combine multiple modes of information. The instruction paradigm put forth in this paper provides an intuitive interface for specifying image generation tasks that previously required specialized, ad-hoc designs.
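The summary does not spell out how neighboring image-text pairs are retrieved; a minimal sketch, assuming precomputed CLIP-style embeddings and cosine similarity, could look like this (array shapes and the helper name are assumptions):

```python
import numpy as np

def retrieve_neighbors(query_emb: np.ndarray,
                       corpus_embs: np.ndarray,
                       k: int = 4) -> np.ndarray:
    """Return indices of the k corpus items most similar to the query.

    query_emb:   (d,) embedding of the target caption or image.
    corpus_embs: (n, d) embeddings of candidate image-text pairs.
    """
    # Cosine similarity via normalized dot products.
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims)[:k]

# Toy usage with random stand-ins for embedded image-text pairs.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 512))
query = rng.normal(size=512)
print(retrieve_neighbors(query, corpus, k=4))
```

The retrieved pairs would then be attached to each training example as multi-modal context, so the diffusion model learns to condition on retrieved images rather than on text alone.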

Instruct-Imagen Model and Training

At the core of Instruct-Imagen lies a cascaded diffusion model whose architecture is augmented with cross-attention layers tailored to multi-modal instructions. In the first stage, the model is trained to generate images from text while attending to neighboring image-text pairs drawn from web-scale clusters. In the second stage, it is fine-tuned on a diverse set of image generation tasks, each paired with its corresponding multi-modal instruction. This two-stage process is what allows the model to internalize and execute the nuanced multi-modal instructions it receives.
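The conditioning mechanism is described here only at a high level; the PyTorch sketch below shows how a cross-attention block could let U-Net feature tokens attend to encoded instruction and condition-image tokens. The dimensions, layer names, and residual structure are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class InstructionCrossAttention(nn.Module):
    """Cross-attention block: U-Net features attend to multi-modal instruction tokens."""

    def __init__(self, dim: int = 512, ctx_dim: int = 768, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.to_kv = nn.Linear(ctx_dim, dim * 2)   # project instruction/context tokens
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # x:   (B, N, dim)      flattened U-Net feature tokens
        # ctx: (B, M, ctx_dim)  encoded instruction text + condition-image tokens
        k, v = self.to_kv(ctx).chunk(2, dim=-1)
        out, _ = self.attn(self.norm(x), k, v)
        return x + out                              # residual connection

# Example shapes: a 64x64 feature map flattened to 4096 tokens, 77 instruction tokens.
x = torch.randn(2, 4096, 512)
ctx = torch.randn(2, 77, 768)
block = InstructionCrossAttention()
print(block(x, ctx).shape)   # torch.Size([2, 4096, 512])
```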

Evaluation and Contributions

Instruct-Imagen demonstrates its strengths through extensive human evaluations, where it consistently meets or exceeds the performance of prior models within their own domains and shows strong generalization to unseen, more complex tasks. The contributions are threefold: multi-modal instruction as a new task representation, a training framework for adapting pre-trained models to follow such instructions, and a single unified model that handles a wide array of heterogeneous image generation tasks and, importantly, extends to new tasks without domain-specific design changes.

The evaluation covers both in-domain tasks and zero-shot settings, with human evaluators benchmarking the model across different scenarios. By maintaining consistent quality and alignment with user intent, Instruct-Imagen sets a strong reference point for instruction-driven image generation. The authors plan to make their evaluation suite publicly available, fostering further research and comparison within the community.
