Instruct-Imagen: Image Generation with Multi-modal Instruction (2401.01952v1)

Published 3 Jan 2024 in cs.CV, cs.AI, and cs.CL

Abstract: This paper presents instruct-imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce multi-modal instruction for image generation, a task representation articulating a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, subject, etc.), such that abundant generation intents can be standardized in a uniform format. We then build instruct-imagen by fine-tuning a pre-trained text-to-image diffusion model with a two-stage framework. First, we adapt the model using retrieval-augmented training to enhance the model's ability to ground its generation in external multimodal context. Subsequently, we fine-tune the adapted model on diverse image generation tasks that require vision-language understanding (e.g., subject-driven generation, etc.), each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks.

Authors (12)
  1. Hexiang Hu (48 papers)
  2. Kelvin C. K. Chan (34 papers)
  3. Yu-Chuan Su (22 papers)
  4. Wenhu Chen (134 papers)
  5. Yandong Li (38 papers)
  6. Kihyuk Sohn (54 papers)
  7. Yang Zhao (382 papers)
  8. Xue Ben (3 papers)
  9. Boqing Gong (100 papers)
  10. William Cohen (11 papers)
  11. Ming-Wei Chang (44 papers)
  12. Xuhui Jia (22 papers)
Citations (26)

Summary

  • The paper introduces a multi-modal instruction framework that integrates text, edge, style, and subject cues for versatile image generation.
  • It employs a cascaded diffusion model with cross-attention over the instruction, trained in two stages: retrieval-augmented adaptation followed by multi-modal instruction fine-tuning.
  • Human evaluation shows that Instruct-Imagen matches or surpasses prior task-specific models in-domain and generalizes to unseen, more complex tasks.

Introduction

The proposed model, Instruct-Imagen, offers a new approach to heterogeneous image generation, generalizing across a wide range of generation intents and extending to tasks unseen during training. It stands apart by leveraging a multi-modal instruction framework that uses natural language to tie together different modalities (such as text, edge, style, and subject), bundling them into a coherent format the model can interpret and execute with precision.
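The paper represents each task as a natural-language instruction that refers to accompanying condition images by name. The concrete schema is not given in this summary; the following Python sketch illustrates one plausible representation, with all field and variable names being illustrative assumptions rather than the paper's actual data format.

```python
from dataclasses import dataclass, field

@dataclass
class ContextEntry:
    """One multi-modal condition referenced by name from the instruction text."""
    modality: str          # e.g. "subject", "style", "edge", "mask"
    payload: bytes         # raw image bytes (or a precomputed embedding)
    caption: str = ""      # optional text describing the condition

@dataclass
class MultiModalInstruction:
    """A natural-language instruction plus the conditions it refers to."""
    text: str
    context: dict[str, ContextEntry] = field(default_factory=dict)

# Example: subject-driven generation rendered in a reference style.
instruction = MultiModalInstruction(
    text="Generate an image of [subject] riding a bicycle, in the style of [style image].",
    context={
        "subject": ContextEntry("subject", payload=b"<subject image bytes>"),
        "style image": ContextEntry("style", payload=b"<style image bytes>"),
    },
)
```

Because the instruction text names each condition explicitly, new task combinations (for example a mask plus a style reference) can be expressed without changing the model interface.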

Multi-modal Instructions for Generation

Instruct-Imagen is built on the idea of folding various modalities into a unified instruction format. The first training stage uses a retrieval-augmented approach that retrieves relevant image-text pairs from a large corpus, improving the model's grounding in external multi-modal context. This step primes the model for the subsequent fine-tuning stage, where it learns to process instructions that combine multiple modes of information. The instruction paradigm put forth in this paper provides an intuitive interface for specifying image generation tasks that previously required specialized, ad-hoc designs.
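The summary does not spell out how neighboring image-text pairs are retrieved; a minimal sketch, assuming precomputed CLIP-style embeddings and cosine similarity, could look like this (array shapes and the helper name are assumptions):

```python
import numpy as np

def retrieve_neighbors(query_emb: np.ndarray,
                       corpus_embs: np.ndarray,
                       k: int = 4) -> np.ndarray:
    """Return indices of the k corpus items most similar to the query.

    query_emb:   (d,) embedding of the target caption or image.
    corpus_embs: (n, d) embeddings of candidate image-text pairs.
    """
    # Cosine similarity via normalized dot products.
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims)[:k]

# Toy usage with random stand-ins for embedded image-text pairs.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 512))
query = rng.normal(size=512)
print(retrieve_neighbors(query, corpus, k=4))
```

The retrieved pairs would then be attached to each training example as multi-modal context, so the diffusion model learns to condition on retrieved images rather than on text alone.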

Instruct-Imagen Model and Training

At the core of Instruct-Imagen lies a cascaded diffusion model whose architecture is augmented with cross-attention layers tailored to multi-modal instructions. In the first stage, the model is trained to generate images from text while attending to neighboring image-text pairs drawn from web-scale clusters. In the second stage, it is fine-tuned on a diverse set of image generation tasks, each paired with its corresponding multi-modal instruction. This two-stage process is what allows the model to internalize and execute the nuanced multi-modal instructions it receives.
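The conditioning mechanism is described here only at a high level; the PyTorch sketch below shows how a cross-attention block could let U-Net feature tokens attend to encoded instruction and condition-image tokens. The dimensions, layer names, and residual structure are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class InstructionCrossAttention(nn.Module):
    """Cross-attention block: U-Net features attend to multi-modal instruction tokens."""

    def __init__(self, dim: int = 512, ctx_dim: int = 768, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.to_kv = nn.Linear(ctx_dim, dim * 2)   # project instruction/context tokens
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, ctx: torch.Tensor) -> torch.Tensor:
        # x:   (B, N, dim)      flattened U-Net feature tokens
        # ctx: (B, M, ctx_dim)  encoded instruction text + condition-image tokens
        k, v = self.to_kv(ctx).chunk(2, dim=-1)
        out, _ = self.attn(self.norm(x), k, v)
        return x + out                              # residual connection

# Example shapes: a 64x64 feature map flattened to 4096 tokens, 77 instruction tokens.
x = torch.randn(2, 4096, 512)
ctx = torch.randn(2, 77, 768)
block = InstructionCrossAttention()
print(block(x, ctx).shape)   # torch.Size([2, 4096, 512])
```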

Evaluation and Contributions

Instruct-Imagen demonstrates its strengths through extensive human evaluations, where it consistently meets or exceeds the performance of prior models within their own domains and shows strong generalization to unseen, more complex tasks. The contributions are threefold: multi-modal instruction as a new task representation, a training framework for adapting pre-trained models to follow such instructions, and a single unified model that handles a wide array of heterogeneous image generation tasks and, importantly, extends to new tasks without domain-specific design changes.

The evaluation covers both in-domain tasks and zero-shot settings, with human evaluators benchmarking the model across different scenarios. By maintaining consistent quality and alignment with user intent, Instruct-Imagen sets a strong reference point for instruction-driven image generation. The authors plan to make their evaluation suite publicly available, fostering further research and comparison within the community.
